FreeDiscovery Features

A fully functional web service that hides the details of calling a complex machine learning library to perform the following tasks:


Document categorization

Put a set of documents into categories such as “medical” or “legal.” If you already have some documents categorized, you can use FreeDiscovery to teach the machine to categorize new ones.

Document clustering

Take a set of documents and put them into natural clusters; FreeDiscovery will organize them into lumps of related documents. Many clustering algorithms require you to figure out in advance how many clusters exist. FreeDiscovery doesn’t; it simply clusters the collection and tries to come up with a logical name for each cluster. You can use this if you get a big collection and you want to know essentially what’s in it.

Duplicate detection

If you have a group of documents with some duplicates, duplicate detection will identify them. It does so in a smart way: it also catches near-duplicates, so two documents do not have to match 100% to be flagged.
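
To make the idea of a less-than-100% duplicate concrete, here is a minimal sketch that compares documents by the overlap of their three-word sequences ("shingles"). The Jaccard measure, the 0.5 threshold, and the function names are assumptions chosen for illustration; this is not necessarily the method FreeDiscovery uses.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    # Texts shorter than `size` yield a single shingle with all their words.
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Shared shingles divided by all shingles: 1.0 means identical sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def near_duplicates(docs, threshold=0.5):
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(docs[i], docs[j]) >= threshold
    ]


docs = [
    "please review the attached contract before friday",
    "please review the attached contract before monday",
    "the quarterly medical report is ready",
]
print(near_duplicates(docs))  # [(0, 1)]
```

The first two documents differ by one word, share four of six distinct shingles, and are therefore flagged; the third shares none.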


E-mail threading

If you have a group of e-mails that includes various conversations, this algorithm will find the conversations.
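
One common way to recover conversations is to follow each message's reply reference back to the message that started the thread. The sketch below assumes every e-mail is a dictionary with hypothetical `id` and `in_reply_to` fields (standing in for the Message-ID and In-Reply-To headers); it illustrates the idea and is not FreeDiscovery's actual implementation.

```python
from collections import defaultdict


def find_root(msg_id, parent_of):
    """Follow reply links upward until a message with no known parent."""
    seen = set()  # guards against malformed reply loops
    while parent_of.get(msg_id) is not None and msg_id not in seen:
        seen.add(msg_id)
        msg_id = parent_of[msg_id]
    return msg_id


def thread_emails(emails):
    """Group message ids by the root message of their conversation."""
    parent_of = {e["id"]: e.get("in_reply_to") for e in emails}
    threads = defaultdict(list)
    for e in emails:
        threads[find_root(e["id"], parent_of)].append(e["id"])
    return dict(threads)


emails = [
    {"id": "m1", "in_reply_to": None},
    {"id": "m2", "in_reply_to": "m1"},
    {"id": "m3", "in_reply_to": "m2"},
    {"id": "m4", "in_reply_to": None},
    {"id": "m5", "in_reply_to": "m4"},
]
print(thread_emails(emails))
# {'m1': ['m1', 'm2', 'm3'], 'm4': ['m4', 'm5']}
```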

Algorithms available for document categorization


Support vector machine

This algorithm works by looking for a hyperplane that separates the documents of one category from those of another. It’s designed for binary classification, where there are only two categories, but it’s easy to extend to multiple categories by training one separator per category that compares that category to all the others. The phrase “separating hyperplane” might sound intimidating, but it’s no big deal. If a document had just three dimensions, the separator would be an ordinary plane; however, each distinct word in a document is often a dimension, and so you have a space with many dimensions. A “hyperplane” is simply the equivalent of a plane in that higher-dimensional space.
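
The sketch below shows only the decision step: a document is scored against one hyperplane per category, and the category with the highest score wins. The weights, the bias values, and the third “financial” category are made-up numbers for illustration; a real support vector machine learns them from the categorized documents.

```python
def score(weights, bias, doc_counts):
    """Signed score w . x + b; positive means the category's side of the hyperplane."""
    return sum(weights.get(word, 0.0) * count for word, count in doc_counts.items()) + bias


# Hypothetical hand-picked hyperplanes: category -> (word weights, bias).
hyperplanes = {
    "medical": ({"patient": 1.0, "doctor": 0.8, "court": -0.9}, -0.5),
    "legal": ({"court": 1.0, "contract": 0.7, "patient": -0.6}, -0.5),
    "financial": ({"invoice": 1.0, "payment": 0.9}, -0.5),
}


def classify(doc_counts):
    """Compare each category against all others and keep the best score."""
    return max(hyperplanes, key=lambda cat: score(*hyperplanes[cat], doc_counts))


doc = {"court": 2, "contract": 1, "patient": 1}  # word -> count
print(classify(doc))  # legal
```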

K-nearest neighbor

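K-nearest neighbor refers to finding the closest already-categorized documents, again, in multidimensional space. K is just the number of closest documents considered; a new document is typically given the category that is most common among those k neighbors.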
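
As a minimal sketch of the idea, the code below measures closeness with cosine similarity over word counts and lets the k closest labeled documents vote. The similarity measure, the tiny training set, and the function names are assumptions made for the example.

```python
import math
from collections import Counter


def vectorize(text):
    """Word -> count mapping for one document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def knn_predict(train, text, k=3):
    """Return the majority label among the k training documents closest to `text`."""
    query = vectorize(text)
    ranked = sorted(train, key=lambda item: cosine(vectorize(item[0]), query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


train = [
    ("patient diagnosis treatment hospital", "medical"),
    ("doctor prescribed treatment for the patient", "medical"),
    ("surgery patient recovery", "medical"),
    ("court ruling contract dispute", "legal"),
    ("attorney filed the contract lawsuit", "legal"),
    ("judge court lawsuit verdict", "legal"),
]
print(knn_predict(train, "the patient needs treatment", k=3))  # medical
```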

Random forests

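A random forest is a collection of decision trees. Decision tree algorithms learn trees of yes/no tests that lead to a decision. Think of each tree as generating explicit rules to say that if a certain word appears, then the document belongs in a certain category. A random forest builds many such trees, each from a random sample of the documents and words, and lets them vote on the category.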
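
A random forest combines the votes of many decision trees. In the sketch below, three hand-written rule sets stand in for trees and the majority vote decides; a real random forest would learn its trees from random samples of the training documents rather than use rules typed in by hand.

```python
from collections import Counter


# Each "tree" is a hypothetical set of explicit word rules.
def tree_a(words):
    if "patient" in words:
        return "medical"
    return "legal" if "court" in words else "medical"


def tree_b(words):
    if "contract" in words:
        return "legal"
    return "medical" if "doctor" in words else "legal"


def tree_c(words):
    return "medical" if "diagnosis" in words else "legal"


def forest_predict(text):
    """Ask every tree for a category and return the most common answer."""
    words = set(text.lower().split())
    votes = Counter(tree(words) for tree in (tree_a, tree_b, tree_c))
    return votes.most_common(1)[0][0]


print(forest_predict("the doctor saw the patient"))  # medical (2 votes to 1)
```

Note that one tree gets this document wrong, yet the forest as a whole still answers correctly; averaging away individual mistakes is the point of using many trees.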

Recurrent neural networks

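Neural networks are all the rage right now. This algorithm is loosely modeled on the brain: it forms “neurons” and then trains them with the exemplars. As learning progresses, the weights on the connections between neurons are adjusted, resulting in a model that will identify categories. A recurrent network reads the words of a document in sequence, so unlike the other algorithms listed here it can take word order into account.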

Text representations available


Classic “bag of words” – each word is a feature

This represents a document as a bag of words. The order of the words doesn’t matter, and so it’s called a “bag”; only which words appear, and how often, is recorded. For categorization purposes, many algorithms are about 90% accurate without regard to word order. Adding word-order rules makes an efficient algorithm much more complicated and difficult to build.
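
A short sketch of the representation, using a hypothetical `bag_of_words` helper, shows both what is kept (word counts) and what is lost (order):

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


a = bag_of_words("The patient sued the hospital")
b = bag_of_words("The hospital sued the patient")
print(a["the"])  # 2
print(a == b)    # True: same words and counts, different order
```

The two sentences mean different things but produce identical bags, which is exactly the information this representation gives up.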

Latent semantic indexing

This refers to identifying the “latent” concepts in a document, with the idea that there are deeper semantics beyond the words. It does this by starting with an M × N matrix, where M is the number of documents and N is the number of distinct words. A matrix decomposition algorithm is used to represent the matrix in fewer dimensions, and the theory is that those new dimensions are “latent concepts.” This approach is good at matching, say, a document with the word “bomb” with another with the word “explosive” without having to use an explicit thesaurus.
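
Assuming the decomposition is a truncated singular value decomposition, which is the usual choice for latent semantic indexing, the reduction can be written as follows. The symbols X (the document-by-word matrix) and k (the number of latent concepts kept) are notation introduced here for illustration.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{M \times N},\;
U_k \in \mathbb{R}^{M \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{N \times k},\;
k \ll \min(M, N)
```

Each document is then described by its row of U_k Σ_k, that is, by k concept weights instead of N word counts, and each word by its row of V_k. Words that tend to appear alongside the same other words, such as “bomb” and “explosive,” end up with similar rows, which is why documents using either one can match.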
