Fully functional web service to hide the details of calling a complex machine learning library to perform
Put a set of documents into categories, e.g., “medical” or “legal,” etc. If you have some documents categorized, you can use FreeDiscovery to teach the machine to categorize new ones.
Take a set of documents and put them into natural clusters; FreeDiscovery will organize them into lumps of related documents. Many clustering algorithms require you to figure out in advance how many clusters exist. FreeDiscovery doesn’t; it simply clusters the collection and tries to come up with a logical name for each cluster. You can use this if you get a big collection and you want to know essentially what’s in it.
If you have a group of documents with some duplicates, duplicate detection will identify them, and it does so in a smart way, i.e., it doesn’t have to be a 100% duplicate.
If you have a group of e-mails that includes various conversations, this algorithm will find the conversations
Algorithms available for document categorization
Support vector machine
This algorithm works by looking for a separating hyperplane that separates the documents into one category versus another. It’s designed for binary classification—where there are only two categories—but it’s easy to extend to multiple categories by simply comparing each category to all other categories. The phrase “separating hyperplane” might sound intimidating, but it’s no big deal. If a document had just three dimensions, it would be a plane; however, each distinct word in a document is often a dimension, and so you have a multidimensional space, i.e., a “hyperplane,” which is simply a plane that exists in more than three dimensions.
K-nearest neighbor refers to finding the closest documents, again, in multidimensional space. So K is just the k closest documents.
Decision tree algorithms find decision trees that allow you to make decisions. Think of it as generating explicit rules to say that if a certain word appears, then the document belongs in a certain category.
Recurrent neural networks
Neural networks are all the rage right now. This algorithm models the brain by forming “neurons” and then training them with the exemplars. As learning progresses, weights for each neuron are adjusted, resulting in an algorithm that will identify categories.
Text representations available
Classic “bag of words” – each word is a feature
This represents a document with a bag of words. The order of the words doesn’t matter, and so it’s called a “bag.” For categorization purposes, many algorithms are 90% accurate without regard to word order. Building in word-order rules makes building an efficient algorithm much more complicated and difficult.
Latent semantic indexing
This refers to identifying the “latent” concepts in a document, with the idea is that there are deeper semantics beyond the words. It does this by starting with a M x N matrix, where M is the number of documents and N is the number of distinct words. A matrix decomposition algorithm is used to represent the matrix in fewer dimensions, and the theory is that those new dimensions are “latent concepts.” This approach is good at matching, say, a document with the word “bomb” with another with the word “explosive” without having to use an explicit thesaurus.