Open-source software for e-discovery & machine learning

Download the FREE White Paper: Effectiveness Results for Popular e-Discovery Algorithms

Originally Presented at the 2017 International Conference on Artificial Intelligence and Law at King’s College in London.

FreeDiscovery is a free, open-source, fully functional web service that hides the details of calling a complex machine-learning library, letting you perform:

Document categorization
Put a set of documents into categories such as “medical” or “legal.” If you already have some documents categorized, you can use FreeDiscovery to teach the machine to categorize new ones.
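As an illustration of the idea (not FreeDiscovery's own API), here is a minimal supervised categorization sketch using scikit-learn; the documents and labels are invented examples:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A few hand-labeled training documents (invented examples).
train_docs = [
    "patient diagnosis and treatment records",
    "hospital discharge summary for the patient",
    "contract clause governing liability",
    "the parties agree to binding arbitration",
]
train_labels = ["medical", "medical", "legal", "legal"]

# Turn the text into TF-IDF feature vectors, then fit a classifier.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
clf = LogisticRegression().fit(X_train, train_labels)

# Categorize a new, unlabeled document.
predicted = clf.predict(vectorizer.transform(["arbitration clause in the contract"]))
```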
Document clustering
Take a set of documents and group them into natural clusters of related documents. FreeDiscovery clusters the collection and comes up with a descriptive name for each cluster. You can use this when you get a big collection and want to know, essentially, what’s in it.
Duplicate detection
If you have a group of documents with some duplicates, duplicate detection will identify them. It does so in a smart way: a document doesn’t have to be a 100% match to be flagged, so near-duplicates are caught as well.
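One common way to detect near-duplicates (a sketch, not necessarily FreeDiscovery's method) is to compare documents by cosine similarity of their TF-IDF vectors and flag pairs above a threshold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The quarterly report is attached for your review.",
    "The quarterly report is attached for review.",  # near-duplicate of the first
    "Lunch menu for the cafeteria next week.",
]
X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)

# Flag any pair whose similarity exceeds a threshold; 0.8 is an
# arbitrary choice for this toy example.
near_duplicates = [
    (i, j)
    for i in range(len(docs))
    for j in range(i + 1, len(docs))
    if sim[i, j] > 0.8
]
```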
E-mail threading
If you have a group of e-mails containing various conversations, this algorithm will reconstruct the conversations.
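Real threaders rely on headers such as In-Reply-To and References; as a rough stand-in, this toy sketch (invented example, not FreeDiscovery's algorithm) groups messages by normalized subject line:

```python
import re
from collections import defaultdict

subjects = [
    "Budget approval",
    "Re: Budget approval",
    "RE: Re: Budget approval",
    "Lunch on Friday",
]

def normalize(subject):
    # Strip any run of leading "Re:" / "Fw:" / "Fwd:" prefixes.
    return re.sub(r"^(?:(?:re|fwd?)\s*:\s*)+", "", subject, flags=re.I).strip().lower()

# Messages sharing a normalized subject fall into the same thread.
threads = defaultdict(list)
for subject in subjects:
    threads[normalize(subject)].append(subject)
```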
Support vector machine
This algorithm looks for a hyperplane that separates the documents in one category from those in another. It is designed for binary classification, where there are only two categories, but it is easily extended to multiple categories by comparing each category against all the others.
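The one-vs-rest extension can be made explicit with scikit-learn (an illustrative sketch with invented documents): one binary linear SVM is trained per category, each separating that category from all the others.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

docs = [
    "heart surgery notes", "patient blood test results",
    "merger agreement draft", "contract breach claim",
    "server outage report", "database backup log",
]
labels = ["medical", "medical", "legal", "legal", "it", "it"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary SVM per category, each comparing that category
# against all the others.
model = OneVsRestClassifier(LinearSVC()).fit(X, labels)
pred = model.predict(vec.transform(["patient heart test"]))
```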
K-nearest neighbor
K-nearest neighbor refers to finding the closest documents, again in multidimensional space: the k nearest neighbors of a new document vote on its category.
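A minimal sketch of the voting idea, using toy 2-D points in place of high-dimensional document vectors:

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D "document vectors": two well-separated groups.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["a", "a", "a", "b", "b", "b"]

# Classify a new point by majority vote among its k = 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
pred = knn.predict([[0.5, 0.5], [5.5, 5.5]])
```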
Random forests
Decision tree algorithms generate explicit rules: if a certain word appears, then the document belongs in a certain category. A random forest trains many such trees on random subsets of the data and features and lets them vote, which is usually more accurate than any single tree.
Recurrent neural networks
Neural networks are all the rage right now. This algorithm models the brain by forming “neurons” and then training them with the exemplars. As learning progresses, the weights of the neurons are adjusted, resulting in an algorithm that will identify categories. Recurrent networks add connections that loop back, so the network can process word sequences rather than isolated features.
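The weight-adjustment idea can be sketched with a single artificial neuron (a perceptron); this toy deliberately omits the recurrent connections that give RNNs their name:

```python
import numpy as np

# Toy perceptron: one "neuron" whose weights are nudged after each
# misclassified exemplar. Feature 0 alone determines the category.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
y = np.array([1, 1, 0, 0])  # 1 = in the category, 0 = not

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(10):  # a few passes over the exemplars
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        w += lr * (yi - pred) * xi  # adjust the weights on error
        b += lr * (yi - pred)

preds = [1 if xi @ w + b > 0 else 0 for xi in X]
```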
Latent semantic indexing
This refers to identifying the “latent” concepts in a collection, the idea being that there are deeper semantics beyond the words themselves. It starts with an M x N matrix, where M is the number of documents and N is the number of distinct words, and uses a matrix decomposition algorithm to represent the matrix in far fewer dimensions.
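In scikit-learn terms (a sketch with invented documents), the decomposition step is a truncated SVD applied to the M x N document-term matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "cat sits on the mat", "kitten sleeps on the mat",
    "dog plays in the yard", "puppy runs in the yard",
]
X = TfidfVectorizer().fit_transform(docs)  # M x N document-term matrix

# Decompose into 2 latent dimensions instead of N word dimensions.
lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)  # each row is a document in latent-concept space
```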
Classic “bag of words” – each word is a feature
This represents a document as a bag of words: the order of the words doesn’t matter, which is why it’s called a “bag.” For categorization purposes, many algorithms reach roughly 90% accuracy without regard to word order, and building word-order rules into an algorithm makes it much more complicated and difficult to keep efficient.
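A quick demonstration of why order is discarded: two sentences with the same words in a different order produce identical bags of words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with the same words in a different order.
vec = CountVectorizer()
X = vec.fit_transform(["the dog bit the man", "the man bit the dog"])

# Identical word counts: a bag-of-words model cannot tell
# these two sentences apart.
counts = X.toarray()
same = (counts[0] == counts[1]).all()
```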