Frequently Asked Questions
What is FreeDiscovery?
FreeDiscovery is an open-source Web service tier that lets GUI developers access complex analytics without having to deal with the nuances of a machine learning or artificial intelligence package. Key features include document categorization (the heart of e-discovery, hence the name), duplicate detection, document clustering, and e-mail threading. For document categorization, several popular algorithms are available (e.g., latent semantic indexing, support vector machines, random forests, logistic regression, recurrent neural networks). For duplicate detection, FreeDiscovery implements a well-documented and widely accepted algorithm that is efficient and scalable. Document clustering offers a few key algorithms, including one that does not require the user to specify the number of clusters in advance. E-mail threading is the least mature feature, but it implements a reasonably well-accepted algorithm that identifies threads within groups of e-mails.
Why do we need FreeDiscovery? Can’t we just call the underlying algorithms without a Web service?
Yes, but suppose you’re writing code in Java or .NET: calling a Python-based library is non-trivial. Once that library is wrapped as a Web service, calling it is no different from calling any other service (e.g., currency conversion).
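To illustrate the point, here is a minimal sketch of what the client side of such a call could look like. It is written in Python for brevity, but the same HTTP request could be built from Java or .NET. The base URL, the /categorize path, and the JSON field names are hypothetical placeholders chosen for this example; they are not FreeDiscovery's actual API.

```python
import json
import urllib.request

# Hypothetical values for illustration only; not FreeDiscovery's real API.
BASE_URL = "http://localhost:5001"
CATEGORIZE_PATH = "/categorize"


def build_categorize_request(documents, method="logistic_regression"):
    """Build (but do not send) a JSON POST request for a categorization call."""
    payload = {"method": method, "documents": documents}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + CATEGORIZE_PATH,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_categorize_request(
    ["Quarterly report attached.", "Lunch on Friday?"]
)
print(request.get_method(), request.full_url)
# Sending it would be one more call, urllib.request.urlopen(request),
# which needs a running service at BASE_URL.
```

The client only needs an HTTP library and a JSON encoder, both of which exist in every mainstream language, so no Python interoperability layer is required on the caller's side.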
Does it actually work?
Yes. We tested it on a large collection (700,000 documents) with known ground truth and used FreeDiscovery to compare various algorithms. We did this to find out which algorithm worked best on this dataset, and also to exercise FreeDiscovery.
Does anyone use FreeDiscovery?
The development of FreeDiscovery was initially supported by a very large e-discovery vendor. Parts of FreeDiscovery are now embedded in that vendor's core user interface, which has allowed the vendor to deploy numerous algorithms very quickly.
Is FreeDiscovery limited to just the e-discovery arena?
No. It works for e-discovery, but the underlying algorithms are generic, so anyone who needs document clustering, categorization, or duplicate detection can benefit.
Does it only work on documents?
No. Categorization is treated as a standard machine learning problem, so typical sets of structured features and labels can be used. Tasks such as deciding which ads should go on which Web pages, or choosing the right diagnosis for a given patient, are well suited to the machine learning algorithms available through FreeDiscovery.
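As a rough illustration of what "structured features and labels" means, the sketch below trains a tiny classifier on numeric feature vectors rather than on text. It uses a nearest-centroid rule only because it fits in a few lines; it is not one of the algorithms listed above, and the data and function names are made up for this example.

```python
from collections import defaultdict


def train_centroids(features, labels):
    """Average the feature vectors of each label (nearest-centroid training)."""
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + v for s, v in zip(sums[label], vector)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, vector):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((c - v) ** 2 for c, v in zip(centroids[label], vector))
    return min(centroids, key=distance)


# Made-up structured data: [age, temperature_celsius] -> label.
features = [[25, 36.8], [31, 37.0], [47, 39.2], [52, 39.5]]
labels = ["healthy", "healthy", "fever", "fever"]

centroids = train_centroids(features, labels)
prediction = predict(centroids, [40, 39.0])
print(prediction)
```

The inputs are plain rows of numbers paired with labels, the same shape of data any categorization algorithm consumes once documents have been converted to feature vectors.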
Is FreeDiscovery scalable to very large collections?
Right now, FreeDiscovery runs as a Web service on a single server. It will run fine in the AWS cloud on large instance sizes, but currently you can only have one instance. Most of the algorithms selected for FreeDiscovery scale to very large collections. Clearly, more work can be done on scalability, but we expect the software to be sufficient for initial testing. In the future, we want FreeDiscovery to be able to spread machine learning problems across multiple machines. This can be done easily for some algorithms (e.g., random forests), but not so easily for others, such as latent semantic indexing. So the answer may never be, “Yes, we scale to petabytes with no problem,” but something closer to, “Depending on what you want to do, we may be able to do this in FreeDiscovery.”
What does it cost to run FreeDiscovery?
There is no charge for the software.