FreeDiscovery – Data Ingestion
Format of processed files
FreeDiscovery can process document collections, where files are stored within a folder hierarchy, with the assumption that each separate file corresponds to a document.
For ingesting a custom dataset, it is necessary is to place the data collection on the FreeDiscovery server (or within the folder mounted inside the Docker container) and provide either the name of the folder where the data is located (
data_dir) or individual paths to every file in the collection (
the data_dir and file_path should be absolute paths.
Each document in the collection is given a unique numerical
internal_id that is used to identify documents internally. This field is however not exposed by the REST API. Instead, the following fields can also be used for indexing,
file_path: the absolute path to the file. Note that when FreeDiscovery is provided with a list of
data_dirwill be recomputed as the longest common path.
document_id: an external numeric document id provided in the
document_idtogether with a
rendition_id, where the latter is also provided in the
POST /api/v0/feature-extraction/<dataset-id>processing step
To use a field (or a group of fields) as a index, it must be unique, meaning that duplicates are not supported. The only field that is unique by construction is
internal_id, all the rest being user provided. Additional information regarding data ingestion and indexing can be found in the REST_data_ingestion example. The mapping between different index fields is provided by
The training of the categorization model also uses a similar mechanism to identify training set documents.
the different fields above can always be associated to the processed documents, however attempting to index (e.g. select) a subset of documents using a field (or a group of fields) with duplicates will result in an error.
A few example datasets (subsets of the TREC Legal 2009 collection) can be automatically downloaded by FreeDiscovery with the following command,
These datasets are in particular used in the examples and contain the ground truth classification labels.