FreeDiscovery

Version: v0

 


Summary

Path                                          Operations
/api/v0/categorization/                       GET, POST
/api/v0/categorization/{mid}                  DELETE, GET
/api/v0/categorization/{mid}/predict          GET
/api/v0/clustering/birch                      POST
/api/v0/clustering/dbscan                     POST
/api/v0/clustering/k-mean/                    POST
/api/v0/clustering/{method}/{mid}             DELETE, GET
/api/v0/duplicate-detection/                  POST
/api/v0/duplicate-detection/{mid}             DELETE, GET
/api/v0/email-threading/{mid}                 DELETE, GET
/api/v0/example-dataset/{name}                GET
/api/v0/feature-extraction                    GET, POST
/api/v0/feature-extraction/{dsid}             DELETE, GET, POST
/api/v0/feature-extraction/{dsid}/append      POST
/api/v0/feature-extraction/{dsid}/delete      POST
/api/v0/feature-extraction/{dsid}/id-mapping  POST
/api/v0/lsi/                                  GET, POST
/api/v0/lsi/{mid}                             DELETE, GET
/api/v0/metrics/categorization                POST
/api/v0/metrics/clustering                    POST
/api/v0/metrics/duplicate-detection           POST
/api/v0/search/                               POST
/api/v0/stop-words/                           POST
/api/v0/stop-words/{name}                     DELETE, GET

Paths

GET /api/v0/categorization/

List existing categorization models

default

object

method:

string

options:

string

POST /api/v0/categorization/

Build the categorization ML model

The option use_hashing=True must be set for the feature extraction. Recommended options also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • data: a list of dicts, each with a category field and one or more fields that can be used for indexing, such as document_id and, optionally, rendition_id.
  • method: classification algorithm to use (default: LogisticRegression):
    • “LogisticRegression”: logistic regression
    • “LinearSVC”: linear SVM
    • “NearestNeighbor”: nearest neighbor classifier (requires LSI)
  • cv: boolean; if true, the optimal parameters of the ML model are determined by cross-validation over 5 stratified K-folds (default: False).
  • training_scores: boolean; compute the efficiency scores on the training dataset. This makes computations much slower for NearestNeighbor (default: False).

 

cv:

boolean

data:

object[]

object

category:

string

document_id:

integer (int32)

render_id:

integer (int32)

method:

string

parent_id:

string

training_scores:

boolean

default
id:

string

training_scores:

object

average_precision:

number

f1:

number

precision:

number

recall:

number

recall_at_20p:

number

roc_auc:

number
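
As a rough illustration of the parameters above, the request body for training a model could be assembled as follows. This is only a sketch: the helper name build_categorization_request and the sample ids are hypothetical, requiring document_id on every entry is a simplification, and it assumes the parameters are sent as a JSON request body.

```python
import json

ALLOWED_METHODS = {"LogisticRegression", "LinearSVC", "NearestNeighbor"}


def build_categorization_request(parent_id, labelled_docs,
                                 method="LogisticRegression",
                                 cv=False, training_scores=False):
    """Assemble a body for POST /api/v0/categorization/ (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown method: {method!r}")
    data = []
    for doc in labelled_docs:
        # Simplification: insist on document_id as the indexing field.
        if "category" not in doc or "document_id" not in doc:
            raise ValueError("each entry needs 'category' and 'document_id'")
        data.append(dict(doc))
    return {"parent_id": parent_id, "data": data, "method": method,
            "cv": cv, "training_scores": training_scores}


body = build_categorization_request(
    "example-dataset-id",  # placeholder, not a real id
    [{"document_id": 1, "category": "positive"},
     {"document_id": 7, "category": "negative"}],
    method="LinearSVC",
)
print(json.dumps(body, indent=2))
```

The id field of the response would then identify the trained model in the {mid} paths.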

DELETE /api/v0/categorization/{mid}

Delete the categorization model

mid path string
default

GET /api/v0/categorization/{mid}

Load categorization model parameters

mid path string
default
method:

string

options:

string

GET /api/v0/categorization/{mid}/predict

Predict document categorization with a previously trained model

Parameters

  • max_result_categories : the maximum number of categories in the results
  • sort_by : if provided and not None, the field used for sorting results. Valid values include score (the default) or any of the ingested category names.
  • sort_order: the sort order (if applicable), one of [‘ascending’, ‘descending’]
  • max_results : return only the first max_results documents. If max_results <= 0 all documents are returned.
  • ml_output : type of the output in [‘decision_function’, ‘probability’], only affects ML methods.
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].
  • min_score : filter out results below a similarity threshold
  • subset: apply prediction to a document subset. Must be one of [‘all’, ‘train’, ‘test’]. Default: ‘test’.
  • subset_document_id: apply prediction to a subset of document_id.
  • batch_id: retrieve a given subset of scores (-1 to retrieve all). Default: 0
  • batch_size: the number of document scores retrieved per batch. Default: 10000

 

batch_id:

integer (int32)

batch_size:

integer (int32)

10000

max_result_categories:

integer (int32)

1

max_results:

integer (int32)

metric:

string

cosine

min_score:

number

-1

ml_output:

string

probability

sort_by:

string

score

sort_order:

string

descending

subset:

string

test

subset_document_id:

integer[]

integer (int32)

mid path string
default
data:

object[]

object

document_id:

integer (int32)

render_id:

integer (int32)

scores:

object[]

object

category:

string

document_id:

integer (int32)

render_id:

integer (int32)

score:

number

pagination:

object

batch_id:

integer (int32)

batch_id_last:

integer (int32)

current_response_count:

integer (int32)

total_response_count:

integer (int32)
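
The batching parameters and the pagination object suggest a simple paging loop. The sketch below builds the request URL and decides which batch to ask for next. It assumes the parameters may be passed in the query string and that batch_id_last is the index of the final batch; the function names and the base URL are placeholders.

```python
from urllib.parse import urlencode


def predict_url(base_url, mid, **params):
    """Build a GET URL for /api/v0/categorization/{mid}/predict."""
    # Drop parameters left as None so that the server-side defaults apply.
    query = {k: v for k, v in params.items() if v is not None}
    url = f"{base_url}/api/v0/categorization/{mid}/predict"
    return f"{url}?{urlencode(query)}" if query else url


def next_batch_id(pagination):
    """Return the next batch_id to request, or None when no batch is left.

    Assumes batch_id_last is the index of the final batch.
    """
    if pagination["batch_id"] >= pagination["batch_id_last"]:
        return None
    return pagination["batch_id"] + 1


# Placeholder host and model id, for illustration only.
print(predict_url("http://localhost:5001", "example-model-id",
                  subset="test", batch_id=0, batch_size=500))
```

A client would call next_batch_id on the pagination object of each response and stop once it returns None.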

POST /api/v0/clustering/birch

Compute birch clustering

The option use_hashing=False must be set for the feature extraction. Recommended options for data ingestion also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • n_clusters: the number of clusters or -1 to use hierarchical clustering (default: -1)
  • min_similarity: The radius of the subcluster obtained by merging a new sample and the closest subcluster should be less than the threshold. Otherwise a new subcluster is started. See sklearn.cluster.Birch. Increasing this value would increase the hierarchical tree depth (and the number of clusters).
  • branching_factor: Maximum number of CF subclusters in each node. If a new sample enters such that the number of subclusters exceeds the branching_factor, then the node has to be split. The corresponding parent also has to be split, and if the number of subclusters in the parent is greater than the branching factor, then it has to be split recursively. Decreasing this value would increase the number of clusters.
  • max_tree_depth : Maximum hierarchy depth (only applicable when n_clusters=-1)
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].

 

branching_factor:

integer (int32)

20

max_tree_depth:

integer (int32)

metric:

string

cosine

min_similarity:

number

0.5

n_clusters:

integer (int32)

-1

parent_id:

string

default
id:

string
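
A request body for Birch clustering might be put together as in the sketch below. The defaults mirror the values listed above; the helper name and the sample id are hypothetical, and rejecting max_tree_depth when n_clusters is set is a client-side choice made here, not documented server behaviour.

```python
ALLOWED_METRICS = ("cosine", "jaccard", "cosine-positive")


def build_birch_request(parent_id, n_clusters=-1, min_similarity=0.5,
                        branching_factor=20, max_tree_depth=None,
                        metric="cosine"):
    """Assemble a body for POST /api/v0/clustering/birch (hypothetical helper)."""
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unknown metric: {metric!r}")
    # max_tree_depth is only applicable to hierarchical clustering.
    if max_tree_depth is not None and n_clusters != -1:
        raise ValueError("max_tree_depth only applies when n_clusters=-1")
    body = {"parent_id": parent_id, "n_clusters": n_clusters,
            "min_similarity": min_similarity,
            "branching_factor": branching_factor, "metric": metric}
    if max_tree_depth is not None:
        body["max_tree_depth"] = max_tree_depth
    return body


# "example-lsi-id" is a placeholder for a real dataset_id or lsi_id.
print(build_birch_request("example-lsi-id", max_tree_depth=3))
```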

POST /api/v0/clustering/dbscan

Compute clustering (DBSCAN)

The option use_hashing=False must be set for the feature extraction. Recommended options for the data ingestion also include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • min_similarity: The radius of the subcluster obtained by merging a new sample and the closest subcluster should be less than the threshold. Otherwise a new subcluster is started. See sklearn.cluster.Birch.
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].
  • min_samples: (optional) int The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.

 

metric:

string

cosine

min_samples:

integer (int32)

10

min_similarity:

number

0.5

parent_id:

string

default
id:

string

POST /api/v0/clustering/k-mean/

Compute K-mean clustering

The option use_hashing=False must be set for the feature extraction. Recommended options for feature extraction include weighting="ntc".

Parameters

  • parent_id: dataset_id or lsi_id
  • n_clusters: the number of clusters
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].

 

metric:

string

cosine

n_clusters:

integer (int32)

150

parent_id:

string

default
id:

string

DELETE /api/v0/clustering/{method}/{mid}

Delete a clustering model

mid path string
method path string
default

GET /api/v0/clustering/{method}/{mid}

Compute cluster labels

Parameters

  • n_top_words: keep only most relevant n_top_words words
  • return_optimal_sampling : Instead of cluster results, the optimal sampling results will be returned (with no cluster labels). This option is only valid with the Birch algorithm. Note that optimal sampling cannot return more samples than there are subclusters in the Birch clustering results (default: false)
  • sampling_min_similarity : Similarity threshold used by smart sampling. Decreasing this value would result in more sampled documents. Default: 1.0 (i.e. use the full cluster hierarchy).
  • sampling_min_coverage : Minimal coverage requirement in range. Increasing this value would result in a larger number of samples. (default: 0.9)

 

n_top_words:

integer (int32)

5

return_optimal_sampling:

boolean

sampling_min_coverage:

number

0.9

sampling_min_similarity:

number

1

mid path string
method path string
default
data:

object[]

object

children:

integer[]

integer (int32)

cluster_depth:

integer (int32)

cluster_id:

integer (int32)

cluster_label:

string

cluster_similarity:

number

cluster_size:

integer (int32)

documents:

object[]

object

document_id:

integer (int32)

render_id:

integer (int32)

similarity:

number
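
When hierarchical (Birch) clustering is used, the response entries carry children and cluster_depth fields, so the result can be walked as a tree. The sketch below does this depth-first. It assumes that each children list holds the cluster_id values of the child clusters, which the schema does not state explicitly; the sample data is made up.

```python
def iter_cluster_tree(clusters, root_id):
    """Yield (depth, cluster) pairs in depth-first order, starting at root_id.

    Assumes each entry's children list holds cluster_id values.
    """
    by_id = {c["cluster_id"]: c for c in clusters}
    stack = [(0, root_id)]
    while stack:
        depth, cid = stack.pop()
        cluster = by_id[cid]
        yield depth, cluster
        # Push children in reverse so they are visited in listed order.
        for child in reversed(cluster.get("children", [])):
            stack.append((depth + 1, child))


# Made-up response data, for illustration only.
sample = [
    {"cluster_id": 0, "children": [1, 2], "cluster_label": "root", "cluster_size": 3},
    {"cluster_id": 1, "children": [], "cluster_label": "contracts", "cluster_size": 2},
    {"cluster_id": 2, "children": [], "cluster_label": "invoices", "cluster_size": 1},
]
for depth, cluster in iter_cluster_tree(sample, root_id=0):
    print("  " * depth + f"{cluster['cluster_label']} ({cluster['cluster_size']})")
```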

POST /api/v0/duplicate-detection/

Compute near duplicates

Parameters

  • parent_id: dataset_id or lsi_id
  • method: str, default=‘simhash’. Method used for duplicate detection. One of “simhash”, “i-match”.

 

method:

string

simhash

parent_id:

string

default
id:

string

DELETE /api/v0/duplicate-detection/{mid}

mid path string
default

GET /api/v0/duplicate-detection/{mid}

Query duplicates

Parameters

  • distance : int, default=2. Maximum number of different bits in the simhash (Simhash method only)
  • n_rand_lexicons : int, default=1. Number of random lexicons used for duplicate detection (I-Match method only)
  • rand_lexicon_ratio : float, default=0.7. Ratio of the vocabulary used in random lexicons (I-Match method only)
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].

 

distance:

integer (int32)

metric:

string

cosine

n_rand_lexicons:

integer (int32)

rand_lexicon_ratio:

number

mid path string
default
data:

object[]

object

children:

integer[]

integer (int32)

cluster_depth:

integer (int32)

cluster_id:

integer (int32)

cluster_label:

string

cluster_similarity:

number

cluster_size:

integer (int32)

documents:

object[]

object

document_id:

integer (int32)

render_id:

integer (int32)

similarity:

number
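
The response above groups documents into clusters, so one plausible way to extract the duplicate sets is to keep only the clusters that contain more than one document. The sketch below does that; the function name and the sample data are hypothetical, and treating single-document clusters as non-duplicates is an assumption.

```python
def duplicate_groups(response_data):
    """Return sorted document_id lists for clusters with more than one document."""
    groups = []
    for cluster in response_data:
        ids = [doc["document_id"] for doc in cluster.get("documents", [])]
        # Assumption: a cluster with a single document has no duplicates.
        if len(ids) > 1:
            groups.append(sorted(ids))
    return groups


# Made-up response data, for illustration only.
sample_data = [
    {"cluster_id": 0, "documents": [{"document_id": 12}, {"document_id": 4}]},
    {"cluster_id": 1, "documents": [{"document_id": 9}]},
]
print(duplicate_groups(sample_data))
```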

DELETE /api/v0/email-threading/{mid}

Delete a processed dataset

mid path string
default

GET /api/v0/email-threading/{mid}

Get email threading parameters

mid path string
default
group_by_subject:

boolean

GET /api/v0/example-dataset/{name}

Download a benchmark dataset.

The currently supported datasets are listed below:

1. TREC 2009 legal collection

- `treclegal09_2k_subset` : 2 400 documents, 2 MB
- `treclegal09_20k_subset` : 20 000 documents, 30 MB
- `treclegal09_37k_subset` : 37 000 documents, 55 MB
- `treclegal09` : 700 000 documents, 1.2 GB

The ground truth files for categorization are adapted from TAR Toolkit.

2. Fedora mailing list (2009-2009)
- `fedora_ml_3k_subset`

3. The 20 newsgroups dataset
- `20_newsgroups_3categories`: only the categories

If you encounter any issues for downloads with this function,
you can also manually download and extract the required dataset to
``cache_dir`` (the download url is ``http://r0h.eu/d/<name>.tar.gz``),
then re-run this function to get the required metadata.
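
The manual download described above could be scripted roughly as follows. This is a sketch: the URL pattern is the one quoted in the text, the function names are hypothetical, and whether the server is still reachable is not something this example can guarantee.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL pattern quoted in the text above.
DATASET_URL = "http://r0h.eu/d/{name}.tar.gz"


def dataset_url(name):
    """Return the download URL for a named benchmark dataset."""
    return DATASET_URL.format(name=name)


def download_and_extract(name, cache_dir):
    """Fetch the archive and unpack it into cache_dir (performs network I/O)."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / f"{name}.tar.gz"
    urllib.request.urlretrieve(dataset_url(name), archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(cache_dir)
    return cache_dir


print(dataset_url("treclegal09_2k_subset"))
```

After download_and_extract has populated cache_dir, the endpoint can be called again to obtain the metadata.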

 

n_categories:

integer (int32)

2

name path string
default
dataset:

object[]

object

category:

string

document_id:

integer (int32)

file_path:

string

internal_id:

integer (int32)

render_id:

integer (int32)

metadata:

object

data_dir:

string

name:

string

training_set:

object[]

object

category:

string

document_id:

integer (int32)

file_path:

string

internal_id:

integer (int32)

render_id:

integer (int32)

GET /api/v0/feature-extraction

View parameters used for the feature extraction

default

object

analyzer:

string

word

chunk_size:

integer (int32)

data_dir:

string

filenames:

string[]

string

id:

string

max_df:

number

min_df:

number

n_features:

integer (int32)

100001

n_jobs:

integer (int32)

1

n_samples:

integer (int32)

n_samples_processed:

integer (int32)

ngram_range:

integer[]

1,1

integer (int32)

norm_alpha:

number

0.75

overwrite:

boolean

parse_email_headers:

boolean

preprocess:

string[]

string

stop_words:

string

english

use_hashing:

boolean

weighting:

string

nnc

POST /api/v0/feature-extraction

Initialize the feature extraction on a document collection.

Parameters

  • n_features: [optional] number of features (overlapping character/word n-grams that are hashed). n_features refers to the number of buckets in the hash. The larger the number, the fewer collisions. (default: 100001)
  • analyzer: ‘word’, ‘char’, ‘char_wb’. Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries. (default: ‘word’)
  • ngram_range : tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

  • stop_words: “english” or “None”. Remove stop words from the resulting tokens. Only applies for the “word” analyzer. If “english”, a built-in stop word list for English is used. (default: “english”)

  • n_jobs: The maximum number of concurrently running jobs (default: 1)
  • chunk_size: The number of documents simultaneously processed by a running job (default: 5000)
  • weighting: the SMART notation for document term weighting and normalization. In the form , see https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
  • norm_alpha: the alpha value used for pivoted normalization

  • use_hashing: Enable hashing. This option must be set to True for classification and set to False for clustering. (default: True)

  • min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is ignored when hashing is used.
  • max_df: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold. This value is ignored when hashing is used.
  • parse_email_headers: when documents are emails, attempt to parse the information contained in the header (default: False)
  • preprocess: a list of pre-processing steps to apply before vectorization. A subset of [‘emails_ignore_header’], default: [].
  • id: (optional) custom dataset id. Can only contain letters, numbers, “_” or “-”. It must also be between 2 and 50 characters long.
  • overwrite: if a custom dataset id was provided, and it already exists, overwrite it. Default: false

 

analyzer:

string

word

chunk_size:

integer (int32)

data_dir:

string

id:

string

max_df:

number

min_df:

number

n_features:

integer (int32)

100001

n_jobs:

integer (int32)

1

ngram_range:

integer[]

1,1

integer (int32)

norm_alpha:

number

0.75

overwrite:

boolean

parse_email_headers:

boolean

preprocess:

string[]

string

stop_words:

string

english

use_hashing:

boolean

weighting:

string

nnc

default
id:

string
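
The constraints on the custom id and on use_hashing can be checked on the client before the request is sent. The sketch below is one way to do so. The helper names are hypothetical, and it assumes that “letters” means ASCII letters and that the length bounds are inclusive, neither of which the text spells out.

```python
import re

# Assumption: "letters" means ASCII letters, and 2 and 50 are inclusive bounds.
_CUSTOM_ID = re.compile(r"[A-Za-z0-9_-]{2,50}")


def is_valid_custom_id(value):
    """Check the documented rule: letters, numbers, "_" or "-", 2 to 50 chars."""
    return _CUSTOM_ID.fullmatch(value) is not None


def build_feature_extraction_request(for_clustering, custom_id=None,
                                     overwrite=False, **options):
    """Assemble a body for POST /api/v0/feature-extraction (hypothetical helper)."""
    # Hashing must be enabled for classification and disabled for clustering.
    body = {"use_hashing": not for_clustering}
    if custom_id is not None:
        if not is_valid_custom_id(custom_id):
            raise ValueError(f"invalid dataset id: {custom_id!r}")
        body["id"] = custom_id
        body["overwrite"] = overwrite
    body.update(options)
    return body


request = build_feature_extraction_request(
    for_clustering=True, custom_id="my-dataset_01",
    weighting="ntc", ngram_range=[1, 2])
print(request)
```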

DELETE /api/v0/feature-extraction/{dsid}

Delete a processed dataset

dsid path string
default

GET /api/v0/feature-extraction/{dsid}

Load extracted features (and obtain the processing status)

dsid path string

POST /api/v0/feature-extraction/{dsid}

Run feature extraction on a dataset.

Parameters

  • data_dir: [optional] relative path to the directory with the input files. Either data_dir or dataset_definition must be provided.
  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'content': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where document_id and rendition_id are optional, while either file_path or content field must be provided.
  • vectorize: [optional] this option can be used to ingest the dataset_definition in batches (optionally with document content), then make one final call to vectorize all sent documents (bool, default: True)
  • document_id_generator: [optional] if the document_id is not provided, this specifies how it is generated. If indexed_file_path the document_id is given by the index of the sorted file_path, otherwise if infer_file_path the document_id is inferred from the file_path strings, removing all non digit characters. In this second case, the file_path must contain a unique numeric ID (default: indexed_file_path)

 

data_dir:

string

dataset_definition:

object[]

object

content:

string

document_id:

integer (int32)

file_path:

string

rendition_id:

integer (int32)

document_id_generator:

string

indexed_file_path

vectorize:

boolean

true

dsid path string
default
id:

string
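
To make the two document_id_generator modes concrete, the sketch below reproduces them on the client side as described above. It is an illustration of the description, not the server's code: the function names are hypothetical, and starting the index at 0 is an assumption.

```python
import re


def indexed_file_path_ids(file_paths):
    """document_id = position of each path in the sorted list of paths.

    Assumes the index starts at 0.
    """
    return {path: idx for idx, path in enumerate(sorted(file_paths))}


def infer_file_path_ids(file_paths):
    """document_id = the number left after removing all non-digit characters."""
    ids = {}
    for path in file_paths:
        # Digits anywhere in the path count, including in directory names.
        digits = re.sub(r"\D", "", path)
        if not digits:
            raise ValueError(f"no digits in {path!r}")
        ids[path] = int(digits)
    if len(set(ids.values())) != len(ids):
        raise ValueError("inferred document ids are not unique")
    return ids


paths = ["docs/0012.txt", "docs/0003.txt"]
print(indexed_file_path_ids(paths))
print(infer_file_path_ids(paths))
```

Because every digit in the path is kept, a path such as docs/v2/0012.txt would yield 20012 under this reading, which is why the file_path must contain a unique numeric ID.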

POST /api/v0/feature-extraction/{dsid}/append

Add new documents to an existing processed dataset.
This will also automatically update the LSI model if any
is present. Raw documents on disk are not affected.

This operation cannot be undone.

Warning: all categorization, clustering, duplicate detection and
email threading models associated with this dataset will be removed and
need to be re-trained.

Parameters

  • data_dir: [optional] relative path to the directory with the input files. Either data_dir or dataset_definition must be provided.
  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where rendition_id is optional, while either file_path or content field must be provided.

 

data_dir:

string

dataset_definition:

object[]

object

content:

string

document_id:

integer (int32)

file_path:

string

rendition_id:

integer (int32)

dsid path string
default

POST /api/v0/feature-extraction/{dsid}/delete

Remove documents from an existing processed dataset.
This will also automatically update the LSI model if any
is present. Raw documents on disk are not affected.

This operation cannot be undone.

Warning: all categorization, clustering, duplicate detection and
email threading models associated with this dataset will be removed and
need to be re-trained.

Parameters

  • dataset_definition: [optional] a list of dictionaries [{'file_path': <str>, 'document_id': <int>, 'rendition_id': <int>}, ...] where rendition_id is optional.

 

dataset_definition:

object[]

object

document_id:

integer (int32)

file_path:

string

rendition_id:

integer (int32)

dsid path string
default

POST /api/v0/feature-extraction/{dsid}/id-mapping

Compute correspondence between id fields for documents.
At least one of the fields used for indexing must be provided,
and all the rest will be computed (if available).
If the data parameter is not provided, the entire correspondence table is returned.

Parameters

  • data: the ids of documents used as the query
  • return_file_path: whether the results should include the file path

 

data:

object[]

object

document_id:

integer (int32)

file_path:

string

internal_id:

integer (int32)

render_id:

integer (int32)

return_file_path:

boolean

true

dsid path string
default
data:

object[]

object

document_id:

integer (int32)

file_path:

string

internal_id:

integer (int32)

render_id:

integer (int32)

GET /api/v0/lsi/

List existing LSI models

 

parent_id:

string

default

object

n_components:

integer (int32)

parent_id:

string

POST /api/v0/lsi/

Build a Latent Semantic Indexing (LSI) model

Recommended data ingestion options include use_idf=1, sublinear_tf=0, binary=0.

The recommended value for the n_components (dimensions of the SVD decompositions) is
in the range.

Parameters

  • n_components: Desired dimensionality of the output data. Must be strictly less than the number of features.
  • parent_id: parent dataset identified by dataset_id
  • alpha: floor on the number of components used with small datasets
  • id: (optional) custom model id. Can only contain letters, numbers, “_” or “-”. It must also be between 2 and 50 characters long.
  • overwrite: if a custom model id was provided, and it already exists, overwrite it. Default: false

 

alpha:

number

0.33

id:

string

n_components:

integer (int32)

150

overwrite:

boolean

parent_id:

string

default
explained_variance:

number

id:

string

DELETE /api/v0/lsi/{mid}

Delete a Latent Semantic Indexing (LSI) model

mid path string
default

GET /api/v0/lsi/{mid}

Show Latent Semantic Indexing (LSI) model parameters

mid path string
default
n_components:

integer (int32)

parent_id:

string

POST /api/v0/metrics/categorization

Compute categorization metrics to assess the quality
of categorization.

In the case of binary categorization, category labels are sorted alphabetically
and the second one is expected to be the positive one.

Parameters

  • y_true: ground truth categorization data
  • y_pred: predicted categorization results
  • metrics: list of str. Metrics to compute, any combination of “precision”, “recall”, “f1”, “roc_auc”

 

metrics:

string[]

string

y_pred:

object[]

object

document_id:

integer (int32)

render_id:

integer (int32)

scores:

object[]

object

category:

string

document_id:

integer (int32)

render_id:

integer (int32)

score:

number

y_true:

object[]

object

category:

string

document_id:

integer (int32)

render_id:

integer (int32)

default
average_precision:

number

f1:

number

precision:

number

recall:

number

recall_at_20p:

number

roc_auc:

number
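
The convention for binary categorization (labels sorted alphabetically, the second one taken as positive) can be illustrated with a small local computation of precision, recall and F1. The sketch uses the standard definitions of these metrics and is not the server's implementation; for brevity it takes plain document_id to category mappings rather than the y_true and y_pred structures of the request, and the labels are made up.

```python
def binary_metrics(y_true, y_pred):
    """Compute precision, recall and f1 for dicts of document_id -> category."""
    labels = sorted(set(y_true.values()))
    if len(labels) != 2:
        raise ValueError("expected exactly two categories")
    positive = labels[1]  # the second label in alphabetical order
    tp = sum(1 for d, c in y_true.items()
             if c == positive and y_pred.get(d) == positive)
    fp = sum(1 for d, c in y_true.items()
             if c != positive and y_pred.get(d) == positive)
    fn = sum(1 for d, c in y_true.items()
             if c == positive and y_pred.get(d) != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels: "relevant" sorts after "irrelevant", so it is the positive class.
truth = {1: "relevant", 2: "relevant", 3: "irrelevant", 4: "irrelevant"}
predicted = {1: "relevant", 2: "irrelevant", 3: "relevant", 4: "irrelevant"}
print(binary_metrics(truth, predicted))
```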

POST /api/v0/metrics/clustering

Compute clustering metrics to assess the quality
of clustering, comparing the ground truth labels with the predicted ones.

Parameters

  • labels_true: list of int. Ground truth clustering labels
  • labels_pred: list of int. Predicted clustering labels
  • metrics: list of str. Metrics to compute, any combination of “adjusted_rand”, “adjusted_mutual_info”, “v_measure”

 

labels_pred:

integer[]

integer (int32)

labels_true:

integer[]

integer (int32)

metrics:

string[]

string

default
adjusted_mutual_info:

number

adjusted_rand:

number

v_measure:

number

POST /api/v0/metrics/duplicate-detection

Compute duplicate detection metrics to assess the quality
of duplicate detection, comparing the ground truth labels with the predicted ones.

Parameters

  • labels_true: list of int. Ground truth clustering labels
  • labels_pred: list of int. Predicted clustering labels
  • metrics: list of str. Metrics to compute, any combination of “ratio_duplicates”, “f1_same_duplicates”, “mean_duplicates_count”

 

labels_pred:

integer[]

integer (int32)

labels_true:

integer[]

integer (int32)

metrics:

string[]

string

default
f1_same_duplicates:

number

mean_duplicates_count:

number

ratio_duplicates:

number

POST /api/v0/search/

Perform a document search (if parent_id is a dataset_id) or a semantic search (if parent_id is an lsi_id).

Parameters

  • parent_id : the id of the previous processing step (either dataset_id or lsi_id)
  • query : the search query. Either query or query_document_id must be provided.
  • query_document_id : the id of the document used as the search query. Either query or query_document_id must be provided.
  • metric : The similarity returned by nearest neighbor classifier in [‘cosine’, ‘jaccard’, ‘cosine-positive’].
  • min_score : filter out results below a similarity threshold
  • max_results : return only the first max_results documents. If max_results <= 0 all documents are returned.
  • sort_by : if provided and not None, the field used for sorting results. Valid values include score (the default).
  • sort_order: the sort order (if applicable), one of [‘ascending’, ‘descending’]
  • batch_id: retrieve a given subset of scores (-1 to retrieve all). Default: 0
  • batch_size: the number of document scores retrieved per batch. Default: 10000
  • subset_document_id: apply prediction to a subset of document_id.

 

batch_id:

integer (int32)

batch_size:

integer (int32)

10000

max_results:

integer (int32)

metric:

string

cosine

min_score:

number

-1

parent_id:

string

query:

string

query_document_id:

integer (int32)

sort_by:

string

score

sort_order:

string

descending

subset_document_id:

integer[]

integer (int32)

default
data:

object[]

object

document_id:

integer (int32)

render_id:

integer (int32)

score:

number

pagination:

object

batch_id:

integer (int32)

batch_id_last:

integer (int32)

current_response_count:

integer (int32)

total_response_count:

integer (int32)
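
A search request needs either a text query or the id of a query document. The sketch below assembles such a body and rejects the ambiguous cases on the client side. The helper name and the sample ids are hypothetical, and refusing a request that sets both fields is a conservative choice made here, since the text does not say how the server handles that case.

```python
def build_search_request(parent_id, query=None, query_document_id=None,
                         **options):
    """Assemble a body for POST /api/v0/search/ (hypothetical helper)."""
    # Conservative choice: require exactly one of the two query fields.
    if (query is None) == (query_document_id is None):
        raise ValueError("provide exactly one of query or query_document_id")
    body = {"parent_id": parent_id}
    if query is not None:
        body["query"] = query
    else:
        body["query_document_id"] = query_document_id
    # Remaining options (max_results, min_score, ...) are passed through as-is.
    body.update(options)
    return body


# "example-lsi-id" is a placeholder; an lsi_id would make this a semantic search.
print(build_search_request("example-lsi-id", query="contract renewal",
                           max_results=20, min_score=0.3))
```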

POST /api/v0/stop-words/

Store a list of custom stop words

 

name:

string

stop_words:

string[]

string

default
name:

string

stop_words:

string[]

string

DELETE /api/v0/stop-words/{name}

Delete a stored list of custom stop words

name path string
default

GET /api/v0/stop-words/{name}

Load a stored list of stop words

name path string
default
name:

string

stop_words:

string[]

string
