Optimizing TF-IDF schemes
An example of optimizing TF-IDF weighting schemes using 5-fold cross-validation.

We load and vectorize two classes from the 20 newsgroups dataset, then compute the baseline categorization performance using logistic regression and the TF-IDF transformer from scikit-learn.
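A minimal sketch of this step (the two categories, the pipeline layout, and the use of cross_val_score are assumptions made for illustration; the original script may differ):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Two classes from the 20 newsgroups dataset (the category choice is
    # an assumption made for illustration).
    twenty = fetch_20newsgroups(subset="train",
                                categories=["sci.space", "comp.graphics"])

    # Vectorize, apply the standard TF-IDF weighting, then classify.
    baseline = make_pipeline(CountVectorizer(), TfidfTransformer(),
                             LogisticRegression())

    scores = cross_val_score(baseline, twenty.data, twenty.target, cv=5)
    print("Baseline TF-IDF categorization accuracy: {:.3f}".format(scores.mean()))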
Out:
Baseline TF-IDF categorization accuracy: 0.973
Next, we search, using 5-fold cross-validation, for the best TF-IDF weighting scheme among the 80+ combinations supported by SmartTfidfTransformer. Two hyper-parameters are worth optimizing in this case:

- weighting: the parameter that defines the TF-IDF weighting (see the TF-IDF schemes section of the user manual for more details);
- norm_alpha: the α parameter in the pivoted normalization, used when weighting=="???p" (see the note just below).
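Concretely, assuming SmartTfidfTransformer follows the pivoted-normalization parametrization of Singhal et al. (1996) (an assumption on our part; see the user manual for the exact formula), the document normalizer becomes

    (1 - α) * pivot + α * ||d||

where ||d|| is the norm selected by the scheme and pivot is typically its average over the collection, so that α = 1 recovers the plain, non-pivoted norm.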
To reduce the search parameter space in this example, we can also exclude the cases where the term weighting, the feature (IDF) weighting, or the normalization is not used, as these are expected to yield worse than baseline performance. We also exclude the non-smoothed IDF weightings (?t?, ?p?), since they return NaNs when some document frequencies are 0 (which will be the case during cross-validation). Finally, by noticing that the scheme xxxp with norm_alpha=1.0 corresponds to the scheme xxx (i.e. with pivoted normalization disabled), we can reduce the search space even further, as in the sketch below.
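A hedged sketch of this grid search, reusing the data and imports from the sketch above. The SMART character sets (term weightings "labLd", smoothed IDF weightings "sd", normalizations "clu") and the α grid are assumptions, chosen so that the grid contains 5 × 2 × 3 = 30 pivoted schemes and 10 α values, i.e. the 300 candidates reported below; the import path freediscovery.feature_weighting is also an assumption:

    from itertools import product

    import numpy as np
    from freediscovery.feature_weighting import SmartTfidfTransformer
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", SmartTfidfTransformer()),
                     ("clf", LogisticRegression())])

    # Pivoted schemes from the reduced search space described above
    # (character sets are assumptions): e.g. "lsup", "ascp", ...
    schemes = ["{}{}{}p".format(t, d, n)
               for t, d, n in product("labLd", "sd", "clu")]

    param_grid = {"tfidf__weighting": schemes,
                  "tfidf__norm_alpha": np.linspace(0, 1, 10)}

    grid = GridSearchCV(pipe, param_grid, cv=5, verbose=1, n_jobs=-1)
    grid.fit(twenty.data, twenty.target)

    best = grid.best_params_
    print("Best CV params: weighting={}, norm_alpha={:.3f}".format(
        best["tfidf__weighting"], best["tfidf__norm_alpha"]))
    print("Best TF-IDF categorization accuracy: {:.3f}".format(grid.best_score_))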
Out:
Fitting 5 folds for each of 300 candidates, totalling 1500 fits
Best CV params: weighting=lsup, norm_alpha=0.778
Best TF-IDF categorization accuracy: 0.990
In this example, by tuning the TF-IDF weighting scheme with pivoted normalization, we obtain a categorization accuracy of 0.990, compared to a baseline TF-IDF score of 0.973. It is also interesting to notice that the best weighting hyper-parameter in this case is lsup, which corresponds to the “unique pivoted normalization” case proposed by Singhal et al. (1996), although with a different α value.
Total running time of the script: (1 minute 47.518 seconds)