
# Feature Extraction

## Feature extraction

For a general introduction to feature extraction with textual documents see the scikit-learn documentation.

### TF-IDF schemes

#### SMART TF-IDF schemes

FreeDiscovery extends `sklearn.feature_extraction.text.TfidfTransformer` with a larger number of TF-IDF weighting and normalization schemes in `SmartTfidfTransformer`. It follows the SMART Information Retrieval System notation.

The different options are described in more detail in the table below.

| Term frequency | Document frequency | Normalization |
|---|---|---|
| n (natural): $\text{tf}_{t,d}$ | n (no): $1$ | n (none): $1$ |
| l (logarithm): $1 + \log(\text{tf}_{t,d})$ | t (idf): $\log \tfrac{N}{df_{t}}$ | c (cosine): $\sqrt{\sum_{t \in d} w_{t}^{2}}$ |
| a (augmented): $0.5 + \tfrac{0.5 \times \text{tf}_{t,d}}{\max(\text{tf}_{t,d})}$ | s (smoothed idf): $\log \tfrac{N + 1}{df_{t} + 1}$ | l (length): $\sum_{t \in d} \lvert w_{t} \rvert$ |
| b (boolean): $\begin{cases}1, & \text{if } \text{tf}_{t,d} > 0 \\ 0, & \text{otherwise}\end{cases}$ | p (prob idf): $\log \tfrac{N - df_{t}}{df_{t}}$ | u (unique): $\sum_{t \in d} \textbf{bool}\left(\lvert w_{t} \rvert\right)$ |
| L (log average): $\tfrac{1 + \log(\text{tf}_{t,d})}{1 + \log(\text{avg}_{t \in d}(\text{tf}_{t,d}))}$ | d (smoothed prob idf): $\log \tfrac{N + 1 - df_{t}}{df_{t} + 1}$ | |
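To make the three-letter notation concrete, the following is a minimal pure-Python sketch of how a code such as `ltc` could be read as one choice per column of the table. It is not the FreeDiscovery implementation: the function name `smart_weights` is hypothetical, only a subset of the schemes is covered (term frequency `n`, `l`, `b`; document frequency `n`, `t`, `s`; normalization `n`, `c`, `l`), and the natural logarithm is assumed.

```python
import math


def smart_weights(counts, scheme="ltc"):
    """Weight rows of raw term counts with a three-letter SMART code.

    `counts` is a list of documents, each a list of term counts.
    Hypothetical helper for illustration; natural log is an assumption.
    """
    tf_code, df_code, norm_code = scheme
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df_t: number of documents in which term t occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

    def tf_weight(tf):
        # Only called for tf > 0; absent terms keep a weight of zero.
        if tf_code == "n":
            return float(tf)
        if tf_code == "l":
            return 1.0 + math.log(tf)
        if tf_code == "b":
            return 1.0
        raise ValueError(f"unsupported term frequency code: {tf_code}")

    def df_weight(j):
        if df_code == "n":
            return 1.0
        if df_code == "t":
            return math.log(n_docs / df[j])
        if df_code == "s":
            return math.log((n_docs + 1) / (df[j] + 1))
        raise ValueError(f"unsupported document frequency code: {df_code}")

    weighted = []
    for row in counts:
        w = [tf_weight(tf) * df_weight(j) if tf > 0 else 0.0
             for j, tf in enumerate(row)]
        if norm_code == "n":
            norm = 1.0
        elif norm_code == "c":
            norm = math.sqrt(sum(x * x for x in w))
        elif norm_code == "l":
            norm = sum(abs(x) for x in w)
        else:
            raise ValueError(f"unsupported normalization code: {norm_code}")
        # Leave all-zero rows untouched to avoid dividing by zero.
        weighted.append([x / norm for x in w] if norm > 0 else w)
    return weighted


counts = [[1, 0, 2], [0, 3, 1]]
cosine_normalized = smart_weights(counts, "nnc")
```

With the `nnn` code the counts are returned unchanged (as floats), while any code ending in `c` yields rows of unit Euclidean length.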

#### Pivoted document length normalization

In addition to the standard TF-IDF normalizations above, pivoted normalization was proposed by Singhal et al. as a way to avoid over-penalising long documents. It can be enabled with the `weighting='???p'` parameter. For each document, the normalization term $V_d$ is replaced by

$$(1 - \alpha)\,\text{avg}(V_d) + \alpha V_d$$

where $\alpha$ (`norm_alpha`) is a user-defined parameter such that $\alpha \in [0, 1]$. If `norm_alpha=1`, the pivot cancels out and the expression reduces to regular TF-IDF normalization.
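The effect of the pivot can be checked on a few numbers. The sketch below applies the expression above to a list of per-document normalization terms; the function name `pivoted_norms` is hypothetical, and taking the average over all documents in the collection is an assumption about how the pivot is computed.

```python
def pivoted_norms(norms, alpha):
    """Replace each normalization term V_d by (1 - alpha) * avg(V_d) + alpha * V_d.

    Hypothetical helper: `norms` holds one V_d per document and the pivot
    is assumed to be their mean over the whole collection.
    """
    pivot = sum(norms) / len(norms)
    return [(1.0 - alpha) * pivot + alpha * v for v in norms]


# One short and one long document, with alpha = 0.75
adjusted = pivoted_norms([1.0, 3.0], alpha=0.75)
```

For values of `alpha` below 1, the normalization term of the long document shrinks toward the average and that of the short document grows toward it, so long documents are penalised less. With `alpha=1` the input is returned unchanged.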

See the example on Optimizing TF-IDF schemes for a more practical illustration.