Weight by Term Frequency - Inverse Document Frequency


Weight a term-document matrix by term frequency - inverse document frequency.


weightTfIdf(m, normalize = TRUE)



m: A TermDocumentMatrix in term frequency format.


normalize: A Boolean value indicating whether the term frequencies should be normalized.


Formally, this function is of class WeightFunction with the additional attributes name and acronym.

Term frequency $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. In the case of normalization, the term frequency $\mathit{tf}_{i,j}$ is divided by $\sum_k n_{k,j}$.

Inverse document frequency for a term $t_i$ is defined as

$$\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$$

where $|D|$ denotes the total number of documents and where $|\{d \mid t_i \in d\}|$ is the number of documents where the term $t_i$ appears.

Term frequency - inverse document frequency is now defined as $\mathit{tf}_{i,j} \cdot \mathit{idf}_i$.
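
The definitions above can be traced by hand on a small count matrix. The following is a minimal base R sketch, not the package's implementation: the helper name tfidf_sketch and the toy counts are hypothetical. It assumes a plain numeric matrix with terms in rows and documents in columns, where every document is non-empty and every term occurs in at least one document.

```r
# Hypothetical helper (not part of tm): tf-idf on a plain count matrix,
# rows = terms, columns = documents.
tfidf_sketch <- function(counts, normalize = TRUE) {
  tf <- counts
  if (normalize) {
    # Divide column j by the total number of term occurrences in document j.
    tf <- sweep(counts, 2, colSums(counts), "/")
  }
  # Number of documents in which each term appears.
  df <- rowSums(counts > 0)
  # Inverse document frequency with a base-2 logarithm.
  idf <- log2(ncol(counts) / df)
  # idf has one entry per row, so recycling multiplies row i by idf[i].
  tf * idf
}

# Made-up toy data: three terms in three documents.
counts <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apple", "banana", "cherry"), c("d1", "d2", "d3"))
)

weights <- tfidf_sketch(counts)
print(weights)
```

In this toy matrix each term appears in two of the three documents, so every term has the same inverse document frequency, log2(3/2), and the weights differ only through the normalized term frequencies. A term that occurs in every document would get weight zero.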


The weighted matrix.
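
A minimal usage sketch, assuming the tm package is installed. The three toy documents and the variable names are made up for illustration; only VCorpus, VectorSource, TermDocumentMatrix, weightTfIdf, and inspect come from tm.

```r
library(tm)

# Made-up toy corpus with three short documents.
docs <- c("apple banana apple", "banana cherry", "cherry cherry apple")
corpus <- VCorpus(VectorSource(docs))

# Build a term-document matrix in term frequency format.
tdm <- TermDocumentMatrix(corpus)

# Apply tf-idf weighting with normalized term frequencies.
w <- weightTfIdf(tdm, normalize = TRUE)
inspect(w)
```

Passing normalize = FALSE instead keeps the raw occurrence counts as term frequencies before they are multiplied by the inverse document frequency.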


