weightSMART {tm}    R Documentation

SMART Weightings

Description

Weight a term-document matrix according to a combination of weights specified in SMART notation.

Usage

weightSMART(m, spec = "nnn", control = list())

Arguments

m

A TermDocumentMatrix in term frequency format.

spec

A character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for available built-in schemata.

control

A list of control parameters. See Details.
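
As a rough illustration of how a three-character spec splits into its components, the following base R sketch separates and checks the letters. The helper name parse_spec is hypothetical and is not part of tm; the allowed letters are the ones listed in Details.

```r
# Hypothetical helper (not part of tm): split a SMART spec into its
# three one-letter components and check each against the built-in schemata.
parse_spec <- function(spec) {
  stopifnot(is.character(spec), length(spec) == 1L, nchar(spec) == 3L)
  parts <- strsplit(spec, "", fixed = TRUE)[[1]]
  names(parts) <- c("term_frequency", "document_frequency", "normalization")
  stopifnot(parts[["term_frequency"]] %in% c("n", "l", "a", "b", "L"),
            parts[["document_frequency"]] %in% c("n", "t", "p"),
            parts[["normalization"]] %in% c("n", "c", "u", "b"))
  parts
}

# "ntc": natural term frequency, idf, cosine normalization.
parse_spec("ntc")
```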

Details

Formally this function is of class WeightingFunction with the additional attributes name and acronym.

The first letter of spec specifies a weighting schema for term frequencies of m:

"n"

(natural) $\mathit{tf}_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. The input term-document matrix m is assumed to be in this standard term frequency format already.

"l"

(logarithm) is defined as $1 + \log_2(\mathit{tf}_{i,j})$.

"a"

(augmented) is defined as $0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}$.

"b"

(boolean) is defined as 1 if $\mathit{tf}_{i,j} > 0$ and 0 otherwise.

"L"

(log average) is defined as $\frac{1 + \log_2(\mathit{tf}_{i,j})}{1 + \log_2(\mathrm{ave}_{i \in j}(\mathit{tf}_{i,j}))}$.
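
The five term frequency schemata above can be sketched in base R on a small dense matrix. This is only an illustration of the formulas, not the tm implementation: the counts are made up, and it assumes that only terms occurring in a document are re-weighted and that the average in "L" is taken over the terms occurring in that document.

```r
# Toy term-document matrix (terms in rows, documents in columns).
# The counts are made up for illustration.
tf <- matrix(c(2, 0, 1,    # document d1
               4, 1, 0),   # document d2
             nrow = 3,
             dimnames = list(c("oil", "price", "barrel"), c("d1", "d2")))
nz <- tf > 0  # assumption: only terms that occur in a document are re-weighted

# "n" (natural): the counts as they are.
tf_n <- tf

# "l" (logarithm): 1 + log2(tf).
tf_l <- tf
tf_l[nz] <- 1 + log2(tf[nz])

# "a" (augmented): 0.5 + 0.5 * tf / (largest count in the same document).
doc_max <- apply(tf, 2, max)
tf_a <- tf
tf_a[nz] <- 0.5 + 0.5 * tf[nz] / doc_max[col(tf)[nz]]

# "b" (boolean): 1 where the term occurs, 0 otherwise.
tf_b <- nz * 1

# "L" (log average): (1 + log2(tf)) / (1 + log2(average count in the document)).
# Assumption: the average runs over the terms that occur in the document.
doc_ave <- apply(tf, 2, function(x) mean(x[x > 0]))
tf_L <- tf
tf_L[nz] <- (1 + log2(tf[nz])) / (1 + log2(doc_ave[col(tf)[nz]]))
```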

The second letter of spec specifies a weighting schema of document frequencies for m:

"n"

(no) is defined as 1.

"t"

(idf) is defined as $\log_2 \frac{N}{\mathit{df}_t}$, where $N$ is the number of documents and $\mathit{df}_t$ is the number of documents in which term $t$ occurs.

"p"

(prob idf) is defined as $\max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t}))$.
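
The document frequency schemata can be sketched the same way in base R. The toy counts are made up, and the sketch reads the max in "p" elementwise (one weight per term, via pmax), which is an assumption about the intended formula rather than a statement about the tm source.

```r
# Toy term-document matrix (terms in rows, documents in columns).
tf <- matrix(c(2, 0, 1, 0,    # document d1
               4, 1, 0, 0,    # document d2
               1, 1, 0, 3),   # document d3
             nrow = 4,
             dimnames = list(c("oil", "price", "barrel", "opec"),
                             c("d1", "d2", "d3")))

N  <- ncol(tf)          # number of documents
df <- rowSums(tf > 0)   # number of documents containing each term

# "n" (no): every term gets weight 1.
w_n <- rep(1, nrow(tf))

# "t" (idf): log2(N / df).
w_t <- log2(N / df)

# "p" (prob idf): max(0, log2((N - df) / df)), taken per term (assumption).
w_p <- pmax(0, log2((N - df) / df))
```

A term that occurs in every document ("oil" here) gets an idf of 0, and under "p" any term occurring in at least half of the documents is clipped to 0.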

The third letter of spec specifies a schema for normalization of m:

"n"

(none) is defined as 1.

"c"

(cosine) is defined as $\sqrt{\mathrm{col\_sums}(m ^ 2)}$.

"u"

(pivoted unique) is defined as $\mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}$, where both slope and pivot must be set via named tags in the control list.

"b"

(byte size) is defined as $\frac{1}{\mathit{CharLength}^\alpha}$. The parameter $\alpha$ must be set via the named tag alpha in the control list.

The final result is defined by multiplication of the chosen term frequency component with the chosen document frequency component with the chosen normalization component.
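
To make the combination concrete, here is a base R sketch of the spec "ntc" on made-up counts. It follows the textbook reading of cosine normalization, in which each document column is scaled to unit Euclidean length by dividing by $\sqrt{\mathrm{col\_sums}(m ^ 2)}$. That reading, and the order in which the components are applied, are assumptions; the tm implementation may differ in detail.

```r
# Toy term-document matrix (terms in rows, documents in columns).
tf <- matrix(c(2, 0, 1,    # document d1
               4, 1, 0,    # document d2
               1, 0, 3),   # document d3
             nrow = 3,
             dimnames = list(c("oil", "price", "barrel"),
                             c("d1", "d2", "d3")))

# First letter "n": natural term frequency, i.e. the counts themselves.
tf_part <- tf

# Second letter "t": idf, one weight per term (row).
idf <- log2(ncol(tf) / rowSums(tf > 0))

# Term frequency times document frequency; idf recycles down each column,
# so every row is multiplied by its own idf weight.
weighted <- tf_part * idf

# Third letter "c": Euclidean length of each document column.
doc_norm <- sqrt(colSums(weighted^2))

# Assumption: normalization scales each document by the reciprocal of its length.
ntc <- sweep(weighted, 2, doc_norm, "/")

round(ntc, 3)
```

Under this reading every document column of ntc has unit length, so columns can be compared with a plain dot product.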

Value

The weighted matrix.

References

Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, ISBN 0521865719.

Examples

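library("tm")
# Load the example corpus of Reuters crude-oil articles shipped with tm.
data("crude")

# Build a term-document matrix weighted with SMART spec "ntc":
# natural term frequency, idf document frequency, cosine normalization.
TermDocumentMatrix(crude,
                   control = list(removePunctuation = TRUE,
                                  stopwords = TRUE,
                                  weighting = function(x)
                                  weightSMART(x, spec = "ntc")))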
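
Schemata that need extra parameters take them through control. The following sketch applies pivoted unique normalization ("u") to an already built matrix; the slope and pivot values are arbitrary placeholders chosen for illustration, not recommended settings.

```r
library("tm")
data("crude")

# Plain term frequency matrix first, weighting applied in a second step.
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

# "ntu": natural term frequency, idf, pivoted unique normalization.
# slope and pivot are illustrative values only.
weighted <- weightSMART(tdm, spec = "ntu",
                        control = list(slope = 0.25, pivot = 10))

inspect(weighted[1:5, 1:5])
```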

[Package tm version 0.7-13 Index]