| weightSMART {tm} | R Documentation |
SMART Weightings
Description
Weight a term-document matrix according to a combination of weights specified in SMART notation.
Usage
weightSMART(m, spec = "nnn", control = list())
Arguments
m |
A term-document matrix in term frequency format. |
spec |
a character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for available built-in schemata. |
control |
a list of control parameters. See Details. |
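As a rough illustration of how a three-letter spec decomposes, the following base R sketch splits a spec string and looks up the schema each letter names. The helper describe_spec is hypothetical and not part of tm; the lookup tables simply restate the built-in schemata listed under Details.

```r
# Hypothetical helper (not part of tm): map each letter of a SMART spec
# to the schema it selects, using the tables from the Details section.
describe_spec <- function(spec) {
  stopifnot(is.character(spec), length(spec) == 1L, nchar(spec) == 3L)
  spec_letters <- strsplit(spec, "", fixed = TRUE)[[1]]
  schemata <- list(
    term_frequency     = c(n = "natural", l = "logarithm", a = "augmented",
                           b = "boolean", L = "log average"),
    document_frequency = c(n = "no", t = "idf", p = "prob idf"),
    normalization      = c(n = "none", c = "cosine", u = "pivoted unique",
                           b = "byte size")
  )
  out <- mapply(function(letter, table) {
    if (!letter %in% names(table)) stop("unknown schema letter: ", letter)
    table[[letter]]
  }, spec_letters, schemata)
  names(out) <- names(schemata)
  out
}

describe_spec("ntc")
```

The position of a letter matters: "b" means boolean in the first position but byte size in the third.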
Details
Formally this function is of class WeightingFunction with the
additional attributes name and acronym.
The first letter of spec specifies a weighting schema for term
frequencies of m:
- "n"
(natural) \mathit{tf}_{i,j} counts the number of occurrences n_{i,j} of a term t_i in a document d_j. The input term-document matrix m is assumed to be in this standard term frequency format already.
- "l"
(logarithm) is defined as 1 + \log_2(\mathit{tf}_{i,j}).
- "a"
(augmented) is defined as 0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}.
- "b"
(boolean) is defined as 1 if \mathit{tf}_{i,j} > 0 and 0 otherwise.
- "L"
(log average) is defined as \frac{1 + \log_2(\mathit{tf}_{i,j})}{1 + \log_2(\mathrm{ave}_{i \in j}(\mathit{tf}_{i,j}))}.
The second letter of spec specifies a weighting schema of
document frequencies for m:
- "n"
(no) is defined as 1.
- "t"
(idf) is defined as \log_2 \frac{N}{\mathit{df}_t}, where N is the number of documents and \mathit{df}_t denotes the number of documents in which term t occurs.
- "p"
(prob idf) is defined as \max(0, \log_2(\frac{N - \mathit{df}_t}{\mathit{df}_t})).
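The document frequency schemata can be sketched the same way. The counts below are made up, N is taken to be the number of columns (documents), df_t is taken to be the number of documents containing term t, and the maximum in "p" is applied per term; these readings are assumptions of the sketch rather than a statement about the package internals.

```r
# Toy term-document matrix (made-up counts) with three documents.
tf <- matrix(c(2, 0, 1,
               4, 1, 0,
               1, 0, 0),
             nrow = 3,
             dimnames = list(c("oil", "price", "barrel"),
                             c("d1", "d2", "d3")))

n_docs <- ncol(tf)            # N
df     <- rowSums(tf > 0)     # documents containing each term

idf      <- log2(n_docs / df)                       # "t"
prob_idf <- pmax(log2((n_docs - df) / df), 0)       # "p", per-term maximum

# Scale each term's row by its idf weight (the vector recycles down rows).
tf_idf <- tf * idf
```

A term that occurs in every document, such as "oil" here, gets an idf of 0 and therefore drops out of the weighted matrix.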
The third letter of spec specifies a schema for normalization
of m:
- "n"
(none) is defined as 1.
- "c"
(cosine) is defined as \sqrt{\mathrm{col\_sums}(m ^ 2)}.
- "u"
(pivoted unique) is defined as \mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}, where both slope and pivot must be set via named tags in the control list.
- "b"
(byte size) is defined as \frac{1}{\mathit{CharLength}^\alpha}. The parameter \alpha must be set via the named tag alpha in the control list.
The final result is the product of the chosen term frequency component, the chosen document frequency component, and the chosen normalization component.
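To make the combination concrete, here is a base R sketch of spec "ntc" on made-up counts. It assumes that the cosine factor is applied by dividing each document column by its Euclidean length, which is the conventional reading of cosine normalization; the text above gives only the length itself. The slope and pivot values at the end are hypothetical and serve only to show the pivoted formula.

```r
# Toy term-document matrix (made-up counts) with four documents.
tf <- matrix(c(3, 2, 0,
               1, 0, 0,
               0, 0, 2,
               0, 0, 5),
             nrow = 3,
             dimnames = list(c("oil", "price", "barrel"),
                             c("d1", "d2", "d3", "d4")))

# "n": natural term frequency, "t": idf.
n_docs   <- ncol(tf)
idf      <- log2(n_docs / rowSums(tf > 0))
weighted <- tf * idf

# "c": Euclidean length of each weighted document column.
doc_norm <- sqrt(colSums(weighted^2))

# Assumption: the normalization is applied by dividing by the length.
ntc <- sweep(weighted, 2, doc_norm, "/")

# "u": pivoted length with hypothetical slope and pivot values.
slope <- 0.25
pivot <- mean(doc_norm)
pivoted_norm <- slope * doc_norm + (1 - slope) * pivot
```

After cosine normalization every document column has unit length, so documents of different sizes become directly comparable.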
Value
The weighted matrix.
References
Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, ISBN 0521865719.
Examples
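data("crude")
## Build a term-document matrix weighted with spec "ntc":
## natural term frequency, idf, and cosine normalization.
TermDocumentMatrix(crude,
                   control = list(removePunctuation = TRUE,
                                  stopwords = TRUE,
                                  weighting = function(x)
                                      weightSMART(x, spec = "ntc")))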