termFrequencies {RcmdrPlugin.temis} | R Documentation |
Frequency of chosen terms in the corpus
Description
List terms with the highest number of occurrences in the document-term matrix of a corpus, possibly grouped by the levels of a variable.
Usage
termFrequencies(dtm, terms, variable = NULL, n = 25, by.term = FALSE)
Arguments
dtm |
a document-term matrix. |
terms |
one or more terms, i.e. column names of dtm. |
variable |
a vector whose length is the number of rows of dtm. |
n |
the number of terms to report for each level. |
by.term |
whether the third dimension of the array should be terms instead of levels. |
Details
The probability is that of observing a frequency of the considered term in the level at least as extreme as the one observed, under a hypergeometric distribution based on the term's global frequency in the corpus and on the number of occurrences of all terms in the document or variable level considered. The sign of the t value shows whether the association is positive or negative; the same can be seen by comparing the value of the “% Term/Level” column with that of the “Global %” column.
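The computation described above can be sketched in base R with phyper and qnorm. This is only an illustration under stated assumptions, not the code used by the package: the count matrix, the helper name term_level_test, and the choice to halve the probability before taking the normal quantile are all assumptions of this sketch.

```r
# Hypothetical toy counts: rows are levels, columns are terms.
counts <- matrix(c(8, 2, 10,
                   1, 9, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("tax", "school", "other")))

# Hypothetical helper, not part of RcmdrPlugin.temis.
term_level_test <- function(counts, term, level) {
  k       <- counts[level, term]    # occurrences of the term in the level
  n_level <- sum(counts[level, ])   # occurrences of all terms in the level
  n_term  <- sum(counts[, term])    # occurrences of the term in the corpus
  n_total <- sum(counts)            # occurrences of all terms in the corpus

  # Is the term relatively more frequent in the level than in the corpus?
  over <- k / n_level > n_term / n_total

  # Urn model: n_level occurrences are drawn from n_total, of which n_term
  # are the term of interest. Upper tail P(X >= k) if over-represented,
  # lower tail P(X <= k) otherwise.
  prob <- if (over) {
    phyper(k - 1, n_term, n_total - n_term, n_level, lower.tail = FALSE)
  } else {
    phyper(k, n_term, n_total - n_term, n_level)
  }

  # Assumption of this sketch: the probability is treated as two-sided
  # when it is mapped to a normal quantile; the sign carries the direction.
  sign_assoc <- if (over) 1 else -1
  t_value <- sign_assoc * abs(qnorm(prob / 2))

  list(prob = prob, t_value = t_value)
}

res_a <- term_level_test(counts, "tax", "A")  # "tax" is over-represented in A
res_b <- term_level_test(counts, "tax", "B")  # and under-represented in B
```

With two levels of equal size, the two probabilities coincide and the t values have opposite signs, which matches the reading of the sign given above.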
Value
If variable = NA, one matrix with columns “Global” and “Global %” (see below).
Else, an array with seven columns:
“% Term/Level” |
the term's occurrences as a percentage of all term occurrences in the level. |
“% Level/Term” |
the percentage of the term's occurrences that appear in the level (rather than in other levels). |
“Global %” |
the term's occurrences as a percentage of all term occurrences in the corpus. |
“Global” |
the number of occurrences of the term in the corpus. |
“Level” |
the number of occurrences of the term in the level (“internal”). |
“t value” |
the quantile of a normal distribution corresponding to the probability “Prob.”. |
“Prob.” |
the probability of observing such an extreme (high or low) number of occurrences of the term in the level, under a hypergeometric distribution. |
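As an illustration of how the count and percentage columns relate to each other, the following base R sketch derives them for one level from a small count matrix. The data and the object names (counts, level, tab) are hypothetical, and the code is not taken from the package.

```r
# Hypothetical toy counts: rows are levels, columns are terms.
counts <- matrix(c(8, 2, 10,
                   1, 9, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("tax", "school", "other")))

level <- "A"
level_counts  <- counts[level, ]   # occurrences of each term in the level
global_counts <- colSums(counts)   # occurrences of each term in the corpus

# One row per term, one column per statistic.
tab <- cbind(
  "% Term/Level" = 100 * level_counts / sum(level_counts),
  "% Level/Term" = 100 * level_counts / global_counts,
  "Global %"     = 100 * global_counts / sum(counts),
  "Global"       = global_counts,
  "Level"        = level_counts
)

print(round(tab, 1))
```

Here “% Term/Level” for "tax" is larger than its “Global %”, so the term is over-represented in level "A".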
Author(s)
Milan Bouchet-Valat
See Also
specificTerms, DocumentTermMatrix