characteristic_docs {R.temis} | R Documentation
characteristic_docs
Description
Print documents which are the most characteristic of each level of a variable, i.e. those with the lowest Chi-squared distance to the average vocabulary of documents belonging to that level.
Usage
characteristic_docs(corpus, dtm, variable, ndocs = 10, nterms = 25, p = 0.1)
Arguments
corpus |
A Corpus object. |
dtm |
A DocumentTermMatrix object. |
variable |
A vector of values giving the groups for which the most characteristic documents should be reported. |
ndocs |
The number of (most characteristic) documents to print. |
nterms |
The number of terms to highlight in documents. |
p |
The maximum p-value up to which specific terms should be highlighted. |
Details
Occurrences of the nterms most specific terms for each level are highlighted.
If stemming or other transformations have been applied to original words using combine_terms, all original words which have been transformed to the specified terms are highlighted.
Value
A list with one Corpus object for each level (invisibly).
Examples
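# Import the sample Factiva file shipped with the tm.plugin.factiva package
file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
# Build the document-term matrix for the corpus
dtm <- build_dtm(corpus)
# Print the most characteristic documents for each value of the Date metadata
characteristic_docs(corpus, dtm, meta(corpus)$Date)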
# Also works when terms have been combined
dict <- dictionary(dtm)
dtm2 <- combine_terms(dtm, dict)
characteristic_docs(corpus, dtm2, meta(corpus)$Date)