R: characteristic

characteristic_docs {R.temis}

R Documentation

characteristic_docs

Description

Print the documents that are most characteristic of each level of a variable, i.e. those with the smallest chi-squared distance to the average vocabulary of the documents belonging to that level.
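To make the distance concrete, here is a minimal base-R sketch on made-up counts. It assumes the usual correspondence-analysis form of the chi-squared distance between a document's term profile and the level's pooled profile, weighted by overall term proportions. The helper `chisq_dist_to_level` and the toy data are hypothetical and are not part of R.temis, whose exact computation may differ.

```r
# Toy document-term counts: 4 documents x 3 terms (made-up data).
counts <- matrix(c(4, 1, 0,
                   2, 2, 0,
                   0, 1, 4,
                   1, 0, 3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("oil", "price", "bank")))
group <- c("a", "a", "b", "b")

# Hypothetical helper: chi-squared distance of every document
# to the average vocabulary of the documents in `level`.
chisq_dist_to_level <- function(counts, group, level) {
  profiles <- counts / rowSums(counts)           # per-document term proportions
  in_level <- group == level
  level_profile <- colSums(counts[in_level, , drop = FALSE]) /
    sum(counts[in_level, ])                      # pooled profile of the level
  col_weights <- colSums(counts) / sum(counts)   # overall term proportions
  dev <- sweep(profiles, 2, level_profile, "-")  # deviation from level profile
  rowSums(sweep(dev^2, 2, col_weights, "/"))     # weighted squared deviations
}

d <- chisq_dist_to_level(counts, group, "a")
names(sort(d))  # documents ordered from most to least characteristic of "a"
```

Under these assumptions, the documents listed first are the ones a call with a small `ndocs` would report for that level.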

Usage

characteristic_docs(corpus, dtm, variable, ndocs = 10, nterms = 25, p = 0.1)

Arguments

`corpus`	A `Corpus` object.
`dtm`	A `DocumentTermMatrix` object corresponding to `corpus`.
`variable`	A vector of values giving the groups (levels) for which the most characteristic documents should be reported.
`ndocs`	The number of (most characteristic) documents to print.
`nterms`	The number of terms to highlight in documents.
`p`	The maximum p-value up to which specific terms should be highlighted.
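One plausible reading of how `nterms` and `p` interact is sketched below in base R: terms are first filtered by p-value, then at most `nterms` of the most significant ones are kept. This is an assumption about the semantics, not the package's confirmed behaviour; `select_terms` and the p-values are made up for illustration.

```r
# Made-up specificity results for one level: term and p-value.
spec <- data.frame(term = c("oil", "opec", "barrel", "bank", "said"),
                   p.value = c(0.001, 0.02, 0.08, 0.30, 0.75),
                   stringsAsFactors = FALSE)

# Hypothetical helper (not an R.temis function): keep terms with
# p-value <= p, then return at most `nterms` of them, most significant first.
select_terms <- function(spec, nterms = 25, p = 0.1) {
  keep <- spec[spec$p.value <= p, , drop = FALSE]
  keep <- keep[order(keep$p.value), , drop = FALSE]
  head(keep$term, nterms)
}

select_terms(spec, nterms = 2, p = 0.1)   # "oil" "opec"
select_terms(spec, nterms = 25, p = 0.1)  # "oil" "opec" "barrel"
```

With the defaults, `p` is usually the binding constraint on short corpora, while `nterms` caps the output when many terms pass the threshold.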

Details

Occurrences of the `nterms` most specific terms for each level are highlighted. If stemming or other transformations have been applied to the original words using `combine_terms`, all original words that were transformed into these specific terms are highlighted as well.
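The mapping from combined terms back to original words can be illustrated with a small base-R sketch. The word-to-term table, the `highlight` helper and the `*...*` markers are all hypothetical; the package's own highlighting mechanism and output format are not shown here.

```r
# Hypothetical mapping from original words to their combined term (e.g. a stem).
word_to_term <- c(prices = "price", priced = "price", price = "price",
                  banks = "bank", bank = "bank", oil = "oil")

# Assume "price" was selected as a specific term for some level.
specific <- c("price")

# All original words that were merged into one of the specific terms.
words <- names(word_to_term)[word_to_term %in% specific]

# Hypothetical helper: wrap whole-word matches in asterisks.
highlight <- function(text, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "*\\1*", text, ignore.case = TRUE, perl = TRUE)
}

txt <- "Oil prices rose as banks priced in the risk."
highlight(txt, words)
# "Oil *prices* rose as banks *priced* in the risk."
```

Both "prices" and "priced" are marked even though only the combined term "price" was selected, which mirrors the behaviour described above.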

Value

A list with one `Corpus` object for each level, returned invisibly.

Examples


file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
dtm <- build_dtm(corpus)
characteristic_docs(corpus, dtm, meta(corpus)$Date)

# Also works when terms have been combined
dict <- dictionary(dtm)
dtm2 <- combine_terms(dtm, dict)
characteristic_docs(corpus, dtm2, meta(corpus)$Date)

[Package R.temis version 0.1.3 Index]