lexical_summary {R.temis}    R Documentation

lexical_summary

Description

Build a lexical summary table, computed per document or, optionally, aggregated over the categories of a variable.

Usage

lexical_summary(dtm, corpus, variable = NULL, unit = c("document", "global"))

Arguments

dtm

A DocumentTermMatrix containing the terms to summarize, which may have been stemmed.

corpus

A Corpus object containing the original texts from which dtm was constructed.

variable

An optional vector with one element per document indicating the category to which that document belongs. If NULL, per-document measures are returned.

unit

When variable is not NULL, defines how per-document measures are aggregated into per-category measures (see Details).

Details

Words are defined as the forms of two or more characters present in the texts before stemming and stopword removal. Unique terms, by contrast, are extracted from dtm: they do not include words that were removed from it, and words that differ in the original text may collapse into the same term if stemming was performed. Note that percentages for terms are computed relative to the total number of terms, and percentages for words relative to the total number of words, so the denominators are not the same for all measures.
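The difference between words and terms can be sketched in base R. This is a toy illustration only: the regular expression, the two-word stopword list and the one-line "stemmer" are assumptions made for the example, not the tokenizer, stopword list or stemmer that R.temis actually uses.

```r
# Toy illustration only: NOT the R.temis implementation.
texts <- c("The cats are chasing the cat", "A dog chased cats")
lower <- tolower(texts)

# "Words": forms of two or more characters, before stemming and stopword removal.
words <- regmatches(lower, gregexpr("[[:alpha:]]{2,}", lower))

# "Terms": what is left after (hypothetical) stopword removal and stemming.
toy_stopwords <- c("the", "are")
toy_stem <- function(w) sub("s$", "", w)   # crude stand-in for a real stemmer
terms <- lapply(words, function(w) toy_stem(w[!w %in% toy_stopwords]))

n_words        <- lengths(words)                  # 6 3
n_unique_words <- lengths(lapply(words, unique))  # 5 3
n_terms        <- lengths(terms)                  # 3 3
n_unique_terms <- lengths(lapply(terms, unique))  # 2 3

# Each percentage uses its own denominator: words for words, terms for terms.
pct_unique_words <- 100 * n_unique_words / n_words
pct_unique_terms <- 100 * n_unique_terms / n_terms
```

In the first text, "cats" and "cat" are two distinct words but a single term once the toy stemmer is applied, which is why the word-based and term-based percentages are not directly comparable.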

When variable is not NULL, unit defines two different ways of aggregating per-document statistics into per-category measures:

This distinction does not make sense when variable=NULL: in this case, "level" in the above explanation corresponds to "document", and two columns are provided about the whole corpus.
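A minimal sketch of how the two values of unit might be compared on the same data. It uses only the signature shown under Usage and assumes that R.temis and tm.plugin.factiva are installed. The grouping vector grp is hypothetical and alternates two arbitrary labels purely so that variable has one element per document; in practice it would come from document metadata.

```r
library(R.temis)

# Sample corpus shipped with tm.plugin.factiva (assumed to be installed).
file <- system.file("texts", "reut21578-factiva.xml", package = "tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language = "en")
dtm <- build_dtm(corpus)

# Hypothetical grouping: one label per document (row) of the dtm.
grp <- rep(c("A", "B"), length.out = nrow(dtm))

# Same data, two aggregation modes.
by_document <- lexical_summary(dtm, corpus, variable = grp, unit = "document")
by_global   <- lexical_summary(dtm, corpus, variable = grp, unit = "global")

by_document
by_global
```

Printing the two tables side by side is a simple way to see which measures depend on the choice of unit.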

Value

A table object with the following information for each document or each category of documents in the corpus:

Examples


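library(R.temis)
# Load a sample Factiva corpus shipped with the tm.plugin.factiva package
file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
# Build the document-term matrix, then compute per-document measures
dtm <- build_dtm(corpus)
lexical_summary(dtm, corpus)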


[Package R.temis version 0.1.3 Index]