R: Compute Summary Statistics of a Corpus

word.counts {lda}

R Documentation

Compute Summary Statistics of a Corpus

Description

These functions compute summary statistics of a corpus. word.counts computes the word counts for a set of documents, while document.lengths computes the length of each document in a corpus.

Usage

word.counts(docs, vocab = NULL)

document.lengths(docs)

Arguments

`docs`	A list of matrices specifying the corpus. See `lda.collapsed.gibbs.sampler` for details on the format of this variable.
`vocab`	An optional character vector specifying the levels (i.e., labels) of the vocabulary words. If unspecified (or `NULL`), the levels will be automatically inferred from the corpus.
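
As an illustration of the expected input, the following minimal sketch builds a toy corpus in the format described in `lda.collapsed.gibbs.sampler`: each document is a two-row integer matrix whose first row holds 0-indexed word identifiers and whose second row holds the corresponding counts. The names toy.vocab and toy.docs and their contents are hypothetical and are not part of the package.

```r
## Hypothetical three-word vocabulary (illustrative only, not from the cora data).
toy.vocab <- c("apple", "banana", "cherry")

## Each document: row 1 = 0-indexed word ids, row 2 = counts of those words.
toy.docs <- list(
  rbind(c(0L, 2L), c(3L, 1L)),  # "apple" x 3, "cherry" x 1
  rbind(c(1L, 0L), c(2L, 1L))   # "banana" x 2, "apple" x 1
)
```

A list built this way could then be passed as docs, with toy.vocab supplied as vocab.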

Value

word.counts returns an object of class ‘table’ containing the number of times each word appears in the input corpus. If vocab is specified, then the levels of the table will be set to vocab. Otherwise, the levels are automatically inferred from the corpus (typically integers 0:(V-1), where V indicates the number of unique words in the corpus).

document.lengths returns an integer vector of length length(docs), each entry of which corresponds to the length (sum of the counts of all features) of each document in the corpus.
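
The quantities described above can be reproduced by hand in base R. The sketch below is not the package's implementation; it only illustrates what is being counted, using a hypothetical toy corpus (toy.docs, toy.vocab) in the two-row matrix format, and it returns a plain named array rather than an object of class ‘table’.

```r
## Hypothetical toy corpus: row 1 = 0-indexed word ids, row 2 = counts.
toy.vocab <- c("apple", "banana", "cherry")
toy.docs <- list(
  rbind(c(0L, 2L), c(3L, 1L)),  # "apple" x 3, "cherry" x 1
  rbind(c(1L, 0L), c(2L, 1L))   # "banana" x 2, "apple" x 1
)

## Pool word ids and their counts across all documents.
words  <- unlist(lapply(toy.docs, function(d) d[1, ]))
counts <- unlist(lapply(toy.docs, function(d) d[2, ]))

## Corpus-wide total per word id (names are "0", "1", "2").
toy.word.totals <- tapply(counts, words, sum)

## Relabel the totals with the vocabulary: id k maps to toy.vocab[k + 1].
names(toy.word.totals) <- toy.vocab[as.integer(names(toy.word.totals)) + 1]
## apple = 4, banana = 2, cherry = 1

## Length of each document: the sum of its counts row.
toy.doc.lengths <- vapply(toy.docs, function(d) sum(d[2, ]), integer(1))
## 4 3
```

Note that the lengths always sum to the same total as the word counts, since both aggregate the second row of every document.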

Author(s)

Jonathan Chang (slycoder@gmail.com)

Examples

## Load the cora dataset.
data(cora.vocab)
data(cora.documents)

## Compute word counts using raw feature indices.
wc <- word.counts(cora.documents)
head(wc)
##   0   1   2   3   4   5 
## 136 876  14 111  19  29 

## Recompute them using the levels defined by the vocab file.
wc <- word.counts(cora.documents, cora.vocab)
head(wc)
##   computer  algorithms discovering    patterns      groups     protein 
##        136         876          14         111          19          29 

head(document.lengths(cora.documents))
## [1] 64 39 76 84 52 24

[Package lda version 1.5.2 Index]