word.counts {lda} | R Documentation |
Compute Summary Statistics of a Corpus
Description
These functions compute summary statistics of a corpus. word.counts computes the word counts for a set of documents, while document.lengths computes the length of each document in a corpus.
Usage
word.counts(docs, vocab = NULL)
document.lengths(docs)
Arguments
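docs | A list of matrices specifying the corpus. See lda.collapsed.gibbs.sampler for the input format.
vocab | An optional character vector specifying the levels (i.e., labels) of the vocabulary words. If unspecified (or NULL), the levels are inferred automatically from the corpus.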
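As a rough illustration of the docs argument, the sketch below builds a two-document corpus by hand. It assumes the format documented for lda.collapsed.gibbs.sampler in the lda package: each document is a two-row integer matrix whose first row holds 0-indexed word identifiers and whose second row holds the matching counts. The toy vocabulary and all variable names are made up for illustration.

```r
# Hypothetical toy vocabulary: word id 0 refers to vocab[1], id 1 to vocab[2], etc.
vocab <- c("topic", "model", "corpus")

# Each document: row 1 = 0-indexed word ids, row 2 = counts (assumed lda format).
docs <- list(
  rbind(c(0L, 2L), c(2L, 1L)),   # "topic" twice, "corpus" once
  rbind(c(1L, 2L), c(1L, 3L))    # "model" once, "corpus" three times
)

# Map the word ids of the first document back to their labels.
# The + 1L shifts from 0-indexed ids to R's 1-indexed vectors.
labels_doc1 <- vocab[docs[[1]][1, ] + 1L]
print(labels_doc1)
```

A list built this way has the shape that word.counts and document.lengths expect under the stated assumption, with vocab passed as the optional second argument.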
Value
word.counts returns an object of class ‘table’ containing the number of times each word appears in the input corpus. If vocab is specified, then the levels of the table will be set to vocab. Otherwise, the levels are automatically inferred from the corpus (typically integers 0:(V-1), where V indicates the number of unique words in the corpus).
document.lengths returns an integer vector of length length(docs), each entry of which gives the length (sum of the counts of all features) of the corresponding document in the corpus.
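The two return values can be approximated in base R, which may help clarify what is being counted. This is only a sketch and not the package's implementation. It assumes each document is a two-row integer matrix (0-indexed word ids in row 1, counts in row 2), and the names toy_docs, toy_vocab, doc_len and wc are hypothetical.

```r
# Hypothetical toy corpus in the assumed two-row matrix format.
toy_vocab <- c("topic", "model", "corpus")
toy_docs <- list(
  rbind(c(0L, 2L), c(2L, 1L)),
  rbind(c(1L, 2L), c(1L, 3L))
)

# Document lengths: sum the counts row of each document.
doc_len <- vapply(toy_docs, function(d) sum(d[2, ]), integer(1))

# Word counts: pool ids and counts across documents, then total per word id.
ids    <- unlist(lapply(toy_docs, function(d) d[1, ]))
counts <- unlist(lapply(toy_docs, function(d) d[2, ]))
word_levels <- factor(ids,
                      levels = 0:(length(toy_vocab) - 1L),
                      labels = toy_vocab)
# Note: a vocabulary word that never occurs would come out as NA here.
wc <- tapply(counts, word_levels, sum)

print(doc_len)
print(wc)
```

Because both quantities are sums over the same counts, sum(wc) equals sum(doc_len) in this sketch, which makes a quick sanity check for a corpus.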
Author(s)
Jonathan Chang (slycoder@gmail.com)
See Also
lda.collapsed.gibbs.sampler for the input format of these functions.
read.documents and lexicalize for ways of generating the input to these functions.
concatenate.documents for operations on a corpus.
Examples
## Load the cora dataset.
data(cora.vocab)
data(cora.documents)
## Compute word counts using raw feature indices.
wc <- word.counts(cora.documents)
head(wc)
##   0   1   2   3   4   5
## 136 876  14 111  19  29
## Recompute them using the levels defined by the vocab file.
wc <- word.counts(cora.documents, cora.vocab)
head(wc)
##    computer  algorithms discovering    patterns      groups     protein
##         136         876          14         111          19          29
head(document.lengths(cora.documents))
## [1] 64 39 76 84 52 24