lexicalize {lda}    R Documentation
Generate LDA Documents from Raw Text
Description
This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.
Usage
lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)
Arguments
doclines
A character vector of document lines to be used to construct a corpus. See Details for a description of the format of these lines.

sep
Separator string used to tokenize the input strings (default ‘ ’).

lower
Logical indicating whether or not to convert all tokens to lowercase (default ‘TRUE’).

count
An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, ‘1’, is appropriate in most cases).

vocab
If left unspecified (or NULL), the vocabulary is constructed from the set of unique tokens appearing in doclines. Otherwise, a character vector of acceptable tokens; tokens not appearing in vocab are filtered from the documents.
Details
This function first tokenizes a character vector by splitting each entry of the vector by sep (note that this is currently a fixed separator, not a regular expression). If lower is ‘TRUE’, the tokens are then converted to lowercase.

At this point, if vocab is NULL, a vocabulary is constructed from the set of unique tokens appearing across all character vectors. Otherwise, the tokens derived from the character vectors are filtered so that only those appearing in vocab are retained.

Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
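The following minimal sketch (not the package implementation) illustrates these steps, assuming the two-row integer matrix format shown in the Examples: row one holds zero-based vocabulary indices and row two holds the corresponding counts. The helper name manual.lexicalize is hypothetical, and unlike lexicalize it aggregates repeated tokens and orders columns by vocabulary index rather than by occurrence.

manual.lexicalize <- function(doclines, sep = " ", lower = TRUE, vocab = NULL) {
  tokens <- strsplit(doclines, sep, fixed = TRUE)      # fixed separator, not a regex
  if (lower) tokens <- lapply(tokens, tolower)         # optional case folding
  if (is.null(vocab)) vocab <- unique(unlist(tokens))  # infer vocabulary if not supplied
  documents <- lapply(tokens, function(tok) {
    tok <- tok[tok %in% vocab]                         # drop tokens outside the vocabulary
    tab <- table(factor(tok, levels = vocab))
    tab <- tab[tab > 0]
    rbind(as.integer(match(names(tab), vocab)) - 1L,   # row 1: zero-based vocabulary indices
          as.integer(tab))                             # row 2: per-token counts
  })
  list(documents = documents, vocab = vocab)
}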
Value
If vocab is unspecified or NULL, a list with two components:

documents
A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab
A character vector of the unique tokens occurring in the corpus.

If vocab is specified, only the documents component is returned (see the Examples).
Note
Because of the limited tokenization and filtering capabilities of this function, it may not be useful in many cases. This may be resolved in a future release.
Author(s)
Jonathan Chang (slycoder@gmail.com)
See Also
lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.
Examples
## Generate an example.
example <- c("I am the very model of a modern major general",
             "I have a major headache")
corpus <- lexicalize(example, lower=TRUE)
## corpus$documents:
## $documents[[1]]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 1 2 3 4 5 6 7 8 9
## [2,] 1 1 1 1 1 1 1 1 1 1
##
## $documents[[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 10 6 8 11
## [2,] 1 1 1 1 1
## corpus$vocab:
## $vocab
## [1] "i" "am" "the" "very" "model" "of"
## [7] "a" "modern" "major" "general" "have" "headache"
## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]
## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)
## documents:
## [[1]]
## [,1] [,2] [,3]
## [1,] 0 1 2
## [2,] 1 1 1
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 0 1 2
## [2,] 1 1 1
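
## As a further, illustrative step, the lexicalized corpus can be passed to
## the collapsed Gibbs sampler; the topic count, iteration count and
## hyperparameter values below are arbitrary choices for this tiny corpus,
## not recommended settings.
result <- lda.collapsed.gibbs.sampler(corpus$documents,
                                      K = 2,
                                      vocab = corpus$vocab,
                                      num.iterations = 25,
                                      alpha = 0.1,
                                      eta = 0.1)
## Inspect the top words per topic.
top.topic.words(result$topics, 5, by.score = TRUE)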