lexicalize {lda}                                                R Documentation

Generate LDA Documents from Raw Text

Description

This function reads raw text in doclines format and returns a corpus and vocabulary suitable for the inference procedures defined in the lda package.

Usage

lexicalize(doclines, sep = " ", lower = TRUE, count = 1L, vocab = NULL)

Arguments

doclines

A character vector of document lines to be used to construct a corpus. See details for a description of the format of these lines.

sep

Separator string used to tokenize the input strings (default ‘ ’, a single space).

lower

Logical indicating whether or not to convert all tokens to lowercase (default ‘TRUE’).

count

An integer scaling factor to be applied to feature counts. A single observation of a feature will be rendered as count observations in the return value (the default value, ‘1’, is appropriate in most cases). A short illustration follows the argument descriptions.

vocab

If left unspecified (or NULL), the vocabulary for the corpus will be automatically inferred from the observed tokens. Otherwise, this parameter should be a character vector specifying acceptable tokens. Tokens not appearing in this list will be filtered from the documents.
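
To illustrate count, the following sketch (which relies only on the call shown under Usage) triples every tabulated count:

example.count <- lexicalize("a a b", count = 3L)
## example.count$documents[[1]] should report counts of 6 and 3: each single
## observation of "a" and "b" is rendered as three observations.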

Details

This function first tokenizes a character vector by splitting each entry of the vector on sep (note that sep is currently treated as a fixed separator, not a regular expression). If lower is ‘TRUE’, all tokens are then converted to lowercase.

At this point, if vocab is NULL, then a vocabulary is constructed from the set of unique tokens appearing across all character vectors. Otherwise, the tokens derived from the character vectors are filtered so that only those appearing in vocab are retained.

Finally, token instances within each document (i.e., original character string) are tabulated in the format described in lda.collapsed.gibbs.sampler.
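
For intuition, these steps correspond roughly to the base-R sketch below; it illustrates the behaviour described above rather than the package's actual implementation, and the column order of the resulting matrices may differ.

docs   <- c("I am the very model", "I have a major headache")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)  # split on a fixed separator
vocab  <- unique(unlist(tokens))                      # vocabulary inferred from tokens
docmats <- lapply(tokens, function(x) {
  tab <- table(match(x, vocab) - 1L)                  # tabulate 0-based vocabulary indices
  rbind(as.integer(names(tab)), as.integer(tab))      # row 1: word index, row 2: count
})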

Value

If vocab is unspecified or NULL, a list with two components:

documents

A list of document matrices in the format described in lda.collapsed.gibbs.sampler.

vocab

A character vector of unique tokens occurring in the corpus.

When vocab is supplied, only the list of document matrices is returned (see the Examples).

Note

Because of the limited tokenization and filtering capabilities of this function, it may not be useful in many cases. This may be resolved in a future release.
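
A common workaround is to clean the text before calling lexicalize, e.g. stripping punctuation and collapsing whitespace (an illustrative sketch, not part of the package):

raw   <- c("A major, modern general!", "I have a major headache.")
clean <- gsub("[[:punct:]]+", "", raw)      # drop punctuation
clean <- gsub("[[:space:]]+", " ", clean)   # collapse runs of whitespace
corpus <- lexicalize(clean, lower = TRUE)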

Author(s)

Jonathan Chang (slycoder@gmail.com)

See Also

lda.collapsed.gibbs.sampler for the format of the return value.

read.documents to generate the same output from a file encoded in LDA-C format.

word.counts to compute statistics associated with a corpus.

concatenate.documents for operations on a collection of documents.

Examples

## Generate an example.
example <- c("I am the very model of a modern major general",
             "I have a major headache")

corpus <- lexicalize(example, lower=TRUE)

## corpus$documents:
## $documents[[1]]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    1    2    3    4    5    6    7    8     9
## [2,]    1    1    1    1    1    1    1    1    1     1
## 
## $documents[[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0   10    6    8   11
## [2,]    1    1    1    1    1

## corpus$vocab:
## $vocab
## [1] "i"        "am"       "the"      "very"     "model"    "of"      
## [7] "a"        "modern"   "major"    "general"  "have"     "headache"

## Only keep words that appear at least twice:
to.keep <- corpus$vocab[word.counts(corpus$documents, corpus$vocab) >= 2]

## Re-lexicalize, using this subsetted vocabulary
documents <- lexicalize(example, lower=TRUE, vocab=to.keep)

## documents:
## [[1]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    0    1    2
## [2,]    1    1    1
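
## A corpus in this form can be handed to the sampler. The call below is only
## a sketch; it assumes the usual argument order of lda.collapsed.gibbs.sampler
## (documents, K, vocab, num.iterations, alpha, eta).
## Not run:
result <- lda.collapsed.gibbs.sampler(corpus$documents, K = 2, corpus$vocab,
                                      num.iterations = 25, alpha = 0.1, eta = 0.1)
## End(Not run)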
