build.corpus {textreg}    R Documentation

Build a corpus that can be used in the textreg call.

Description

Pre-building a corpus lets you call textreg multiple times without repeating the initial data processing (e.g., if you want to explore different ban lists or regularization parameters).

Usage

build.corpus(corpus, labeling, banned = NULL, verbosity = 1,
  token.type = "word")

Arguments

corpus

A list of strings or a corpus from the tm package.

labeling

A vector of +1/-1 or TRUE/FALSE values indicating which documents are considered relevant and which are baseline. A +1/-1 vector may also contain 0, which means the corresponding document is dropped.

banned

List of words that should be dropped from consideration.

verbosity

Level of output. 0 is no printed output.

token.type

Type of token to use: "word" or "character".

Details

See the bathtub vignette for a more complete discussion of this method and the options you might pass to it.

A textreg.corpus object is not a tm-style corpus. In particular, all text pre-processing should be done to the data before building the textreg.corpus object.

Value

A textreg.corpus object.

Note

Unfortunately, separating the textreg call from the build.corpus call is not quite as clean as one would hope. The build.corpus call moves the text into C++ memory, but because of the way the search tree is built for the regression, it is hard to salvage the tree across runs, so pre-building is of limited use. In particular, the labeling and banned words cannot easily be changed. Future versions of the package would ideally remedy this.

Examples

data( testCorpora )
textreg( testCorpora$testI$corpus, testCorpora$testI$labelI, c(), C=1, verbosity=1 )
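
The example above passes the raw text straight to textreg. A minimal sketch of the pre-building workflow from the Description might look like the following. It assumes that textreg accepts the textreg.corpus object in place of the raw corpus and then takes the labeling and banned words from that object; the values of C are arbitrary and chosen only for illustration.

```r
library(textreg)

data(testCorpora)

# Build the corpus once. The labeling and banned words are fixed at this point.
corp <- build.corpus(testCorpora$testI$corpus, testCorpora$testI$labelI,
                     banned = NULL, verbosity = 0)

# Assumption: textreg() can be called on the pre-built corpus, so only the
# regularization parameter C changes between runs.
for (C in c(1, 2, 4)) {
  res <- textreg(corp, C = C, verbosity = 0)
  print(res)
}
```

Because the labeling and banned words are stored with the pre-built corpus (see Note), changing either of them means calling build.corpus again.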

[Package textreg version 0.1.5 Index]