build.corpus {textreg} | R Documentation |
Build a corpus that can be used in the textreg call.
Description
Pre-building a corpus allows multiple textreg calls to be made without repeating the initial data processing (e.g., if you want to explore different ban lists or regularization parameters).
Usage
build.corpus(corpus, labeling, banned = NULL, verbosity = 1,
token.type = "word")
Arguments
corpus |
A list of strings or a corpus from the tm package. |
labeling |
A vector of +1/-1 or TRUE/FALSE indicating which documents are considered relevant and which are baseline. A +1/-1 vector can also contain 0, which means the document is dropped. |
banned |
List of words that should be dropped from consideration. |
verbosity |
Level of output. 0 is no printed output. |
token.type |
"word" or "character" as tokens. |
Details
See the bathtub vignette for more complete discussion of this method and the options you might pass to it.
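As an illustration of the labeling argument, the following base R sketch builds both accepted forms of the vector. The `topic` metadata and its values are hypothetical and are not part of the textreg package; only the +1/-1/0 and TRUE/FALSE conventions come from the argument description.

```r
# Hypothetical per-document metadata (an assumption for illustration only).
topic <- c("sports", "politics", "sports", "unknown", "politics")

# +1 marks relevant documents, -1 marks baseline documents,
# and 0 marks documents that should be dropped.
labeling <- ifelse(topic == "sports", 1,
                   ifelse(topic == "politics", -1, 0))

# The logical form is the other accepted encoding; it has no "drop" value,
# so every document that is not TRUE is treated as baseline.
labeling_logical <- topic == "sports"

print(labeling)
print(table(labeling))
```

The numeric form is the one to use when some documents should be excluded without editing the corpus itself.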
A textreg.corpus object is not a tm-style corpus. In particular, all text pre-processing should be done to the data before building the textreg.corpus object.
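Because pre-processing has to happen before the corpus is built, the cleaning step might look like the base R sketch below. The sample documents and the particular cleaning steps (lower-casing, stripping non-letters, collapsing whitespace) are assumptions chosen for illustration; which steps are appropriate depends on the data.

```r
# Hypothetical raw documents (not from the package's data sets).
raw_docs <- c("The Bathtub was SLIPPERY!",
              "Workers fell; no guard-rails.")

# Lower-case everything so that tokens match regardless of case.
clean_docs <- tolower(raw_docs)

# Replace every run of characters other than a-z and space with one space.
clean_docs <- gsub("[^a-z ]+", " ", clean_docs)

# Collapse repeated spaces and trim leading/trailing whitespace.
clean_docs <- trimws(gsub(" +", " ", clean_docs))

print(clean_docs)
```

The cleaned character vector, rather than the raw text, would then be what is handed to build.corpus.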
Value
A textreg.corpus object.
Note
Unfortunately, separating the textreg call from the build.corpus call is not quite as clean as one would hope. The build.corpus call moves the text into C++ memory, but because of the way the search tree is built for the regression, it is hard to salvage across runs, so pre-building is of limited use. In particular, the labeling and banned words cannot be easily changed. Future versions of the package would ideally remedy this.
Examples
data( testCorpora )
textreg( testCorpora$testI$corpus, testCorpora$testI$labelI, c(), C=1, verbosity=1 )
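The example above calls textreg directly. A minimal sketch of the build.corpus call itself, using only the arguments listed under Usage and the same test data, might look as follows. The `requireNamespace` guard and the variable name `corp` are additions for illustration, and the printed class is not asserted here.

```r
# Stays NULL if the textreg package is not installed.
corp <- NULL

if (requireNamespace("textreg", quietly = TRUE)) {
  library(textreg)
  data(testCorpora)

  # Same corpus and labeling as the textreg() example; no banned words,
  # no printed output, and word-level tokens.
  corp <- build.corpus(testCorpora$testI$corpus,
                       testCorpora$testI$labelI,
                       banned = NULL,
                       verbosity = 0,
                       token.type = "word")

  # Inspect what was returned.
  print(class(corp))
}
```

Per the description, the resulting object is intended to stand in for the raw text in later textreg calls, subject to the limitations described in the Note.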