textreg {textreg} | R Documentation |
Sparse regression of labeling vector onto all phrases in a corpus.
Description
Given a labeling and a corpus, find phrases that predict this labeling. This function calls a C++ function that builds a tree of phrases and searches it using greedy coordinate descent to solve the optimization problem of the associated sparse regression.
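To give a feel for what "greedy coordinate descent" on a sparse regression means, here is a minimal base R sketch. It is not the package's C++ routine: it swaps in a plain squared-error loss with an L1 penalty, works on a small dense matrix instead of a phrase tree, and the names greedy_cd and soft_threshold are hypothetical. It assumes every column of X is non-zero.

```r
# Soft-thresholding operator used by L1-penalized coordinate updates.
soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

# Illustrative greedy coordinate descent for
#   0.5 * sum((y - X %*% beta)^2) + C * sum(abs(beta))
# (squared loss is an assumption made for simplicity).
greedy_cd <- function(X, y, C = 1, max_iter = 40, tol = 1e-4) {
  beta <- rep(0, ncol(X))
  col_sq <- colSums(X^2)            # assumed non-zero for every column
  for (iter in seq_len(max_iter)) {
    r <- as.vector(y - X %*% beta)  # current residual
    # Optimal value of each coordinate with all others held fixed.
    z <- as.vector(crossprod(X, r)) + col_sq * beta
    candidate <- soft_threshold(z, C) / col_sq
    # Greedy rule: update only the coordinate that would move the most.
    change <- abs(candidate - beta)
    j <- which.max(change)
    if (change[j] < tol) break
    beta[j] <- candidate[j]
  }
  beta
}

X <- diag(3)
y <- c(3, 0.5, -2)
beta_hat <- greedy_cd(X, y, C = 1)
```

In this toy problem the middle coefficient stays at exactly zero, which is the sense in which the L1 penalty yields a sparse fit.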
Usage
textreg(corpus, labeling, banned = NULL, objective.function = 2,
C = 1, a = 1, maxIter = 40, verbosity = 1,
step.verbosity = verbosity, positive.only = FALSE,
binary.features = FALSE, no.regularization = FALSE,
positive.weight = 1, Lq = 2, min.support = 1, min.pattern = 1,
max.pattern = 100, gap = 0, token.type = "word",
convergence.threshold = 1e-04)
Arguments
corpus |
A list of strings or a corpus from the |
labeling |
A vector of +1/-1 or TRUE/FALSE indicating which documents are considered relevant and which are baseline. A +1/-1 vector can also contain 0, which means drop the document. |
banned |
List of words that should be dropped from consideration. |
objective.function |
2 is hinge loss. 0 is something. 1 is something else. |
C |
The regularization term. 0 is no regularization. |
a |
What fraction of the regularization should be L1 penalty (a=1) vs. L2 penalty (a=0). |
maxIter |
Number of gradient descent steps to take (not including intercept adjustments) |
verbosity |
Level of output. 0 is no printed output. |
step.verbosity |
Level of output for line searches. 0 is no printed output. |
positive.only |
Disallow negative features if true |
binary.features |
Just code presence/absence of a feature in a document rather than count of feature in document. |
no.regularization |
Do not renormalize the features at all. (Lq will be ignored.) |
positive.weight |
Scale the weight of all positively marked documents by this value. The default is 1 (i.e., no scaling). NOT FULLY IMPLEMENTED |
Lq |
Rescaling to put on the features (2 is standard). Can be from 1 up. Values above 10 invoke an infinity-norm. |
min.support |
Only consider phrases that appear this many times or more. |
min.pattern |
Only consider phrases this long or longer |
max.pattern |
Only consider phrases this short or shorter |
gap |
Allow phrases that have wildcard words in them. Number is how many wildcards in a row. |
token.type |
"word" or "character" as tokens. |
convergence.threshold |
How to decide if descent has converged. (Will go for three steps at this threshold to check for flatness.) |
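Several of the arguments above describe how documents and features are coded before the regression is run. The base R sketch below illustrates three of them on toy data: a 0 in labeling dropping a document, count versus presence/absence coding (binary.features), and rescaling a feature by its Lq norm. It is only an illustration. The helper names count_word and lq_rescale are hypothetical, and reading "rescaling" as division by the Lq norm of the feature vector is an assumption, not something this page states.

```r
# Toy data; these documents and labels are made up for illustration.
docs <- c("the cat sat", "the cat and the cat", "a dog ran", "no pets here")
labeling <- c(1, 1, -1, 0)

# A 0 in a +1/-1 labeling means the document is dropped.
keep <- labeling != 0
docs <- docs[keep]
y <- labeling[keep]

# Count how often a single-word phrase occurs in one document.
count_word <- function(doc, word) {
  tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
  sum(tokens == word)
}
counts <- vapply(docs, count_word, numeric(1), word = "cat",
                 USE.NAMES = FALSE)

# Presence/absence coding, as binary.features = TRUE describes.
presence <- as.numeric(counts > 0)

# Assumed reading of Lq: divide the feature vector by its Lq norm.
lq_rescale <- function(x, q = 2) {
  nrm <- sum(abs(x)^q)^(1 / q)
  if (nrm == 0) x else x / nrm
}
scaled <- lq_rescale(counts, q = 2)
```

With q = 2 the rescaled feature has unit Euclidean length, so a frequent phrase and a rare phrase are penalized on a comparable scale.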
Details
See the bathtub vignette for more complete discussion of this method and the options you might pass to it.
Value
A textreg.result object.
Examples
data( testCorpora )
textreg( testCorpora$testI$corpus, testCorpora$testI$labelI, c(), C=1, verbosity=1 )
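A further call along the same lines, shown as a sketch of how some of the other arguments from the Usage section might be combined. The specific values (min.support = 2, max.pattern = 3, gap = 1, and so on) are arbitrary choices for illustration, not recommended settings, and the requireNamespace guard is only there so the snippet does nothing when the package is not installed.

```r
res <- NULL
if (requireNamespace("textreg", quietly = TRUE)) {
  data(testCorpora, package = "textreg")

  # Presence/absence features, positive phrases only, short phrases
  # with at most one wildcard word in a row, and no printed output.
  res <- textreg::textreg(testCorpora$testI$corpus,
                          testCorpora$testI$labelI,
                          banned = NULL,
                          C = 1, a = 1,
                          binary.features = TRUE,
                          positive.only = TRUE,
                          min.support = 2,
                          max.pattern = 3,
                          gap = 1,
                          verbosity = 0)
}
```

When the package is available, res should hold the textreg.result object described under Value.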