
textreg {textreg}

R Documentation

Sparse regression of labeling vector onto all phrases in a corpus.

Description

Given a labeling and a corpus, find phrases that predict this labeling. This function calls a C++ routine that builds a tree of phrases and searches it with greedy coordinate descent to solve the optimization problem for the corresponding sparse regression.
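As a rough sketch of the kind of problem being solved, and assuming the usual elastic-net form (the exact formula is not stated on this page), the objective can be written with C and a taken from the arguments of the same name. Here y_i is the label of document i, x_i is its vector of phrase features, and the loss is whichever one objective.function selects; all of these symbols are notation assumed for illustration.

```latex
% Assumed elastic-net form; not taken from the package source.
\min_{\beta_0,\,\beta}\;
  \sum_{i=1}^{n} \ell\bigl(y_i,\; \beta_0 + x_i^{\top}\beta\bigr)
  \;+\; C\left( a\,\lVert\beta\rVert_1 + \frac{1-a}{2}\,\lVert\beta\rVert_2^2 \right)
```

Under this reading, a = 1 gives a pure L1 penalty and a = 0 a pure L2 penalty, which matches the description of the a argument.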

Usage

textreg(corpus, labeling, banned = NULL, objective.function = 2,
  C = 1, a = 1, maxIter = 40, verbosity = 1,
  step.verbosity = verbosity, positive.only = FALSE,
  binary.features = FALSE, no.regularization = FALSE,
  positive.weight = 1, Lq = 2, min.support = 1, min.pattern = 1,
  max.pattern = 100, gap = 0, token.type = "word",
  convergence.threshold = 1e-04)

Arguments

`corpus`	A list of strings or a corpus from the `tm` package.
`labeling`	A vector of +1/-1 or TRUE/FALSE indicating which documents are considered relevant and which are baseline. A +1/-1 labeling may also contain 0, which means the document is dropped.
`banned`	List of words that should be dropped from consideration.
`objective.function`	2 is hinge loss. 0 is something. 1 is something else.
`C`	The regularization term. 0 is no regularization.
`a`	Fraction of the regularization that is L1 penalty (a=1) versus L2 penalty (a=0).
`maxIter`	Number of gradient descent steps to take (not including intercept adjustments)
`verbosity`	Level of output. 0 is no printed output.
`step.verbosity`	Level of output for line searches. 0 is no printed output.
`positive.only`	Disallow negative features if true
`binary.features`	Just code presence/absence of a feature in a document rather than count of feature in document.
`no.regularization`	Do not renormalize the features at all. (Lq will be ignored.)
`positive.weight`	Scale the weight of all positively marked documents by this value. The default is 1, i.e., no scaling. NOT FULLY IMPLEMENTED.
`Lq`	Rescaling to put on the features (2 is standard). Can be from 1 up. Values above 10 invoke an infinity-norm.
`min.support`	Only consider phrases that appear this many times or more.
`min.pattern`	Only consider phrases this long or longer
`max.pattern`	Only consider phrases this short or shorter
`gap`	Allow phrases that have wildcard words in them. Number is how many wildcards in a row.
`token.type`	"word" or "character" as tokens.
`convergence.threshold`	How to decide if descent has converged. (Will go for three steps at this threshold to check for flatness.)
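
The corpus, labeling, and banned arguments described above can be prepared in base R before calling textreg. The following is a minimal sketch with made-up documents; the variable names (docs, is.relevant) are hypothetical, and it assumes that a plain character vector is acceptable as the "list of strings".

```r
# Hypothetical toy data; none of these documents come from the package.
docs <- c("the cat sat on the mat",
          "dogs chase the cat",
          "stock prices fell sharply",
          "markets rallied on friday",
          "untitled draft")

# Logical flags: TRUE = relevant, FALSE = baseline, NA = label unknown.
is.relevant <- c(TRUE, TRUE, FALSE, FALSE, NA)

# Convert to the +1/-1 coding, using 0 to mark documents that should be dropped.
labeling <- ifelse(is.na(is.relevant), 0L, ifelse(is.relevant, 1L, -1L))

# Words to exclude from consideration.
banned <- c("the", "on")

# The labeling must line up with the corpus, one entry per document.
stopifnot(length(docs) == length(labeling))
```

These objects would then be passed as textreg(docs, labeling, banned).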

Details

See the bathtub vignette for more complete discussion of this method and the options you might pass to it.

Value

A textreg.result object.

Examples

data( testCorpora )
textreg( testCorpora$testI$corpus, testCorpora$testI$labelI, c(), C=1, verbosity=1 )
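
A variant of the call above with more of the documented arguments given by name may be easier to read. This is only a sketch: the argument values are arbitrary illustrative choices, not recommendations, and the call is skipped when the textreg package is not installed.

```r
# Arguments spelled out by name; the values are illustrative only.
args <- list(
  C = 3,                   # more regularization than the default of 1
  maxIter = 100,           # allow more descent steps than the default of 40
  verbosity = 0,           # no printed output
  binary.features = TRUE,  # presence/absence instead of counts
  min.support = 2,         # phrases must appear at least twice
  max.pattern = 3,         # phrases of at most three tokens
  gap = 1                  # allow one wildcard word in a row
)

# Run only if the package is available.
if (requireNamespace("textreg", quietly = TRUE)) {
  data("testCorpora", package = "textreg")
  args$corpus <- testCorpora$testI$corpus
  args$labeling <- testCorpora$testI$labelI
  res <- do.call(textreg::textreg, args)
  print(res)
}
```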

[Package textreg version 0.1.5 Index]