R: Logistic regression classifier for texts

textmodel_lr {quanteda.textmodels}

R Documentation

Logistic regression classifier for texts

Description

Fits a fast penalized maximum likelihood estimator to predict discrete categories from sparse dfm objects. Using the glmnet package, the function computes the regularization path for the lasso or elastic net penalty over a grid of values of the regularization parameter lambda. The value of lambda is selected automatically by k-fold cross-validation (via cv.glmnet()) at estimation time.
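
The cross-validated choice of lambda described above can be illustrated by calling glmnet directly. This is a minimal sketch, not the internals of textmodel_lr(): the simulated sparse matrix, the feature names, and the use of 5 folds are all assumptions made for illustration.

```r
library("glmnet")
library("Matrix")

set.seed(42)
n <- 200   # number of simulated "documents" (assumed)
p <- 20    # number of simulated "features" (assumed)

# sparse 0/1 matrix standing in for a document-feature matrix
x <- Matrix(rbinom(n * p, 1, 0.2), nrow = n, ncol = p, sparse = TRUE)
colnames(x) <- paste0("feat", seq_len(p))

# binary outcome that depends on the first two features only
eta <- 2 * x[, 1] - 2 * x[, 2]
y <- factor(ifelse(runif(n) < plogis(eta), "Y", "N"))

# lasso path (alpha = 1) evaluated by 5-fold cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

length(cvfit$lambda)           # size of the lambda grid actually fitted
cvfit$lambda.min               # lambda with the lowest cross-validated error
cvfit$lambda.1se               # largest lambda within one SE of that minimum
coef(cvfit, s = "lambda.min")  # sparse coefficient vector at the chosen lambda
```

Both lambda.min and lambda.1se are taken from the fitted grid; which of the two to use at prediction time is a modelling choice.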

Usage

textmodel_lr(x, y, ...)

Arguments

`x`	the dfm on which the model will be fit. Does not need to contain only the training documents.
`y`	vector of training labels, one for each document in `x`. (These will be converted to factors if not already factors.)
`...`	additional arguments passed to `cv.glmnet()`
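
As a hedged illustration of the `...` argument, the sketch below forwards `alpha` and `nfolds` to `cv.glmnet()`. The values 0.5 and 5 are arbitrary choices for demonstration, and the toy corpus is the one used in the Examples section of this page.

```r
library("quanteda")
library("quanteda.textmodels")

# toy corpus from the Examples section; d6 is unlabelled (NA)
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))

# resample to a larger dfm, as in the Examples section
set.seed(1)
dfmat <- dfm_sample(dfm(tokens(corp), tolower = FALSE), 50, replace = TRUE)

# alpha = 0.5 requests an elastic net penalty; nfolds = 5 sets the number
# of cross-validation folds (both values are illustrative assumptions)
tmod_en <- textmodel_lr(dfmat, docvars(dfmat, "train"), alpha = 0.5, nfolds = 5)

# lambda selected by cross-validation under these settings
tmod_en$lrfitted$lambda.min
```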

Value

an object of class textmodel_lr, a list containing:

`x`, `y`	the input model matrix and input training class labels
`algorithm`	character; the type and family of logistic regression model used in calling `cv.glmnet()`
`type`	the type of model associated with `algorithm`
`classnames`	the levels of training classes in `y`
`lrfitted`	the fitted model object from `cv.glmnet()`
`call`	the model call
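
The components listed above can be inspected with `$`, since the returned object is a list. The following sketch assumes the toy corpus from the Examples section of this page; the object name `tmod` is chosen here for illustration.

```r
library("quanteda")
library("quanteda.textmodels")

# toy corpus from the Examples section; d6 is unlabelled (NA)
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
set.seed(1)
dfmat <- dfm_sample(dfm(tokens(corp), tolower = FALSE), 50, replace = TRUE)

tmod <- textmodel_lr(dfmat, docvars(dfmat, "train"))

class(tmod)                # includes "textmodel_lr"
tmod$algorithm             # description of the fitted model type and family
tmod$classnames            # levels of the training labels
class(tmod$lrfitted)       # the underlying cv.glmnet fit
tmod$lrfitted$lambda.min   # lambda chosen by cross-validation
```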

References

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software 33(1), 1-22. doi:10.18637/jss.v033.i01

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
library("quanteda")
library("quanteda.textmodels")
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
dfmat <- dfm(tokens(corp), tolower = FALSE)

## simulate bigger sample as classification on small samples is problematic
set.seed(1)
dfmat <- dfm_sample(dfmat, 50, replace = TRUE)

## train model
(tmod1 <- textmodel_lr(dfmat, docvars(dfmat, "train")))
summary(tmod1)
coef(tmod1)

## predict probability and classes
predict(tmod1, type = "prob")
predict(tmod1)
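
A further sketch shows prediction for documents that were not part of the fitted dfm. It assumes that the predict method accepts a `newdata` dfm, and it aligns the new features with the training features using `dfm_match()`; the two new documents `n1` and `n2` are invented for illustration.

```r
library("quanteda")
library("quanteda.textmodels")

# toy corpus from the example above; d6 is unlabelled (NA)
corp <- corpus(c(d1 = "Chinese Beijing Chinese",
                 d2 = "Chinese Chinese Shanghai",
                 d3 = "Chinese Macao",
                 d4 = "Tokyo Japan Chinese",
                 d5 = "London England Chinese",
                 d6 = "Chinese Chinese Chinese Tokyo Japan"),
               docvars = data.frame(train = factor(c("Y", "Y", "Y", "N", "N", NA))))
set.seed(1)
dfmat <- dfm_sample(dfm(tokens(corp), tolower = FALSE), 50, replace = TRUE)
tmod <- textmodel_lr(dfmat, docvars(dfmat, "train"))

# hypothetical unseen documents
newtxt <- c(n1 = "Chinese Chinese Shanghai Macao",
            n2 = "Tokyo Japan London")
dfmat_new <- dfm(tokens(newtxt), tolower = FALSE)

# give the new dfm the same feature set, in the same order, as the training dfm
dfmat_new <- dfm_match(dfmat_new, features = featnames(dfmat))

# predicted classes and class probabilities for the new documents
pred_class <- predict(tmod, newdata = dfmat_new)
pred_prob <- predict(tmod, newdata = dfmat_new, type = "probability")
pred_class
pred_prob
```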

[Package quanteda.textmodels version 0.9.7 Index]