R: Tools for Text Classification

textToXY,textToXYpred {regtools}

R Documentation

Tools for Text Classification

Description

"R-style," classification-oriented wrappers for the text2vec package.

Usage

    textToXY(docs,labels,kTop=50,stopWords='a') 
    textToXYpred(ttXYout,predDocs)

Arguments

`docs`	Character vector, one element per document.
`predDocs`	Character vector, one element per document.
`labels`	Class labels, as numeric, character or factor. NULL is used at the prediction stage.
`kTop`	The number of most-frequent words to retain; 0 means retain all.
`stopWords`	Character vector of common words, e.g. prepositions, to delete. `tm::stopwords('english')` is recommended.
`ttXYout`	Output object from `textToXY`.
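To make the roles of `kTop` and `stopWords` concrete, here is a minimal base-R sketch of stop-word removal followed by keeping the most frequent words. It is an illustration only and does not reproduce the regtools or text2vec code; the toy documents and the variable names `words`, `freq` and `kept` are assumptions made for this example.

```r
# Illustration only: not the regtools implementation.
docs <- c("the cat sat on the mat", "the dog sat on the log")
stopWords <- c("the", "on")   # words to delete
kTop <- 2                     # number of most-frequent words to retain

# Split into lower-case words, dropping empty strings and stop words.
words <- unlist(strsplit(tolower(docs), "[^a-z0-9]+"))
words <- words[nzchar(words) & !(words %in% stopWords)]

# Rank the remaining words by frequency, most frequent first.
freq <- sort(table(words), decreasing = TRUE)

# kTop = 0 is taken to mean "retain all", as in the argument description.
kept <- if (kTop > 0) names(freq)[seq_len(min(kTop, length(freq)))] else names(freq)
print(kept)
```

Only the words in `kept` would then become columns of X.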

Details

A typical classification/machine learning package will have as arguments a feature matrix X and a labels vector/factor Y. For a "bag of words" analysis in the text case, each row of X would be a document and each column a word.
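As a rough picture of what such an X looks like, the following base-R sketch builds a small word-count matrix by hand. The helper `bagOfWords` is hypothetical, written for this example; it is not part of regtools or text2vec, and its tokenization rule (split on anything that is not a lower-case letter or digit) is an assumption.

```r
# Hypothetical helper, for illustration only.
# Returns a count matrix: one row per document, one column per word.
bagOfWords <- function(docs) {
  tokens <- strsplit(tolower(docs), "[^a-z0-9]+")
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  vocab <- sort(unique(unlist(tokens)))
  X <- matrix(0, nrow = length(docs), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    # Count how often each vocabulary word occurs in document i.
    X[i, ] <- as.vector(table(factor(tokens[[i]], levels = vocab)))
  }
  X
}

docs <- c("the cat sat", "the dog sat on the cat")
X <- bagOfWords(docs)
print(X)
```

Here X has 2 rows (documents) and 5 columns (distinct words), and the entry for "the" in the second row is 2.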

The functions here are basically wrappers for generating X. Wrappers are convenient in that:

The text2vec package is rather arcane, so an "R-style" wrapper is useful.
The text2vec functions are not directly set up to do classification, so the functions here provide the "glue" to do that.

The typical usage pattern is thus:

Run the documents vector and labels vector/factor through textToXY, generating X and Y.
Apply your favorite classification/machine learning package p to X and Y, returning o.
When predicting a new document d, run the textToXY output and d through textToXYpred, producing x.
Run o and x through p's predict function.
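The pattern above can be sketched end to end in base R. This sketch deliberately does not call regtools: `countWords` is a hypothetical stand-in for the X-generating step, and a toy nearest-centroid rule stands in for "your favorite package p". The point it illustrates is that the new document must be encoded with the same word columns as the training data, which is why the prediction step needs the output of the training step.

```r
# Hypothetical helper, for illustration only: count each vocabulary
# word in each document (words outside the vocabulary are ignored).
countWords <- function(docs, vocab) {
  tokens <- strsplit(tolower(docs), "[^a-z0-9]+")
  X <- matrix(0, nrow = length(docs), ncol = length(vocab),
              dimnames = list(NULL, vocab))
  for (i in seq_along(tokens))
    X[i, ] <- as.vector(table(factor(tokens[[i]], levels = vocab)))
  X
}

# Step 1: training documents and labels -> X and Y.
trainDocs <- c("cheap pills buy now", "buy cheap watches",
               "meeting agenda for monday", "notes from the monday meeting")
Y <- factor(c("spam", "spam", "ham", "ham"))
vocab <- sort(unique(unlist(strsplit(tolower(trainDocs), "[^a-z0-9]+"))))
vocab <- vocab[nzchar(vocab)]
X <- countWords(trainDocs, vocab)

# Step 2: fit a classifier on X and Y. Here: one mean row per class.
centroids <- rowsum(X, Y) / as.vector(table(Y))

# Step 3: encode the new document with the training vocabulary.
x <- countWords("cheap watches buy today", vocab)

# Step 4: predict the class whose centroid is closest to x.
d <- apply(centroids, 1, function(ctr) sum((x[1, ] - ctr)^2))
pred <- names(which.min(d))
print(pred)
```

With regtools, steps 1 and 3 would instead be calls to `textToXY` and `textToXYpred` as shown under Usage, and steps 2 and 4 would use the fitting and predict functions of whichever package is chosen.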

Value

The function textToXY returns an R list with components x and y for X and Y, and a copy of the input stopWords.

The function textToXYpred returns X.

Author(s)

Norm Matloff

[Package regtools version 1.7.0 Index]