textToXY, textToXYpred {regtools} | R Documentation
Tools for Text Classification
Description
"R-style," classification-oriented wrappers for the text2vec package.
Usage
textToXY(docs,labels,kTop=50,stopWords='a')
textToXYpred(ttXYout,predDocs)
Arguments
docs
Character vector, one element per document.
predDocs
Character vector, one element per document.
labels
Class labels, as numeric, character or factor. NULL is used at the prediction stage.
kTop
The number of most-frequent words to retain; 0 means retain all.
stopWords
Character vector of common words to delete, e.g. prepositions. Recommended is
ttXYout
Output object from textToXY.
Details
A typical classification/machine learning package will have as arguments a feature matrix X and a labels vector/factor Y. For a "bag of words" analysis in the text case, each row of X would be a document and each column a word.
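To make the shape of X concrete, here is a minimal base-R sketch of a bag-of-words matrix for a toy corpus. It is only an illustration of the idea: it does not use text2vec and is not how textToXY is implemented. The tokenizer (split on non-letters) and the variable names are assumptions made for this example; the last step mimics the intent of kTop by keeping the most frequent words.

```r
# Toy corpus: one element per document, as in the docs argument.
docs <- c("the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats make good pets")

# Lowercase and split on non-letter characters (a deliberately simple tokenizer).
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab <- sort(unique(unlist(tokens)))

# X: one row per document, one column per word; entries are word counts.
X <- t(vapply(tokens,
              function(tk) as.integer(table(factor(tk, levels = vocab))),
              integer(length(vocab))))
colnames(X) <- vocab

# Keep only the kTop most frequent words (illustrative analogue of kTop).
kTop <- 5
freq <- colSums(X)
keep <- names(sort(freq, decreasing = TRUE))[seq_len(min(kTop, length(freq)))]
Xtop <- X[, keep, drop = FALSE]
```

Here X has 3 rows (documents) and 13 columns (distinct words), and Xtop retains the 5 most frequent columns, led by "the" and "cat".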
The functions here are basically wrappers for generating X. Wrappers are convenient in that:
The text2vec package is rather arcane, so an "R-style" wrapper is useful.
The text2vec functions are not directly set up to do classification, so the functions here provide the "glue" to do that.
The typical usage pattern is thus:
Run the documents vector and labels vector/factor through textToXY, generating X and Y.
Apply your favorite classification/machine learning package p to X and Y, returning o.
When predicting a new document d, run the textToXY output and d through textToXYpred, producing x.
Run o and x through p's predict function.
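The usage pattern above can be sketched end to end in base R. All helpers here (countWords, bowFit, bowPredict, centroidFit, centroidPredict) are hypothetical stand-ins written for this illustration; they are not regtools or text2vec code. bowFit plays the role of textToXY, bowPredict the role of textToXYpred, and a simple nearest-centroid classifier stands in for package p. The point to notice is that the prediction step reuses the training vocabulary, so x has the same columns as X.

```r
# Hypothetical helper: word counts per document against a fixed vocabulary.
countWords <- function(tokens, vocab) {
  x <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = vocab))),
                integer(length(vocab))))
  colnames(x) <- vocab
  x
}

# Stand-in for textToXY: build X and Y, and remember the vocabulary.
bowFit <- function(docs, labels) {
  tokens <- strsplit(tolower(docs), "[^a-z]+")
  vocab <- sort(unique(unlist(tokens)))
  list(x = countWords(tokens, vocab), y = labels, vocab = vocab)
}

# Stand-in for textToXYpred: build x for new documents using the training vocabulary.
bowPredict <- function(fitOut, predDocs) {
  countWords(strsplit(tolower(predDocs), "[^a-z]+"), fitOut$vocab)
}

# Stand-in for package p: nearest-centroid classifier.
centroidFit <- function(x, y) {
  y <- factor(y)
  rowsum(x, y) / as.vector(table(y))  # one mean count vector per class
}
centroidPredict <- function(centers, newx) {
  unname(apply(newx, 1, function(r) {
    d <- rowSums(sweep(centers, 2, r)^2)  # squared distance to each class centroid
    names(which.min(d))
  }))
}

docs <- c("great movie loved it", "loved the acting great fun",
          "terrible movie hated it", "hated the plot terrible boring")
labels <- c("pos", "pos", "neg", "neg")

tr <- bowFit(docs, labels)          # step 1: X and Y
o <- centroidFit(tr$x, tr$y)        # step 2: fit the classifier
x <- bowPredict(tr, "loved it great acting")  # step 3: x for a new document
pred <- centroidPredict(o, x)       # step 4: predict its class
```

With this toy data the new document is assigned to class "pos". Words in a new document that never appeared in training are simply dropped, since they have no column in X.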
Value
The function textToXY returns an R list with components x and y for X and Y, and a copy of the input stopWords.
The function textToXYpred returns X for the documents in predDocs.
Author(s)
Norm Matloff