tokenize {lexRankr} | R Documentation
Tokenize a character vector

Description

Parse the elements of a character vector into a list of cleaned tokens.
Usage
tokenize(text, removePunc = TRUE, removeNum = TRUE, toLower = TRUE,
  stemWords = TRUE, rmStopWords = TRUE)
Arguments
text
    The character vector to be tokenized.

removePunc
    TRUE or FALSE indicating whether to remove punctuation from text. If TRUE, punctuation will be removed. Defaults to TRUE.

removeNum
    TRUE or FALSE indicating whether to remove numbers from text. If TRUE, numbers will be removed. Defaults to TRUE.

toLower
    TRUE or FALSE indicating whether to coerce all of text to lowercase. If TRUE, text will be coerced to lowercase. Defaults to TRUE.

stemWords
    TRUE or FALSE indicating whether to stem the resulting tokens. If TRUE, the output tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords
    TRUE, FALSE, or a character vector of stopwords to remove. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, it will be used as the list of stopwords to remove. Defaults to TRUE.
Examples
tokenize("Mr. Feeny said the test would be on Sat. At least I'm 99.9% sure that's what he said.")
tokenize("Bill is trying to earn a Ph.D. in his field.", rmStopWords=FALSE)
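The examples above use the defaults. As a further sketch (assuming lexRankr is installed), the call below exercises the rmStopWords and stemWords behaviors described under Arguments: a custom stopword vector is removed instead of lexRankr::smart_stopwords, and stemming is turned off so tokens are returned unstemmed.

```r
library(lexRankr)

# Remove only a user-supplied stopword list and keep tokens unstemmed;
# punctuation/number removal and lowercasing still apply by default
tokens <- tokenize("Bill is trying to earn a Ph.D. in his field.",
                   rmStopWords = c("is", "a", "to", "in", "his"),
                   stemWords   = FALSE)

# tokenize() returns a list with one element per input string
str(tokens)
```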
[Package lexRankr version 0.5.2 Index]