tokenize {lexRankr}    R Documentation
Tokenize a character vector
Description
Parse the elements of a character vector into a list of cleaned tokens.
Usage
tokenize(text, removePunc = TRUE, removeNum = TRUE, toLower = TRUE,
  stemWords = TRUE, rmStopWords = TRUE)
Arguments
text
    The character vector to be tokenized.

removePunc
    TRUE or FALSE indicating whether to remove punctuation from text. Defaults to TRUE.

removeNum
    TRUE or FALSE indicating whether to remove numbers from text. Defaults to TRUE.

toLower
    TRUE or FALSE indicating whether to coerce all of text to lowercase. Defaults to TRUE.

stemWords
    TRUE or FALSE indicating whether to stem the resulting tokens. If TRUE, tokens will be stemmed using SnowballC::wordStem(). Defaults to TRUE.

rmStopWords
    TRUE, FALSE, or a character vector of stopwords to remove. If TRUE, words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE, no stopword removal will occur. If a character vector is passed, that vector will be used as the list of stopwords to remove (see Examples). Defaults to TRUE.
Examples
tokenize("Mr. Feeny said the test would be on Sat. At least I'm 99.9% sure that's what he said.")
tokenize("Bill is trying to earn a Ph.D. in his field.", rmStopWords=FALSE)
[Package lexRankr version 0.5.2 Index]