| textTokenizer {lares} | R Documentation |
Tokenize Vectors into Words
Description
This function transforms texts into words, calculates their frequencies, and suppresses stop words in a given language.
Usage
textTokenizer(
text,
exclude = NULL,
lang = NULL,
min_word_freq = 5,
min_word_len = 2,
keep_spaces = FALSE,
lowercase = TRUE,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_lettt = TRUE,
laughs = TRUE,
utf = TRUE,
df = FALSE,
h2o = FALSE,
quiet = FALSE
)
Arguments
text |
Character vector. Sentences or texts you wish to tokenize. |
exclude |
Character vector. Which words do you wish to exclude? |
lang |
Character. Language in text (used for stop words). Example:
"spanish" or "english". Set to |
min_word_freq |
Integer. This will discard words that appear
fewer than <int> times. Defaults to 5. Set to |
min_word_len |
Integer. This will discard words that have
fewer than <int> characters. Defaults to 2. Set to |
keep_spaces |
Boolean. Set to TRUE to keep the spaces within each line so that compound words stay together as single tokens. For example, 'one two' becomes 'one_two' and is treated as a single word. |
lowercase, remove_numbers, remove_punct |
Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively? |
remove_lettt |
Boolean. Remove repeated letters (more than 3 consecutive)? |
laughs |
Boolean. Try to unify all laughter texts? |
utf |
Boolean. Transform all characters to UTF (no accents or crazy symbols)? |
df |
Boolean. Return a data.frame with one-hot-encoding style results? Each word becomes a column indicating whether the text contains that word. |
h2o |
Boolean. Return |
quiet |
Boolean. Keep quiet? If not, print messages. |
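To make the interaction between exclude, min_word_freq, min_word_len, lowercase, remove_numbers and remove_punct more concrete, here is a minimal base-R sketch of that kind of pipeline. It is not the lares implementation: sketch_tokenize is a hypothetical helper written for illustration, the output column names word and n are assumptions, and stop words, laughs, repeated letters and UTF handling are left out.

```r
# Hypothetical helper: a base-R sketch of the filtering steps, not lares code.
sketch_tokenize <- function(text, exclude = NULL,
                            min_word_freq = 5, min_word_len = 2,
                            lowercase = TRUE, remove_numbers = TRUE,
                            remove_punct = TRUE) {
  if (lowercase) text <- tolower(text)
  if (remove_numbers) text <- gsub("[[:digit:]]+", " ", text)
  if (remove_punct) text <- gsub("[[:punct:]]+", " ", text)
  # Split on whitespace and drop empty strings
  words <- unlist(strsplit(text, "[[:space:]]+"))
  words <- words[nzchar(words)]
  # Drop excluded words and words shorter than min_word_len
  words <- words[!words %in% exclude & nchar(words) >= min_word_len]
  # Count occurrences and keep words seen at least min_word_freq times
  counts <- table(words)
  counts <- counts[counts >= min_word_freq]
  out <- data.frame(word = as.character(names(counts)),
                    n = as.integer(counts),
                    stringsAsFactors = FALSE)
  # Most frequent words first, ties broken alphabetically
  out <- out[order(-out$n, out$word), ]
  rownames(out) <- NULL
  out
}

texts <- c("The cat sat on the mat.", "A cat and a dog!", "The dog ran 2 miles.")
res <- sketch_tokenize(texts, exclude = "on", min_word_freq = 2, min_word_len = 2)
print(res)
```

With lares installed, a comparable call would presumably be textTokenizer(texts, exclude = "on", min_word_freq = 2), although its cleaning rules and output columns may differ from this sketch.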
Value
data.frame. Tokenized words with counters.
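When df = TRUE, the result is described as one-hot-encoding style, with one column per word. The following base-R sketch shows that general shape under stated assumptions: sketch_one_hot is a hypothetical helper, and whether lares reports TRUE/FALSE, 0/1, or counts in each column is not specified here.

```r
# Hypothetical helper: one indicator column per word, one row per input text.
sketch_one_hot <- function(text, words) {
  # Light cleaning so tokens match regardless of case and punctuation
  clean <- tolower(gsub("[[:punct:]]+", " ", text))
  tokens <- strsplit(clean, "[[:space:]]+")
  # For each word, flag the texts whose tokens include it
  flags <- lapply(words, function(w) {
    vapply(tokens, function(tk) w %in% tk, logical(1))
  })
  names(flags) <- words
  data.frame(flags, check.names = FALSE)
}

texts <- c("The cat sat on the mat.", "A cat and a dog!", "The dog ran 2 miles.")
ohe <- sketch_one_hot(texts, c("the", "cat", "dog"))
print(ohe)
```

Each row corresponds to one element of the input vector, so the indicator columns can be bound back onto the original data for modelling.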
See Also
Other Data Wrangling:
balance_data(),
categ_reducer(),
cleanText(),
date_cuts(),
date_feats(),
file_name(),
formatHTML(),
holidays(),
impute(),
left(),
normalize(),
num_abbr(),
ohe_commas(),
ohse(),
quants(),
removenacols(),
replaceall(),
replacefactor(),
textFeats(),
vector2text(),
year_month(),
zerovar()
Other Text Mining:
cleanText(),
ngrams(),
remove_stopwords(),
replaceall(),
sentimentBreakdown(),
textCloud(),
textFeats(),
topics_rake()