textTokenizer {lares} | R Documentation |
Tokenize Vectors into Words
Description
This function transforms texts into words, calculates their frequencies, and suppresses stop words in a given language.
Usage
textTokenizer(
  text,
  exclude = NULL,
  lang = NULL,
  min_word_freq = 5,
  min_word_len = 2,
  keep_spaces = FALSE,
  lowercase = TRUE,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_lettt = TRUE,
  laughs = TRUE,
  utf = TRUE,
  df = FALSE,
  h2o = FALSE,
  quiet = FALSE
)
Arguments
text |
Character vector. Sentences or texts you wish to tokenize. |
exclude |
Character vector. Which words do you wish to exclude? |
lang |
Character. Language in text (used for stop words). Example:
"spanish" or "english". Set to |
min_word_freq |
Integer. This will discard words that appear
fewer than <int> times. Defaults to 5. Set to |
min_word_len |
Integer. This will discard words that have
fewer than <int> characters. Defaults to 2. Set to |
keep_spaces |
Boolean. Set to TRUE to keep the spaces within each line so that compound words stay together. For example, 'one two' will be set as 'one_two' and treated as a single word. |
lowercase , remove_numbers , remove_punct |
Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively? |
remove_lettt |
Boolean. Remove repeated letters (more than 3 consecutive)? |
laughs |
Boolean. Try to unify all texts that represent laughter? |
utf |
Boolean. Transform all characters to UTF (removing accents and unusual symbols)? |
df |
Boolean. Return a data.frame with one-hot-encoding style results? Each word becomes a column that indicates whether the text contains that word. |
h2o |
Boolean. Return |
quiet |
Boolean. Keep quiet? If not, messages are printed. |
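The keep_spaces and df arguments describe two transformations that are easier to see in code. The following is a minimal base R sketch of those ideas only; join_compound() and one_hot_words() are hypothetical helpers written for illustration, they are not part of lares, and the logical column type is an assumption about what "returns if word is contained" means.

```r
# Hypothetical helper (not part of lares): join the words of each element
# with "_" so the whole element counts as a single token, which is the
# behavior the keep_spaces description gives for 'one two' -> 'one_two'.
join_compound <- function(x) gsub("[[:space:]]+", "_", trimws(x))

# Hypothetical helper (not part of lares): build one logical column per
# word, TRUE when the corresponding text contains that word.
one_hot_words <- function(text, vocab) {
  # Split on anything that is not a letter, digit, or underscore
  tokens <- strsplit(tolower(text), "[^[:alnum:]_]+")
  # Start with one row per text and no columns
  out <- data.frame(row.names = seq_along(text))
  for (w in vocab) {
    out[[w]] <- vapply(tokens, function(tk) w %in% tk, logical(1))
  }
  out
}

join_compound(c("one two", "  big data  "))
one_hot_words(c("the cat sat", "a dog"), vocab = c("cat", "dog"))
```

The sketch takes the vocabulary as an explicit argument to stay short; the documented function derives the words from the text itself.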
Value
data.frame. Tokenized words with counters.
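To make "tokenized words with counters" concrete, here is a rough base R sketch of the kind of pipeline the arguments describe: lowercase, strip numbers and punctuation, split into words, then filter by length and frequency. sketch_tokenize() and its output column names (word, n) are assumptions made for illustration; the actual implementation and column names in lares may differ.

```r
# Hypothetical helper (not the lares implementation): count words after
# basic cleaning, mirroring the defaults shown in Usage.
sketch_tokenize <- function(text, exclude = NULL,
                            min_word_freq = 5, min_word_len = 2) {
  x <- tolower(text)                      # lowercase = TRUE
  x <- gsub("[[:digit:]]+", " ", x)       # remove_numbers = TRUE
  x <- gsub("[[:punct:]]+", " ", x)       # remove_punct = TRUE
  words <- unlist(strsplit(x, "[[:space:]]+"))
  words <- words[nzchar(words)]           # drop empty strings from splitting
  words <- words[nchar(words) >= min_word_len]
  words <- words[!words %in% exclude]     # drop user-excluded words
  counts <- table(words)
  out <- data.frame(
    word = as.character(names(counts)),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
  out <- out[out$n >= min_word_freq, , drop = FALSE]
  out <- out[order(-out$n, out$word), , drop = FALSE]  # most frequent first
  rownames(out) <- NULL
  out
}

sketch_tokenize(
  c("The cat sat. The cat ran!", "A cat, 2 dogs"),
  min_word_freq = 2
)
```

Stop-word removal, laugh unification, and UTF conversion are left out of this sketch on purpose, since the page does not describe how they are carried out.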
See Also
Other Data Wrangling:
balance_data(), categ_reducer(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), vector2text(), year_month(), zerovar()
Other Text Mining:
cleanText(), ngrams(), remove_stopwords(), replaceall(), sentimentBreakdown(), textCloud(), textFeats(), topics_rake()
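Examples
A possible call, sketched from the signature in Usage. The sample sentences are made up, and the snippet assumes that lares and any packages it needs for tokenizing are installed; it is guarded so that it does nothing otherwise. No output is shown because the exact columns returned are not documented on this page.

```r
# Made-up sample texts for illustration
texts <- c(
  "The quick brown fox jumps over the lazy dog",
  "The dog barks and the fox runs away",
  "A quick dog is a happy dog"
)

tokens <- NULL
if (requireNamespace("lares", quietly = TRUE)) {
  # Lower min_word_freq from its default of 5 so a tiny sample still
  # returns words; lang selects the stop-word language.
  tokens <- tryCatch(
    lares::textTokenizer(
      texts,
      lang = "english",
      min_word_freq = 1,
      min_word_len = 2,
      quiet = TRUE
    ),
    error = function(e) {
      # For example, a dependency of lares may be missing
      message("textTokenizer() could not run: ", conditionMessage(e))
      NULL
    }
  )
}

if (!is.null(tokens)) print(head(tokens))
```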