R: Tokenize Vectors into Words

textTokenizer {lares}

R Documentation

Tokenize Vectors into Words

Description

This function transforms texts into words, calculates their frequencies, and suppresses stop words in a given language.

Usage

textTokenizer(
  text,
  exclude = NULL,
  lang = NULL,
  min_word_freq = 5,
  min_word_len = 2,
  keep_spaces = FALSE,
  lowercase = TRUE,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_lettt = TRUE,
  laughs = TRUE,
  utf = TRUE,
  df = FALSE,
  h2o = FALSE,
  quiet = FALSE
)

Arguments

`text`	Character vector. Sentences or texts you wish to tokenize.
`exclude`	Character vector. Which words do you wish to exclude?
`lang`	Character. Language in text (used for stop words). Example: "spanish" or "english". Set to `NA` to ignore.
`min_word_freq`	Integer. Discard words that appear fewer than <int> times. Defaults to 5. Set to `NA` to ignore.
`min_word_len`	Integer. Discard words that have fewer than <int> characters. Defaults to 2. Set to `NA` to ignore.
`keep_spaces`	Boolean. Set to TRUE to keep the spaces within each line so that compound words separated by spaces stay together. For example, 'one two' will be set as 'one_two' and treated as a single word.
`lowercase`, `remove_numbers`, `remove_punct`	Boolean. Convert text to lowercase, remove numbers, and remove punctuation, respectively?
`remove_lettt`	Boolean. Remove repeated letters (more than 3 consecutive)?
`laughs`	Boolean. Try to unify all laughs texts.
`utf`	Boolean. Transform all characters to UTF (no accents or unusual symbols)?
`df`	Boolean. Return a data.frame with one-hot-encoding style results? Each word becomes a column that indicates whether the text contains that word.
`h2o`	Boolean. Return `H2OFrame`?
`quiet`	Boolean. Keep quiet? If not, print messages.
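
The two thresholds described above can be illustrated with a minimal base-R sketch. This is not the lares implementation; the input sentences, the splitting rule, and the column names `word` and `n` are assumptions chosen only to show how a frequency cutoff and a length cutoff act on a word count table.

```r
# Hypothetical input sentences (not taken from the lares documentation)
text <- c("the cat sat", "the cat ran", "a dog")

# Naive tokenization: lowercase, then split on anything that is not a-z
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nzchar(words)]

# Count how often each word appears
freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(freq) <- c("word", "n")

# Thresholds analogous to the min_word_freq and min_word_len arguments
min_word_freq <- 2
min_word_len <- 2

# Keep words that meet both thresholds
keep <- freq$n >= min_word_freq & nchar(freq$word) >= min_word_len
filtered <- freq[keep, ]
print(filtered)
```

In this sketch only words that appear at least twice and have at least two characters survive, so a one-letter word such as "a" is dropped by the length cutoff and words seen once are dropped by the frequency cutoff.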

Value

data.frame. Tokenized words with counters.
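
Examples

A minimal usage sketch based on the signature shown under Usage. The `reviews` vector is hypothetical sample data, and the call assumes the lares package (and whatever it depends on for text processing) is installed; the exact columns of the result are not shown here because they are not specified above.

```r
# Hypothetical sample texts (not from the lares documentation)
reviews <- c(
  "The service was great and the food was great",
  "Great food but slow service",
  "Slow service, hahaha, never again"
)

# Only run the call when lares is available
if (requireNamespace("lares", quietly = TRUE)) {
  tokens <- lares::textTokenizer(
    reviews,
    lang = "english",   # use English stop words
    min_word_freq = 1,  # keep words even if they appear only once
    quiet = TRUE
  )
  print(head(tokens))
}
```

Setting `min_word_freq = 1` is a deliberate choice for such a small input: with the default of 5, few words in three short sentences would remain.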
