step_tokenfilter {textrecipes}    R Documentation
Filter Tokens Based on Term Frequency
Description
step_tokenfilter() creates a specification of a recipe step that filters a token variable based on term frequency.
Usage
step_tokenfilter(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
max_times = Inf,
min_times = 0,
percentage = FALSE,
max_tokens = 100,
filter_fun = NULL,
res = NULL,
skip = FALSE,
id = rand_id("tokenfilter")
)
Arguments
recipe: A recipe object. The step will be added to the sequence of operations for this recipe.

...: One or more selector functions to choose which variables are affected by the step. See selections() for more details.

role: Not used by this step since no new variables are created.

trained: A logical to indicate if the quantities for preprocessing have been estimated.

columns: A character string of variable names that will be populated (eventually) by the terms argument. This is NULL until the step is trained by prep().

max_times: An integer. Maximum number of times a word can appear before getting removed.

min_times: An integer. Minimum number of times a word can appear before getting removed.

percentage: A logical. Should max_times and min_times be interpreted as percentages instead of counts?

max_tokens: An integer. Will only keep the top max_tokens tokens after the filtering done by max_times and min_times. Defaults to 100.

filter_fun: A function. This function should take a vector of characters and return a logical vector of the same length. This function will be applied to each observation of the data set. Defaults to NULL. A sketch of a custom filter function follows this list.

res: The words that will be kept are stored here once this preprocessing step has been trained by prep().

skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.

id: A character string that is unique to this step to identify it.
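As an illustration of the filter_fun argument, here is a minimal sketch using a hypothetical helper keep_long_tokens() that keeps only tokens longer than three characters. The tate_text data and the medium column are the same ones used in the Examples below.

library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)

# Hypothetical filter: receives the character vector of tokens for one
# observation and must return a logical vector of the same length
keep_long_tokens <- function(x) nchar(x) > 3

long_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_tokenfilter(medium, filter_fun = keep_long_tokens)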
Details
This step allows you to limit the tokens you are looking at by filtering on their occurrence in the corpus. You can exclude tokens that appear too many times or too few times in the data. The limits can be specified as counts using max_times and min_times, or as percentages by setting percentage to TRUE. In addition, you can keep only the top max_tokens most-used tokens. If max_tokens is set to Inf, all the tokens will be used. This will generally lead to very large data sets when the tokens are words or trigrams. A good strategy is to start with a low token count and increase it according to how much RAM you want to use.
It is strongly advised to filter before using step_tf or step_tfidf to limit the number of variables created.
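For example, here is a minimal sketch of that recommended ordering, with arbitrary illustration thresholds and the tate_text data used in the Examples below.

library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)

filtered_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  # drop very rare and very common tokens, then keep at most 500 of the rest
  step_tokenfilter(medium, min_times = 5, max_times = 1000, max_tokens = 500) %>%
  # term-frequency variables are only created for the retained tokens
  step_tf(medium)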
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected) and value (the number of unique tokens).
Tuning Parameters
This step has 3 tuning parameters:
- max_times: Maximum Token Frequency (type: integer, default: Inf)
- min_times: Minimum Token Frequency (type: integer, default: 0)
- max_tokens: # Retained Tokens (type: integer, default: 100)
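As a sketch of how these might be tuned, the snippet below marks max_tokens with tune() and builds a candidate grid with dials::max_tokens(); the range is chosen purely for illustration, and the model and resampling setup are assumed to exist elsewhere.

library(recipes)
library(textrecipes)
library(modeldata)
library(tune)
library(dials)
data(tate_text)

tune_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  # leave max_tokens undetermined so it can be tuned later
  step_tokenfilter(medium, max_tokens = tune()) %>%
  step_tf(medium)

# regular grid of candidate values for max_tokens (illustrative range)
token_grid <- grid_regular(max_tokens(range = c(100, 1000)), levels = 4)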
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize() to turn characters into tokens

Other Steps for Token Modification: step_lemma(), step_ngram(), step_pos_filter(), step_stem(), step_stopwords(), step_tokenmerge()
Examples
library(recipes)
library(textrecipes)
library(modeldata)
library(dplyr)
data(tate_text)

# tokenize, then keep the default 100 most frequent tokens
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_tokenfilter(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, new_data = NULL, medium) %>%
  slice(1:2)

bake(tate_obj, new_data = NULL) %>%
  slice(2) %>%
  pull(medium)

tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)