step_ngram {textrecipes} | R Documentation |
Generate n-grams From Token Variables
Description
step_ngram() creates a specification of a recipe step that will convert a token variable into a token variable of n-grams.
Usage
step_ngram(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  num_tokens = 3L,
  min_num_tokens = 3L,
  delim = "_",
  skip = FALSE,
  id = rand_id("ngram")
)
Arguments
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose which variables are affected by the step. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
columns | A character string of variable names that will be populated (eventually) by the |
num_tokens | The number of tokens in the n-gram. This must be an integer greater than or equal to 1. Defaults to 3. |
min_num_tokens | The minimum number of tokens in the n-gram. This must be an integer greater than or equal to 1 and smaller than |
delim | The separator between words in an n-gram. Defaults to "_". |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Details
Using this step leaves the ordering of the tokens meaningless. If min_num_tokens < num_tokens, the tokens are ordered by increasing n-gram size. For example, if min_num_tokens = 1 and num_tokens = 3, the output contains all the 1-grams, followed by all the 2-grams, followed by all the 3-grams.
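The ordering described here can be illustrated with a small base R sketch. The helper ngrams_sketch() is hypothetical and is not part of textrecipes; it only mimics the documented ordering (smaller n-grams first) and the delim separator, and it assumes that n-gram sizes longer than the token vector are simply skipped.

```r
# Hypothetical helper for illustration only (not the textrecipes implementation):
# build n-grams of every size from min_n up to n, smallest size first.
ngrams_sketch <- function(tokens, n = 3L, min_n = 3L, delim = "_") {
  stopifnot(min_n >= 1L, min_n <= n)
  out <- character(0)
  for (size in seq(min_n, n)) {
    # Assumption: sizes longer than the token vector produce no n-grams.
    if (size > length(tokens)) next
    starts <- seq_len(length(tokens) - size + 1L)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + size - 1L)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("oil", "paint", "on", "canvas")

# All 1-grams, then all 2-grams, then all 3-grams
ngrams_sketch(tokens, n = 3L, min_n = 1L)
```

With min_n equal to n (the defaults), only n-grams of a single size are produced, so the question of ordering across sizes does not arise.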
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
Tidying
When you tidy() this step, a tibble is returned that includes the column terms (the selectors or variables selected).
Tuning Parameters
This step has 1 tuning parameter:
- num_tokens: Number of tokens (type: integer, default: 3)
Case weights
The underlying operation does not allow for case weights.
See Also
step_tokenize() to turn characters into tokens
Other Steps for Token Modification: step_lemma(), step_pos_filter(), step_stem(), step_stopwords(), step_tokenfilter(), step_tokenmerge()
Examples
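library(recipes)
library(textrecipes) # provides step_tokenize() and step_ngram()
library(modeldata)
data(tate_text)

# Tokenize `medium`, then combine the tokens into n-grams (3-grams by default)
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_ngram(medium)

# Train the recipe
tate_obj <- tate_rec %>%
  prep()

# Inspect the n-gram column for the first two rows
bake(tate_obj, new_data = NULL, medium) %>%
  slice(1:2)

# Extract the n-grams generated for the second row
bake(tate_obj, new_data = NULL) %>%
  slice(2) %>%
  pull(medium)

# Compare the step before and after training
tidy(tate_rec, number = 2)
tidy(tate_obj, number = 2)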