step_tokenize_bpe {textrecipes} | R Documentation |
BPE Tokenization of Character Variables
Description
step_tokenize_bpe() creates a specification of a recipe step that will convert a character predictor into a token variable using Byte Pair Encoding.
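To make the idea of Byte Pair Encoding concrete, the following is a minimal base R sketch of the merge-learning loop: start from single characters and repeatedly merge the most frequent adjacent pair. It is an illustration only and not the implementation used by this step. The helper names toy_bpe() and merge_pair() are hypothetical, and the sketch assumes the input words contain no spaces.

```r
# Toy BPE learner (illustration only; not the textrecipes implementation).
# Assumes the words contain no spaces, because a space is used internally
# to separate the two symbols of a candidate pair.

# Replace every adjacent occurrence of (left, right) with the merged symbol.
merge_pair <- function(symbols, left, right) {
  out <- character(0)
  i <- 1
  while (i <= length(symbols)) {
    is_match <- i < length(symbols) &&
      symbols[i] == left && symbols[i + 1] == right
    if (is_match) {
      out <- c(out, paste0(left, right))
      i <- i + 2
    } else {
      out <- c(out, symbols[i])
      i <- i + 1
    }
  }
  out
}

# Learn up to `n_merges` merge rules from a character vector of words.
toy_bpe <- function(words, n_merges) {
  # Start with every word split into single-character symbols.
  tokens <- strsplit(words, "", fixed = TRUE)
  merges <- character(0)
  for (step in seq_len(n_merges)) {
    # Collect all adjacent symbol pairs across all words.
    pairs <- unlist(lapply(tokens, function(s) {
      if (length(s) < 2) return(character(0))
      paste(s[-length(s)], s[-1])
    }))
    # Stop early when nothing is left to merge.
    if (length(pairs) == 0) break
    # Pick the most frequent pair (ties go to the first name in table order).
    counts <- table(pairs)
    best <- strsplit(names(counts)[which.max(counts)], " ", fixed = TRUE)[[1]]
    tokens <- lapply(tokens, merge_pair, left = best[1], right = best[2])
    merges <- c(merges, paste0(best[1], best[2]))
  }
  list(merges = merges, tokens = tokens)
}

toy <- toy_bpe(c("lower", "lowest", "low"), n_merges = 3)
toy$merges # merged symbols, in the order they were learned
toy$tokens # each word expressed with the learned symbols
```

In this toy version the vocabulary is the set of starting characters plus the learned merges, so a larger number of merges yields longer, more word-like tokens. The vocabulary_size argument of the step plays a comparable role.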
Usage
step_tokenize_bpe(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  columns = NULL,
  vocabulary_size = 1000,
  options = list(),
  res = NULL,
  skip = FALSE,
  id = rand_id("tokenize_bpe")
)
Arguments
recipe | A recipe object. The step will be added to the sequence of operations for this recipe. |
... | One or more selector functions to choose which variables are affected by the step. See |
role | Not used by this step since no new variables are created. |
trained | A logical to indicate if the quantities for preprocessing have been estimated. |
columns | A character string of variable names that will be populated (eventually) by the |
vocabulary_size | Integer, indicating the number of tokens in the final vocabulary. Defaults to 1000. Tuning this value is highly encouraged. |
options | A list of options passed to the tokenizer. |
res | The fitted |
skip | A logical. Should the step be skipped when the recipe is baked by |
id | A character string that is unique to this step to identify it. |
Value
An updated version of recipe with the new step added to the sequence of existing steps (if any).
Tidying
When you tidy() this step, a tibble is returned with the column terms (the selectors or variables selected).
Tuning Parameters
This step has 1 tuning parameter:
- vocabulary_size: # Unique Tokens in Vocabulary (type: integer, default: 1000)
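As a minimal sketch of how this parameter might be marked for tuning, the recipe below sets vocabulary_size to a placeholder instead of a fixed number. It assumes the recipes, textrecipes, modeldata, and hardhat packages are installed. The object name tune_rec is hypothetical, and the resampling and grid search that would supply candidate values are assumed to be handled elsewhere (for example by the tune package) and are not shown.

```r
library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)

# Mark vocabulary_size for tuning instead of fixing it at the default of 1000.
# hardhat::tune() is only a placeholder; a separate tuning routine
# (assumed, not shown here) would later fill in candidate values.
tune_rec <- recipe(~., data = tate_text) %>%
  step_tokenize_bpe(medium, vocabulary_size = hardhat::tune())
```

A recipe that contains a placeholder like this is meant to be finalized with a concrete value before it is prepped.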
Case weights
The underlying operation does not allow for case weights.
See Also
step_untokenize() to untokenize.
Other Steps for Tokenization: step_tokenize_sentencepiece(), step_tokenize_wordpiece(), step_tokenize()
Examples
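library(recipes)
library(textrecipes) # provides step_tokenize_bpe()
library(modeldata)   # provides the tate_text data set
data(tate_text)

# Specify a recipe that tokenizes the medium column with BPE
tate_rec <- recipe(~., data = tate_text) %>%
  step_tokenize_bpe(medium)

# Train the step on the data supplied to recipe()
tate_obj <- tate_rec %>%
  prep()

# Inspect the tokenized column for the first two rows
bake(tate_obj, new_data = NULL, medium) %>%
  slice(1:2)

# Pull out the tokens for the second row
bake(tate_obj, new_data = NULL) %>%
  slice(2) %>%
  pull(medium)

# Compare the step summary before and after training
tidy(tate_rec, number = 1)
tidy(tate_obj, number = 1)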