step_text_normalization {textrecipes} | R Documentation |
Normalization of Character Variables
Description
step_text_normalization()
creates a specification of a recipe step that
will perform Unicode Normalization on character variables.
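As a rough illustration of what Unicode normalization changes, the sketch below compares a precomposed and a decomposed spelling of the same word. It assumes the stringi package is installed and calls its stri_trans_nfc() and stri_trans_nfd() functions directly; this page does not state which functions the step uses internally, so treat it as a standalone demonstration rather than a description of the step's implementation.

```r
library(stringi)

# "ö" stored as one precomposed code point (U+00F6)
composed <- "sch\u00f6n"
# "o" followed by a combining diaeresis (U+0308)
decomposed <- "scho\u0308n"

# The strings render alike but are not the same sequence of code points
identical(composed, decomposed)   # FALSE
stri_length(composed)             # 5
stri_length(decomposed)           # 6

# NFC composes, NFD decomposes; each maps one spelling onto the other
identical(stri_trans_nfc(decomposed), composed)   # TRUE
identical(stri_trans_nfd(composed), decomposed)   # TRUE
```

Without normalization, the two spellings would be treated as distinct values by joins, counts, and tokenizers.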
Usage
step_text_normalization(
recipe,
...,
role = NA,
trained = FALSE,
columns = NULL,
normalization_form = "nfc",
skip = FALSE,
id = rand_id("text_normalization")
)
Arguments
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which
variables are affected by the step. See |
role |
Not used by this step since no new variables are created. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
columns |
A character string of variable names that will
be populated (eventually) by the |
normalization_form |
A single character string determining the Unicode
Normalization. Must be one of "nfc", "nfd", "nfkd", "nfkc", or
"nfkc_casefold". Defaults to "nfc". See |
skip |
A logical. Should the step be skipped when the
recipe is baked by |
id |
A character string that is unique to this step to identify it. |
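A minimal sketch of choosing a non-default form through the normalization_form argument. The data set and the object names (ligature_data, rec_nfkc) are made up for illustration, and the example assumes the recipes, textrecipes, and stringi packages are installed. The compatibility form "nfkc" is used because it should replace the single-character "fi" ligature with the two letters "f" and "i", which "nfc" leaves alone.

```r
library(recipes)
library(textrecipes)

# "\ufb01" is the single-character "fi" ligature (hypothetical sample data)
ligature_data <- tibble(text = c("\ufb01ne", "fine"))

rec_nfkc <- recipe(~ text, data = ligature_data) %>%
  step_text_normalization(text, normalization_form = "nfkc")

baked <- bake(prep(rec_nfkc), new_data = NULL)

# as.character() covers the case where the baked column is a factor
normalized <- as.character(baked$text)
normalized
```

The compatibility forms ("nfkd", "nfkc", "nfkc_casefold") can discard formatting distinctions such as ligatures, so they suit matching and search better than faithful storage of the original text.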
Value
An updated version of recipe
with the new step added
to the sequence of existing steps (if any).
Tidying
When you tidy()
this step, a tibble is returned with columns terms
(the selectors or variables selected) and normalization_form
(the type of normalization).
Case weights
The underlying operation does not allow for case weights.
See Also
step_texthash()
for feature hashing.
Examples
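library(recipes)
library(textrecipes)  # provides step_text_normalization()

# "ö" as one precomposed code point vs. "o" plus a combining diaeresis
sample_data <- tibble(text = c("sch\U00f6n", "scho\U0308n"))

rec <- recipe(~., data = sample_data) %>%
  step_text_normalization(text)

prepped <- rec %>%
  prep()

# Both rows after normalization to the default form, "nfc"
bake(prepped, new_data = NULL, text) %>%
  slice(1:2)

# The second row, which was supplied in decomposed form
bake(prepped, new_data = NULL) %>%
  slice(2) %>%
  pull(text)

# Step details before and after training
tidy(rec, number = 1)
tidy(prepped, number = 1)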