| target_encoding_lab {collinear} | R Documentation |
Target encoding of non-numeric variables
Description
Target encoding involves replacing the values of categorical variables with numeric ones from a "target variable", usually a model's response. Target encoding can be useful for improving the performance of machine learning models.
This function identifies the categorical variables in the input data frame, transforms them with the target-encoding methods selected by the user, and returns the input data frame with the newly encoded variables.
The target encoding methods implemented in this function are:
"rank": Returns the rank of the group as an integer, starting with 1 as the rank of the group with the lowest mean of the response variable. The variables returned by this method are named with the suffix "__encoded_rank". This method is implemented in the function target_encoding_rank().
"mean": Replaces each value of the categorical variable with the mean of the response across the category the given value belongs to. This option accepts the argument white_noise to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_mean". This method is implemented in the function target_encoding_mean().
"rnorm": Computes the mean and standard deviation of the response for each group of the categorical variable, and uses rnorm() to generate random values from a normal distribution with these parameters. The argument rnorm_sd_multiplier is used as a multiplier of the standard deviation to control the range of values produced by rnorm() for each group of the categorical predictor. The variables returned by this method are named with the suffix "__encoded_rnorm". This method is implemented in the function target_encoding_rnorm().
"loo": This is the leave-one-out method, which replaces each categorical value with the mean of the response variable across the other cases within the same group. This method supports the white_noise argument to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_loo". This method is implemented in the function target_encoding_loo().
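The "mean", "rank", and "loo" encodings described above can be illustrated with a minimal base R sketch. The toy data frame and its column names are hypothetical, and the code is not the package's implementation; it only reproduces the arithmetic described for each method.

```r
# Toy data (hypothetical, for illustration only)
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  y     = c(1, 2, 3, 10, 12, 5, 6, 7)
)

# "mean": each value is replaced by the mean of the response in its group
toy$group__encoded_mean <- ave(toy$y, toy$group, FUN = mean)

# "rank": groups ordered by mean response; the lowest mean gets rank 1
group_means <- tapply(toy$y, toy$group, mean)
group_ranks <- rank(group_means, ties.method = "first")
toy$group__encoded_rank <- as.integer(group_ranks[as.character(toy$group)])

# "loo": mean of the response over the OTHER cases in the same group
# (groups with a single case yield NaN in this simple sketch)
group_sum <- ave(toy$y, toy$group, FUN = sum)
group_n   <- ave(toy$y, toy$group, FUN = length)
toy$group__encoded_loo <- (group_sum - toy$y) / (group_n - 1)

toy
```

Note that the "mean" and "rank" encodings are constant within a group, while the "loo" encoding varies from case to case because each case is excluded from its own group mean.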
The methods "mean", "rank", and "loo" support the white_noise argument. It is expressed as a fraction of the range of the response variable and sets the maximum magnitude of the white noise to be added. For example, if the response ranges between 0 and 1, a white_noise of 0.25 adds a random number between -0.25 and 0.25 to every value of the encoded variable. This argument helps control potential overfitting induced by the encoded variable.
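As a rough illustration of how noise scaled to the range of the response might be added, the sketch below perturbs a mean-encoded variable. It assumes uniformly distributed noise so that the stated bounds hold exactly; the distribution and the scaling used by the package may differ, and all object names are hypothetical.

```r
set.seed(1)

# Toy data (hypothetical): response within 0 and 1
toy <- data.frame(
  group = rep(c("a", "b"), each = 3),
  y     = c(0.1, 0.4, 0.35, 0.9, 1.0, 0.0)
)

# Mean encoding: constant within each group
encoded <- ave(toy$y, toy$group, FUN = mean)

# Maximum noise as a fraction of the range of the response
white_noise <- 0.25
max_noise   <- white_noise * diff(range(toy$y))

# ASSUMPTION: uniform noise bounded by +/- max_noise
noise <- runif(n = nrow(toy), min = -max_noise, max = max_noise)
encoded_noisy <- encoded + noise

cbind(toy, encoded, encoded_noisy)
```

After adding noise, cases in the same group no longer share an identical encoded value, which is what weakens the link between the encoded variable and the response.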
The method "rnorm" has the argument rnorm_sd_multiplier, which multiplies the standard deviation argument of the stats::rnorm() function to control the spread of the encoded values between groups. Values smaller than 1 reduce the spread in the results, while values larger than 1 have the opposite effect.
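The effect of the multiplier can be sketched in base R as follows. This is an illustration under assumed toy data with hypothetical names, not the code of target_encoding_rnorm().

```r
set.seed(1)

# Toy data (hypothetical): two groups with different means and spreads
toy <- data.frame(
  group = rep(c("a", "b"), each = 50),
  y     = c(rnorm(50, mean = 0, sd = 1), rnorm(50, mean = 10, sd = 2))
)

# Per-case group mean and standard deviation of the response
group_mean <- ave(toy$y, toy$group, FUN = mean)
group_sd   <- ave(toy$y, toy$group, FUN = sd)

# Draw encoded values around the group mean with a shrunken spread
rnorm_sd_multiplier <- 0.1
toy$group__encoded_rnorm <- rnorm(
  n    = nrow(toy),
  mean = group_mean,
  sd   = group_sd * rnorm_sd_multiplier
)

# Compare within-group spread of the response and the encoded variable
tapply(toy$y, toy$group, sd)
tapply(toy$group__encoded_rnorm, toy$group, sd)
```

With a multiplier of 0, rnorm() receives a standard deviation of 0 and returns the group mean itself, so the result collapses to a plain mean encoding.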
Usage
target_encoding_lab(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_methods = c("mean", "rank", "loo", "rnorm"),
  smoothing = 0,
  rnorm_sd_multiplier = 0,
  seed = 1,
  white_noise = 0,
  replace = FALSE,
  verbose = TRUE
)
Arguments
df |
(required; data frame, tibble, or sf) A training data frame. Default: NULL |
response |
(required; character string) Name of the response. Must be a column name of |
predictors |
(required; character vector) Names of all the predictors in |
encoding_methods |
(optional; character string or vector) Names of the target encoding methods. Default: c("mean", "rank", "loo", "rnorm") |
smoothing |
(optional; numeric) Argument of |
rnorm_sd_multiplier |
(optional; numeric) Multiplier of the standard deviation of each group in the categorical variable, in the range 0-1. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0 |
seed |
(optional; integer) Random seed to facilitate reproducibility when |
white_noise |
(optional; numeric) White noise values in the range 0-1, representing a fraction of the range of the response to be added as noise to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0 |
replace |
(optional; logical) If |
verbose |
(optional; logical) If TRUE, messages generated during the execution of the function are printed to the console. Default: TRUE |
Value
The input data frame with the encoded variables added as new columns if replace is FALSE, or the input data frame with the categorical variables replaced by their encoded versions if replace is TRUE.
Author(s)
Blas M. Benito
References
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538
Examples
data(
  vi,
  vi_predictors
)

# subset to limit example run time
vi <- vi[1:1000, ]

# apply all methods to a continuous response
df <- target_encoding_lab(
  df = vi,
  response = "vi_mean",
  predictors = "koppen_zone",
  encoding_methods = c(
    "mean",
    "rank",
    "rnorm",
    "loo"
  ),
  rnorm_sd_multiplier = c(0, 0.1, 0.2),
  white_noise = c(0, 0.1, 0.2)
)

# identify encoded predictors by their "__encoded" suffix
# (fixed = TRUE matches the literal string instead of a regular expression)
predictors.encoded <- grep(
  pattern = "__encoded",
  x = colnames(df),
  value = TRUE,
  fixed = TRUE
)

# correlation between encoded predictors and the response
stats::cor(
  x = df[["vi_mean"]],
  y = df[, predictors.encoded],
  use = "pairwise.complete.obs"
)