target_encoding_lab {collinear} | R Documentation |
Target encoding of non-numeric variables
Description
Target encoding involves replacing the values of categorical variables with numeric ones from a "target variable", usually a model's response. Target encoding can be useful for improving the performance of machine learning models.
This function identifies categorical variables in the input data frame, and transforms them using a set of target-encoding methods selected by the user, and returns the input data frame with the newly encoded variables.
The target encoding methods implemented in this function are:
"rank": Returns the rank of the group as an integer, starting with 1 as the rank of the group with the lowest mean of the response variable. The variables returned by this method are named with the suffix "__encoded_rank". This method is implemented in the function target_encoding_rank().
"mean": Replaces each value of the categorical variable with the mean of the response across the category the given value belongs to. This option accepts the argument white_noise to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_mean". This method is implemented in the function target_encoding_mean().
"rnorm": Computes the mean and standard deviation of the response for each group of the categorical variable, and uses rnorm() to generate random values from a normal distribution with these parameters. The argument rnorm_sd_multiplier is used as a multiplier of the standard deviation to control the range of values produced by rnorm() for each group of the categorical predictor. The variables returned by this method are named with the suffix "__encoded_rnorm". This method is implemented in the function target_encoding_rnorm().
"loo": This is the leave-one-out method, which replaces each categorical value with the mean of the response variable across the other cases within the same group. This method supports the white_noise argument to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_loo". This method is implemented in the function target_encoding_loo().
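The arithmetic behind the "mean", "rank", and "loo" encodings can be sketched in base R. This is a minimal illustration on hypothetical toy data, not the package's implementation: the data frame toy, its values, and the helper objects are assumptions made for the example, and only the column suffixes follow the naming described above.

```r
# Hypothetical toy data: a categorical predictor and a numeric response
toy <- data.frame(
  group = c("a", "a", "b", "b", "b", "c", "c"),
  y = c(1, 3, 10, 12, 14, 5, 7),
  stringsAsFactors = FALSE
)

# "mean": each value is replaced by the response mean of its group
toy$group__encoded_mean <- stats::ave(toy$y, toy$group, FUN = mean)

# "rank": groups are ranked by their response mean,
# with 1 assigned to the group with the lowest mean
group_means <- vapply(split(toy$y, toy$group), mean, numeric(1))
group_ranks <- rank(group_means, ties.method = "first")
toy$group__encoded_rank <- as.integer(group_ranks[toy$group])

# "loo": mean of the response across the OTHER cases of the same group
# (groups with a single case yield NaN in this simple sketch)
group_sum <- stats::ave(toy$y, toy$group, FUN = sum)
group_n <- stats::ave(toy$y, toy$group, FUN = length)
toy$group__encoded_loo <- (group_sum - toy$y) / (group_n - 1)

print(toy)
```

In this sketch the "mean" and "rank" columns are constant within each group, while the "loo" column varies from case to case because each case is excluded from its own group mean.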
The methods "mean" and "rank" support the white_noise argument, which is a fraction of the range of the response variable and sets the maximum possible value of white noise to be added. For example, if response is within 0 and 1, a white_noise of 0.25 will add to every value of the encoded variable a random number selected from a normal distribution between -0.25 and 0.25. This argument helps control potential overfitting induced by the encoded variable.
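The scaling of white_noise by the range of the response can be illustrated with a short base R sketch. It assumes the noise is drawn uniformly with stats::runif() so that the stated bounds are respected; the generator used by the package may differ, and the vectors response and encoded are hypothetical.

```r
set.seed(1)

# Hypothetical response in [0, 1] and an already encoded predictor
response <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
encoded <- c(0.1, 0.1, 0.5, 0.5, 0.9, 0.9)

# white_noise is a fraction of the range of the response
white_noise <- 0.25
max_noise <- white_noise * diff(range(response))

# Assumption: uniform noise bounded by -max_noise and max_noise
noise <- stats::runif(
  n = length(encoded),
  min = -max_noise,
  max = max_noise
)

encoded_noisy <- encoded + noise

# No value moves by more than max_noise
max(abs(encoded_noisy - encoded))
```

With a response ranging from 0 to 1, max_noise equals white_noise itself, so every encoded value shifts by at most 0.25 in either direction.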
The method "rnorm" has the argument rnorm_sd_multiplier, which multiplies the standard deviation argument of the rnorm() function to control the spread of the encoded values between groups. Values smaller than 1 reduce the spread in the results, while values larger than 1 have the opposite effect.
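The effect of rnorm_sd_multiplier can be sketched in base R as follows. This is an illustration under stated assumptions rather than the code of target_encoding_rnorm(): the toy data and the helper objects group_mean and group_sd are hypothetical, and only the call to stats::rnorm() with a scaled standard deviation mirrors the description above.

```r
set.seed(1)

# Hypothetical toy data with two groups of different mean and spread
toy <- data.frame(
  group = rep(c("a", "b"), each = 50),
  y = c(
    stats::rnorm(50, mean = 0, sd = 1),
    stats::rnorm(50, mean = 10, sd = 2)
  ),
  stringsAsFactors = FALSE
)

# Per-case group mean and standard deviation of the response
group_mean <- stats::ave(toy$y, toy$group, FUN = mean)
group_sd <- stats::ave(toy$y, toy$group, FUN = stats::sd)

# Values below 1 shrink the spread of the encoded values within each group
rnorm_sd_multiplier <- 0.1

toy$group__encoded_rnorm <- stats::rnorm(
  n = nrow(toy),
  mean = group_mean,
  sd = group_sd * rnorm_sd_multiplier
)

# Spread of the response versus the encoded variable, per group
tapply(toy$y, toy$group, stats::sd)
tapply(toy$group__encoded_rnorm, toy$group, stats::sd)
```

Because stats::rnorm() is vectorized over mean and sd, each case draws from the distribution of its own group in a single call.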
Usage
target_encoding_lab(
df = NULL,
response = NULL,
predictors = NULL,
encoding_methods = c("mean", "rank", "loo", "rnorm"),
smoothing = 0,
rnorm_sd_multiplier = 0,
seed = 1,
white_noise = 0,
replace = FALSE,
verbose = TRUE
)
Arguments
df |
(required; data frame, tibble, or sf) A training data frame. Default: NULL |
response |
(required; character string) Name of the response. Must be a column name of |
predictors |
(required; character vector) Names of all the predictors in |
encoding_methods |
(optional; character string or vector) Names of the target encoding methods. Default: c("mean", "rank", "loo", "rnorm") |
smoothing |
(optional; numeric) Argument of |
rnorm_sd_multiplier |
(optional; numeric) Numeric with multiplier of the standard deviation of each group in the categorical variable, in the range 0-1. Controls the variability in the encoded variables to mitigate potential overfitting. Default: |
seed |
(optional; integer) Random seed to facilitate reproducibility when |
white_noise |
(optional; numeric) Numeric with white noise values in the range 0-1, representing a fraction of the range of the response to be added as noise to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: |
replace |
(optional; logical) If |
verbose |
(optional; logical) If TRUE, messages generated during the execution of the function are printed to the console. Default: TRUE |
Value
The input data frame with the newly encoded columns added if replace is FALSE, or the input data frame with the categorical columns replaced by their encoded versions if replace is TRUE.
Author(s)
Blas M. Benito
References
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538
Examples
data(
  vi,
  vi_predictors
)

#subset to limit example run time
vi <- vi[1:1000, ]

#applying all methods for a continuous response
df <- target_encoding_lab(
  df = vi,
  response = "vi_mean",
  predictors = "koppen_zone",
  encoding_methods = c(
    "mean",
    "rank",
    "rnorm",
    "loo"
  ),
  rnorm_sd_multiplier = c(0, 0.1, 0.2),
  white_noise = c(0, 0.1, 0.2)
)

#identify encoded predictors
#(fixed string match on the "__encoded" suffix)
predictors.encoded <- grep(
  pattern = "__encoded",
  x = colnames(df),
  value = TRUE,
  fixed = TRUE
)

#correlation between encoded predictors and the response
stats::cor(
  x = df[["vi_mean"]],
  y = df[, predictors.encoded],
  use = "pairwise.complete.obs"
)