target_encoding_mean {collinear} | R Documentation |
Target-encoding methods
Description
Methods to apply target-encoding to individual categorical variables. The functions implemented are:
- target_encoding_mean(): Each group is identified by the mean of the response over the group cases. The argument smoothing pushes the mean of small groups towards the global mean to avoid overfitting. White noise can be added via the white_noise argument. Columns encoded with this function are identified by the suffix "__encoded_mean". If white_noise is used, then the amount of white noise is also added to the suffix.
- target_encoding_rank(): Each group is identified by the rank of the mean of the response variable over the group cases. The group with the lowest mean receives the rank 1. White noise can be added via the white_noise argument. Columns encoded with this function are identified by the suffix "__encoded_rank". If white_noise is used, then the amount of noise is also added to the suffix.
- target_encoding_rnorm(): Each case in a group receives a value drawn from a normal distribution with the mean and the standard deviation of the response over the cases of the group. The argument rnorm_sd_multiplier multiplies the standard deviation to reduce the spread of the obtained values. Columns encoded with this function are identified by the suffix "__encoded_rnorm_rnorm_sd_multiplier_X", where X is the value of rnorm_sd_multiplier used.
- target_encoding_loo(): The suffix "loo" stands for "leave-one-out". Each case in a group is encoded as the average of the response over the other cases of the group. Columns encoded with this function are identified by the suffix "__encoded_loo".
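The four encodings described above can be illustrated with a minimal base R sketch. This is not the package's implementation: the data frame toy and the column names encoded_mean, encoded_rank, encoded_loo, and encoded_rnorm are hypothetical, and smoothing and white noise are left out. Every group in the toy data has at least two cases, because a single-case group has no "other cases" for leave-one-out and no standard deviation for the normal draw.

```r
# Hypothetical toy data (not part of the collinear package).
toy <- data.frame(
  y = c(1, 2, 3, 10, 20, 5, 7),
  group = c("a", "a", "a", "b", "b", "c", "c")
)

# Per-group summaries of the response.
group_mean <- tapply(toy$y, toy$group, mean)
group_sum <- tapply(toy$y, toy$group, sum)
group_n <- tapply(toy$y, toy$group, length)
group_sd <- tapply(toy$y, toy$group, sd)

# Mean encoding: each case gets the mean of the response in its group.
toy$encoded_mean <- as.numeric(group_mean[toy$group])

# Rank encoding: groups are ranked by their mean; the lowest mean gets rank 1.
group_rank <- rank(group_mean, ties.method = "first")
toy$encoded_rank <- as.numeric(group_rank[toy$group])

# Leave-one-out encoding: mean of the response over the other cases
# of the same group (group sum minus the case itself, over n - 1).
toy$encoded_loo <- as.numeric(
  (group_sum[toy$group] - toy$y) / (group_n[toy$group] - 1)
)

# rnorm encoding: each case gets a draw from a normal distribution with
# the mean and standard deviation of the response in its group.
set.seed(1)
toy$encoded_rnorm <- rnorm(
  n = nrow(toy),
  mean = as.numeric(group_mean[toy$group]),
  sd = as.numeric(group_sd[toy$group])
)

toy
```

Note that the leave-one-out values differ between cases of the same group, while the mean and rank encodings are constant within a group.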
Usage
target_encoding_mean(
  df,
  response,
  predictor,
  smoothing = 0,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rnorm(
  df,
  response,
  predictor,
  rnorm_sd_multiplier = 1,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rank(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_loo(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

add_white_noise(df, response, predictor, white_noise = 0.1, seed = 1)
Arguments
df |
(required; data frame, tibble, or sf) A training data frame. Default: NULL |
response |
(required; character string) Name of the response. Must be a column name of |
predictor |
(required; character) Name of the categorical variable to encode. Default: NULL |
smoothing |
(optional; numeric) Argument of |
white_noise |
(optional; numeric) Amount of white noise to add to the encoded variable, in the range 0-1, expressed as a fraction of the range of the response. Adds variability to the encoded variable to mitigate potential overfitting. Default: 0. |
seed |
(optional; integer) Random seed to facilitate reproducibility. Default: 1 |
replace |
(optional; logical) Advanced option that changes the behavior of the function. Use only if you really know exactly what you need. If |
verbose |
(optional; logical) If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE |
rnorm_sd_multiplier |
(optional; numeric) Multiplier, in the range 0-1, of the standard deviation of the response within each group of the categorical variable. Controls the variability of the encoded variable to mitigate potential overfitting. Default: 1 |
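As a rough illustration of how smoothing, white_noise, and rnorm_sd_multiplier might act on the encoded values, here is a base R sketch under stated assumptions. The exact formulas used by the package are not given on this page, so both the additive shrinkage formula and the uniform noise distribution below are assumptions, and the names toy, smooth_mean(), and add_noise() are hypothetical helpers written for this example.

```r
# Hypothetical toy data (not part of the collinear package).
toy <- data.frame(
  y = c(1, 2, 3, 10, 20, 5, 7),
  group = c("a", "a", "a", "b", "b", "c", "c")
)

group_mean <- tapply(toy$y, toy$group, mean)
group_n <- tapply(toy$y, toy$group, length)
group_sd <- tapply(toy$y, toy$group, sd)
global_mean <- mean(toy$y)

# ASSUMPTION: additive smoothing. Each group mean is a weighted average
# of the group mean (weight = group size) and the global mean
# (weight = smoothing), so small groups are pulled hardest.
smooth_mean <- function(smoothing) {
  (group_n * group_mean + smoothing * global_mean) / (group_n + smoothing)
}
smooth_mean(0) # plain group means
smooth_mean(5) # group means shrunk towards the global mean

# ASSUMPTION: uniform noise whose total width is white_noise times the
# range of the response, centered on zero.
add_noise <- function(x, response, white_noise, seed = 1) {
  set.seed(seed)
  width <- white_noise * diff(range(response))
  x + runif(length(x), min = -width / 2, max = width / 2)
}
encoded <- as.numeric(group_mean[toy$group])
noisy <- add_noise(encoded, response = toy$y, white_noise = 0.1)

# The sd multiplier narrows the normal distribution each case is drawn from.
set.seed(1)
narrow <- rnorm(
  n = nrow(toy),
  mean = as.numeric(group_mean[toy$group]),
  sd = 0.1 * as.numeric(group_sd[toy$group])
)
```

With smoothing = 0 the sketch returns the plain group means, which matches the default in the usage section; larger values move every group mean closer to the global mean.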
Value
The input data frame with a target-encoded variable.
Author(s)
Blas M. Benito
References
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538
Examples
data(vi)
#subset to limit example run time
vi <- vi[1:1000, ]
#mean encoding
#-------------

#without noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#group rank
#----------
df <- target_encoding_rank(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#leave-one-out
#-------------

#without noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#rnorm
#-----

#without sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  rnorm_sd_multiplier = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)