R: Target encoding of non-numeric variables

target_encoding_lab {collinear}

R Documentation

Target encoding of non-numeric variables

Description

Target encoding replaces the values of categorical variables with numeric values derived from a "target variable", usually a model's response. It can improve the performance of machine learning models that cannot handle categorical predictors directly.

This function identifies the categorical variables in the input data frame, transforms them with the target-encoding methods selected by the user, and returns the input data frame with the newly encoded variables.

The target encoding methods implemented in this function are:

"rank": Returns the rank of the group as an integer, starting with 1 as the rank of the group with the lowest mean of the response variable. The variables returned by this method are named with the suffix "__encoded_rank". This method is implemented in the function target_encoding_rank().
"mean": Replaces each value of the categorical variable with the mean of the response across the category the given value belongs to. This option accepts the argument "white_noise" to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_mean". This method is implemented in the function target_encoding_mean().
"rnorm": Computes the mean and standard deviation of the response for each group of the categorical variable, and uses rnorm() to generate random values from a normal distribution with these parameters. The argument rnorm_sd_multiplier is used as a multiplier of the standard deviation to control the range of values produced by rnorm() for each group of the categorical predictor. The variables returned by this method are named with the suffix "__encoded_rnorm". This method is implemented in the function target_encoding_rnorm().
"loo": The leave-one-out method, which replaces each categorical value with the mean of the response variable across the other cases within the same group. This method supports the white_noise argument to limit potential overfitting. The variables returned by this method are named with the suffix "__encoded_loo". This method is implemented in the function target_encoding_loo().
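
The deterministic methods above ("mean", "rank", and "loo") can be illustrated with a minimal base-R sketch. This is not the package's implementation: the toy data frame and all object names are hypothetical, and only the output column suffixes follow the naming described above.

```r
# Hypothetical toy data: a numeric response and one categorical predictor
toy <- data.frame(
  y = c(1, 2, 3, 10, 12, 20, 22, 24),
  group = c("a", "a", "a", "b", "b", "c", "c", "c")
)

# "mean": every row receives the mean of the response in its group
toy$group__encoded_mean <- ave(toy$y, toy$group, FUN = mean)

# "rank": integer rank of each group mean, 1 = group with the lowest mean
group_means <- tapply(toy$y, toy$group, mean)
group_ranks <- rank(group_means, ties.method = "first")
toy$group__encoded_rank <- as.integer(group_ranks[as.character(toy$group)])

# "loo": mean of the response across the other rows of the same group
# (a group with a single row yields NaN in this sketch)
group_sum <- ave(toy$y, toy$group, FUN = sum)
group_n <- ave(toy$y, toy$group, FUN = length)
toy$group__encoded_loo <- (group_sum - toy$y) / (group_n - 1)

toy
```

Note that "mean" and "rank" assign one value per group, while "loo" assigns a different value to each row, because every row is excluded from its own group mean.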

The methods "mean", "rank", and "loo" support the white_noise argument, a fraction of the range of the response variable that sets the maximum amount of white noise to be added. For example, if the response ranges between 0 and 1, a white_noise of 0.25 adds a random number between -0.25 and 0.25 to every value of the encoded variable. This argument helps control potential overfitting induced by the encoded variable.
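
The scaling of white noise by the range of the response can be sketched in base R as follows. The use of runif() to draw bounded noise is an assumption made for illustration, as are all object names; the package may generate its noise differently.

```r
set.seed(1)

# Hypothetical response within 0 and 1, and an already encoded predictor
response <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
encoded <- c(0.2, 0.2, 0.2, 0.8, 0.8, 0.8)

# white_noise is a fraction of the range of the response
white_noise <- 0.25
max_noise <- white_noise * diff(range(response))

# ASSUMPTION: bounded noise drawn with runif(); not confirmed by the package
noise <- runif(length(encoded), min = -max_noise, max = max_noise)
encoded_noisy <- encoded + noise

# rows of the same group no longer share an identical encoded value
encoded_noisy
```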

The method "rnorm" has the argument rnorm_sd_multiplier, which multiplies the standard deviation argument of the rnorm() function to control the spread of the encoded values between groups. Values smaller than 1 reduce the spread in the results, while values larger than 1 have the opposite effect.
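
The effect of the multiplier can be sketched with stats::rnorm() alone. The helper encode_rnorm() and the toy data are hypothetical and written only for this illustration; they are not part of the package.

```r
set.seed(1)

# Hypothetical toy data with three groups
toy <- data.frame(
  y = c(1, 2, 3, 10, 12, 14, 20, 22, 24),
  group = rep(c("a", "b", "c"), each = 3)
)

# Per-row group mean and group standard deviation of the response
group_mean <- ave(toy$y, toy$group, FUN = mean)
group_sd <- ave(toy$y, toy$group, FUN = sd)

# Hypothetical helper: one draw per row from N(group mean, group sd * multiplier)
encode_rnorm <- function(group_mean, group_sd, multiplier) {
  rnorm(length(group_mean), mean = group_mean, sd = group_sd * multiplier)
}

narrow <- encode_rnorm(group_mean, group_sd, multiplier = 0.1)
wide <- encode_rnorm(group_mean, group_sd, multiplier = 2)

# Spread of the encoded values around their group means
sd(narrow - group_mean)
sd(wide - group_mean)
```

In this sketch a multiplier of 0 collapses the encoding to the group means, since rnorm() returns the mean when the standard deviation is 0.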

Usage

target_encoding_lab(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_methods = c("mean", "rank", "loo", "rnorm"),
  smoothing = 0,
  rnorm_sd_multiplier = 0,
  seed = 1,
  white_noise = 0,
  replace = FALSE,
  verbose = TRUE
)

Arguments

`df`	(required; data frame, tibble, or sf) A training data frame. Default: NULL
`response`	(required; character string) Name of the response. Must be a column name of `df`. Default: NULL
`predictors`	(required; character vector) Names of all the predictors in `df`. Only character and factor predictors are processed, but all are returned in the output data frame. Default: NULL
`encoding_methods`	(optional; character string or vector) Names of the target encoding methods. Default: c("mean", "rank", "loo", "rnorm")
`smoothing`	(optional; numeric) Argument of `target_encoding_mean()` (method "mean"). Minimum group size that keeps the mean of the group. Groups smaller than this have their means pulled towards the global mean of the response. Default: 0
`rnorm_sd_multiplier`	(optional; numeric) Multiplier of the standard deviation of each group in the categorical variable, in the range 0-1. Controls the variability in the encoded variables to mitigate potential overfitting. Default: `0`
`seed`	(optional; integer) Random seed to facilitate reproducibility when `white_noise` is not 0. Default: 1
`white_noise`	(optional; numeric) Numeric with white noise values in the range 0-1, representing a fraction of the range of the response to be added as noise to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: `0`.
`replace`	(optional; logical) If `TRUE`, the function replaces each categorical variable with its encoded version, and returns the input data frame with the encoded variables instead of the original ones. Default: FALSE
`verbose`	(optional; logical) If TRUE, messages generated during the execution of the function are printed to the console. Default: TRUE
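
The behavior described for the `smoothing` argument can be sketched as a weighted average between the group mean and the global mean. The weighting formula below is an assumption chosen to match the description (groups at or above the minimum size keep their mean; smaller groups are pulled towards the global mean); it is not taken from the package source, and all object names are hypothetical.

```r
# Hypothetical toy data: one rare group and one common group
toy <- data.frame(
  y = c(10, 1, 2, 3, 4, 5),
  group = c("rare", rep("common", 5))
)

smoothing <- 3  # minimum group size that keeps the group mean

global_mean <- mean(toy$y)
group_mean <- ave(toy$y, toy$group, FUN = mean)
group_n <- ave(toy$y, toy$group, FUN = length)

# ASSUMPTION: linear weight capped at 1; with smoothing = 0 every weight is 1
weight <- pmin(group_n / smoothing, 1)
encoded <- weight * group_mean + (1 - weight) * global_mean

data.frame(toy, group_n, weight, encoded)
```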

Value

If replace is FALSE, the input data frame with the encoded variables added as new columns. If replace is TRUE, the input data frame with each categorical variable replaced by its encoded version.

Author(s)

Blas M. Benito

References

Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32 doi:10.1145/507533.507538

Examples


data(
  vi,
  vi_predictors
  )

#subset to limit example run time
vi <- vi[1:1000, ]

#applying all methods for a continuous response
df <- target_encoding_lab(
  df = vi,
  response = "vi_mean",
  predictors = "koppen_zone",
  encoding_methods = c(
    "mean",
    "rank",
    "rnorm",
    "loo"
  ),
  rnorm_sd_multiplier = c(0, 0.1, 0.2),
  white_noise = c(0, 0.1, 0.2)
)

#identify encoded predictors
predictors.encoded <- grep(
  pattern = "__encoded",
  x = colnames(df),
  value = TRUE
)

#correlation between encoded predictors and the response
stats::cor(
  x = df[["vi_mean"]],
  y = df[, predictors.encoded],
  use = "pairwise.complete.obs"
)

[Package collinear version 1.1.1 Index]