target_encoding_mean {collinear}R Documentation

Target-encoding methods

Description

Methods to apply target-encoding to individual categorical variables. The functions implemented are:

Usage

target_encoding_mean(
  df,
  response,
  predictor,
  smoothing = 0,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rnorm(
  df,
  response,
  predictor,
  rnorm_sd_multiplier = 1,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_rank(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

target_encoding_loo(
  df,
  response,
  predictor,
  white_noise = 0,
  seed = 1,
  replace = FALSE,
  verbose = TRUE
)

add_white_noise(df, response, predictor, white_noise = 0.1, seed = 1)

Arguments

df

(required; data frame, tibble, or sf) A training data frame. Default: NULL

response

(required; character string) Name of the response. Must be a column name of df. Default: NULL

predictor

(required; character) Name of the categorical variable to encode. Default: NULL

smoothing

(optional; numeric) Argument of target_encoding_mean(). Minimum group size that keeps the mean of the group. Groups smaller than this have their means pulled towards the global mean of the response. Default: 0.

white_noise

(optional; numeric) Numeric with white noise values in the range 0-1, representing a fraction of the range of the response to be added as noise to the encoded variable. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 0.

seed

(optional; integer) Random seed to facilitate reproducibility. Default: 1

replace

(optional; logical) Advanced option that changes the behavior of the function. Use only if you really know exactly what you need. If TRUE, it replaces each categorical variable with its encoded version, and returns the input data frame with the replaced variables.

verbose

(optional; logical) If TRUE, messages and plots generated during the execution of the function are displayed. Default: TRUE

rnorm_sd_multiplier

(optional; numeric) Numeric with multiplier of the standard deviation of each group in the categorical variable, in the range 0-1. Controls the variability in the encoded variables to mitigate potential overfitting. Default: 1

Value

The input data frame with a target-encoded variable.

Author(s)

Blas M. Benito

References

Examples


data(vi)

#subset to limit example run time
vi <- vi[1:1000, ]

#mean encoding
#-------------

#without noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with noise
df <- target_encoding_mean(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#group rank
#----------

df <- target_encoding_rank(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#leave-one-out
#-------------

#without noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with noise
df <- target_encoding_loo(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  white_noise = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)


#rnorm
#-----

#without sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)

#with sd multiplier
df <- target_encoding_rnorm(
  df = vi,
  response = "vi_mean",
  predictor = "soil_type",
  rnorm_sd_multiplier = 0.1,
  replace = TRUE
)

plot(
  x = df$soil_type,
  y = df$vi_mean,
  xlab = "encoded variable",
  ylab = "response"
)



[Package collinear version 1.1.1 Index]