one_hot {nestedcv}    R Documentation

One-hot encode

Description

Fast one-hot encoding of all factor and character columns in a dataframe to convert it into a numeric matrix by creating dummy (binary) columns.

Usage

one_hot(x, all_levels = FALSE, rename_binary = TRUE, sep = ".")

Arguments

x

A dataframe, matrix or tibble. Matrices are returned untouched.

all_levels

Logical, whether to create dummy variables for all levels of each factor. Default is FALSE to avoid issues with regression models.

rename_binary

Logical, whether to rename binary factors by appending the 2nd level of the factor to the column name. This aids interpretation of the encoded levels and keeps column naming consistent. Default is TRUE.

sep

Character string used to separate factor variable names from level names in encoded column names. Default is ".".

Details

Binary factor columns and logical columns are converted to integers (0 or 1). Multi-level unordered factors are converted to multiple columns of 0/1 (dummy variables): if all_levels is set to FALSE (the default), then the first level is assumed to be a reference level and additional columns are created for each additional level; if all_levels is set to TRUE one column is used for each level. Unused levels are dropped. Character columns are first converted to factors and then encoded. Ordered factors are replaced by their internal codes. Numeric or integer columns are left untouched.
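The per-column rules above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data frame and the variable names (df, flag_int, size_codes, dummies_ref, dummies_all) are hypothetical, and the "variable.level" column naming is an assumption based on the description of sep.

```r
# Toy data frame (hypothetical) with one column of each type described above
df <- data.frame(
  num   = c(1.5, 2.0, 3.5, 4.0),
  flag  = c(TRUE, FALSE, TRUE, TRUE),
  size  = factor(c("s", "m", "l", "m"), levels = c("s", "m", "l"), ordered = TRUE),
  group = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

# Logical column -> integer 0/1
flag_int <- as.integer(df$flag)

# Ordered factor -> its internal integer codes
size_codes <- as.integer(df$size)

# Character column -> factor; the extra level "d" is declared only to show
# an unused level being dropped
group <- droplevels(factor(df$group, levels = c("a", "b", "c", "d")))

# Analogue of all_levels = FALSE: first level is the reference, so skip it
dummies_ref <- sapply(levels(group)[-1], function(l) as.integer(group == l))

# Analogue of all_levels = TRUE: one 0/1 column per level
dummies_all <- sapply(levels(group), function(l) as.integer(group == l))

# Assumed naming scheme: variable name, then sep ".", then level
colnames(dummies_ref) <- paste("group", colnames(dummies_ref), sep = ".")
colnames(dummies_all) <- paste("group", colnames(dummies_all), sep = ".")

# Numeric column is left as is; everything is bound into one numeric matrix
encoded <- cbind(num = df$num, flag = flag_int, size = size_codes, dummies_ref)
```

In this sketch, every row of dummies_all has exactly one 1, whereas rows belonging to the reference level "a" are all zeros in dummies_ref.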

Having dummy variables for all levels of a factor can cause multicollinearity problems in regression (the dummy variable trap). For this reason all_levels is set to FALSE by default, which is necessary for regression models such as glmnet (equivalent to full rank parameterisation). However, setting all_levels to TRUE can aid interpretability (e.g. with SHAP values), and in some cases filtering might result in some dummy variables being excluded. Note that this function is designed to quickly generate dummy variables for more general machine learning purposes. To create a proper design matrix object for regression models, use model.matrix().
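As a rough illustration of the dummy variable trap, the following base R sketch compares the rank of a reference-level design matrix with one that carries a dummy column for every level plus an intercept. It uses only model.matrix() and qr() from base R; the object names (X_ref, X_all, rank_ref, rank_all) are hypothetical.

```r
data(iris)

# Full-rank (reference-level) design: intercept plus 2 Species dummies
X_ref <- model.matrix(~ Species, data = iris)

# One dummy column for every Species level, placed next to an intercept column
dummies_all <- model.matrix(~ Species - 1, data = iris)
X_all <- cbind(Intercept = 1, dummies_all)

# Compare rank with the number of columns in each design
rank_ref <- qr(X_ref)$rank
rank_all <- qr(X_all)$rank

c(columns = ncol(X_ref), rank = rank_ref)
c(columns = ncol(X_all), rank = rank_all)
```

The three dummy columns in X_all sum to the intercept column, so X_all has more columns than its rank, which is the collinearity that the default all_levels = FALSE avoids.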

Value

A numeric matrix with the same number of rows as the input data. Dummy variable columns replace the input factor or character columns. Numeric columns are left intact.

See Also

caret::dummyVars(), model.matrix()

Examples

library(nestedcv)
data(iris)
x <- iris
x2 <- one_hot(x)
head(x2)  # 2 columns for Species (default all_levels = FALSE)

x2 <- one_hot(x, all_levels = TRUE)
head(x2)  # 3 columns for Species


[Package nestedcv version 0.7.9 Index]