one_hot {nestedcv} | R Documentation |
One-hot encode
Description
Fast one-hot encoding of all factor and character columns in a dataframe to convert it into a numeric matrix by creating dummy (binary) columns.
Usage
one_hot(x, all_levels = FALSE, rename_binary = TRUE, sep = ".")
Arguments
x |
A dataframe, matrix or tibble. Matrices are returned untouched. |
all_levels |
Logical, whether to create dummy variables for all levels of each factor. Default is FALSE. |
rename_binary |
Logical, whether to rename binary factors by appending the 2nd level of the factor to aid interpretation of encoded factor levels and to allow consistency with naming. |
sep |
Character for separating factor variable names and levels for encoded columns. |
Details
Binary factor columns and logical columns are converted to integers (0 or 1). Multi-level unordered factors are converted to multiple columns of 0/1 (dummy variables): if all_levels is set to FALSE (the default), then the first level is assumed to be a reference level and additional columns are created for each additional level; if all_levels is set to TRUE one column is used for each level. Unused levels are dropped. Character columns are first converted to factors and then encoded. Ordered factors are replaced by their internal codes. Numeric or integer columns are left untouched.
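The encoding rules above can be sketched by hand in base R for a single column. This is only an illustration of the described behaviour, not the package's own implementation; the object names (f, all_lv, ref_lv, o_codes, flag) are made up for the example.

```r
# Hand-rolled illustration of the encoding rules (not the nestedcv source code).
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c", "d"))
f <- droplevels(f)  # the unused level "d" is dropped

# Analogue of all_levels = TRUE: one 0/1 column per remaining level
all_lv <- sapply(levels(f), function(l) as.integer(f == l))

# Analogue of all_levels = FALSE: the first level "a" acts as the reference,
# so its column is omitted
ref_lv <- all_lv[, -1, drop = FALSE]

# Ordered factors are replaced by their internal integer codes
o <- factor(c("low", "high", "mid"),
            levels = c("low", "mid", "high"), ordered = TRUE)
o_codes <- as.integer(o)

# Logical columns become 0/1 integers
flag <- as.integer(c(TRUE, FALSE, TRUE))
```

In the all-levels version every row has exactly one 1 across the dummy columns; in the reference-level version a row of all zeros denotes the reference level.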
Having dummy variables for all levels of a factor can cause problems with multicollinearity in regression (the dummy variable trap), so all_levels is set to FALSE by default, which is necessary for regression models such as glmnet (equivalent to full rank parameterisation). However, setting all_levels to TRUE can aid with interpretability (e.g. with SHAP values), and in some cases filtering might result in some dummy variables being excluded. Note this function is designed to quickly generate dummy variables for more general machine learning purposes. To create a proper design matrix object for regression models, use model.matrix().
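The dummy variable trap can be checked numerically with base R. The sketch below uses model.matrix() and qr() rather than one_hot(), so it shows the general linear-algebra point, not the output of this function; X_all and X_ref are illustrative names.

```r
data(iris)

# One dummy column per Species level, plus an explicit intercept column.
# The three dummies sum to the intercept, so the columns are linearly dependent.
X_all <- cbind(Intercept = 1, model.matrix(~ Species - 1, data = iris))

# Default treatment coding: intercept plus one column per non-reference level
X_ref <- model.matrix(~ Species, data = iris)

# Compare the number of columns with the matrix rank
rank_all <- qr(X_all)$rank
rank_ref <- qr(X_ref)$rank
c(columns = ncol(X_all), rank = rank_all)
c(columns = ncol(X_ref), rank = rank_ref)
```

The all-levels matrix has more columns than its rank, whereas the reference-level matrix is full rank, which is why the reference-level form is the safer input for regression models fitted with an intercept.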
Value
A numeric matrix with the same number of rows as the input data. Dummy variable columns replace the input factor or character columns. Numeric columns are left intact.
See Also
caret::dummyVars(), model.matrix()
Examples
data(iris)
x <- iris
x2 <- one_hot(x, all_levels = TRUE)
head(x2) # 3 columns for Species
x2 <- one_hot(x, all_levels = FALSE)
head(x2) # 2 columns for Species