encode_categories {categoryEncodings} | R Documentation |
Encode a given factor variable automatically
Description
Transforms the original design matrix automatically, using the appropriate encoding.
Usage
encode_categories(X, Y = NULL, fact = NULL, method = NULL,
keep = FALSE)
Arguments
X |
The data.frame/data.table to transform. |
Y |
Optional: The dependent variable to ignore in the transformation. |
fact |
Optional: The factor variable(s) to encode, given either as positive integer(s) specifying the column number(s) or as the name(s) of the column(s). If left empty, a heuristic is used to determine the factor variable(s), and a warning is issued listing the names of the variables converted. |
method |
Optional: A character vector indicating which encoding method(s) to use. Each element must be one of the following: * "mean" * "median" * "deviation" * "lowrank" * "SPCA" * "mnl" * "dummy" * "difference" * "helmert" * "simple_effect" * "repeated_effect" If only a single method is specified, it is used to encode all of the variables supplied through *fact*, or all variables that have been flagged as factors automatically. If multiple methods are specified, the number of methods must match the number of factor variables in *fact*, and the methods are applied in the order in which they were supplied. If the numbers do not match, an error is raised. If left empty, the appropriate method is selected on a case-by-case basis (and the selected methods are printed to the console). |
keep |
Whether to keep the original factor column(s), defaults to **FALSE**. |
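As a hedged illustration of the *fact* and *keep* arguments described above, the sketch below selects the same column once by position and once by name. The data set `toy` and the result names `by_index` and `by_name` are made up for this example, and the package calls are skipped if categoryEncodings is not installed.

```r
# Toy design matrix (hypothetical data): two numeric columns and one factor.
set.seed(42)
toy <- data.frame(
  x1  = rnorm(20),
  x2  = rnorm(20),
  grp = factor(sample(c("a", "b", "c"), 20, replace = TRUE))
)

# Only call the package when it is actually available.
if (requireNamespace("categoryEncodings", quietly = TRUE)) {
  # Select the factor variable by its column position ...
  by_index <- categoryEncodings::encode_categories(toy, fact = 3,
                                                   method = "mean")
  # ... or by its column name, additionally keeping the original factor.
  by_name <- categoryEncodings::encode_categories(toy, fact = "grp",
                                                  method = "mean",
                                                  keep = TRUE)
}
```

Passing *fact* explicitly avoids the automatic detection heuristic and the warning that comes with it.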
Details
Automatically selects the appropriate method given the number of anticipated newly created variables, based on the results in Johannemann et al. (2019), 'Sufficient Representations for Categorical Variables', and a simple heuristic - where
Value
A new data.table X which contains the new columns and optionally the old factor(s).
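To make concrete how the number of new columns can depend on the number of factor levels, here is a minimal base R sketch using `model.matrix`. It only illustrates indicator (dummy) coding in general; it is not the package's implementation, and the output of `method = "dummy"` may differ in column names or choice of reference level. The names `f`, `full_dummies` and `ref_dummies` are hypothetical.

```r
# A small factor with four levels (hypothetical data).
f <- factor(c("a", "b", "c", "d", "a", "b", "c", "d"))
n_levels <- nlevels(f)

# One indicator column per level (no intercept).
full_dummies <- model.matrix(~ f - 1)

# Reference coding: drop the intercept column, leaving one column
# fewer than the number of levels.
ref_dummies <- model.matrix(~ f)[, -1, drop = FALSE]

ncol(full_dummies)  # equals n_levels
ncol(ref_dummies)   # equals n_levels - 1
```

Because the column count grows with the number of levels, a factor with many levels can widen the design matrix considerably under this kind of coding.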
Examples
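# Build a toy design matrix: five numeric columns plus one categorical
# column whose values are drawn from 10 randomly chosen letters.
design_mat <- cbind( data.frame( matrix(rnorm(5*100), ncol = 5) ),
                     sample( sample(letters, 10), 100, replace = TRUE)
)
colnames(design_mat)[6] <- "factor_var"
# No 'fact' is given, so the factor variable is detected automatically
# and then encoded with the "mean" method.
encode_categories( design_mat, method = "mean" )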
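A further sketch, under the same assumptions as the example above, shows one method per factor variable. The data set `two_factors` and the helper vectors `fact_cols` and `methods` are hypothetical names. The length check mirrors the documented rule that the number of methods must match the number of variables in *fact*, and the package call is skipped if categoryEncodings is not installed.

```r
# Toy data (hypothetical): two numeric columns and two factor columns.
set.seed(1)
two_factors <- data.frame(
  x1 = rnorm(50),
  x2 = rnorm(50),
  f1 = factor(sample(letters[1:5], 50, replace = TRUE)),
  f2 = factor(sample(c("lo", "hi"), 50, replace = TRUE))
)

# One method per factor, matched by position: f1 -> "mean", f2 -> "dummy".
fact_cols <- c("f1", "f2")
methods   <- c("mean", "dummy")

# Check the lengths up front, since a mismatch is documented to be an error.
stopifnot(length(methods) == length(fact_cols))

if (requireNamespace("categoryEncodings", quietly = TRUE)) {
  encoded <- categoryEncodings::encode_categories(two_factors,
                                                  fact = fact_cols,
                                                  method = methods)
}
```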