R: Encode a given factor variable automatically

encode_categories {categoryEncodings}

R Documentation

Encode a given factor variable automatically

Description

Transforms the original design matrix automatically, using the appropriate encoding.

Usage

encode_categories(X, Y = NULL, fact = NULL, method = NULL,
  keep = FALSE)

Arguments

`X`	The data.frame/data.table to transform.
`Y`	Optional: The dependent variable to ignore in the transformation.
`fact`	Optional: The factor variable(s) to encode by - either positive integer(s) specifying the column number, or the name(s) of the column. If left empty a heuristic is used to determine the factor variable(s), and a warning is written with the names of the variables converted.
`method`	Optional: A character string indicating which encoding method to use, either of the following: * "mean" * "median" * "deviation" * "lowrank" * "SPCA" * "mnl" * "dummy" * "difference" * "helmert" * "simple_effect" * "repeated_effect" If only a single method is specified, it is taken to encode either all of the variables supplied through fact, or variables which have been flagged as factors automatically. If multiple methods are specified, the number of methods must match the number of factor variables in fact - and these are applied to correspond in the order in which they were supplied. In case a missmatch occurs, an error is raised. If left empty, the appriopriate method is selected on a case by case basis (and the selected methods are written out to console).
`keep`	Whether to keep the original factor column(s), defaults to FALSE.

Details

Automatically selects the appropriate method given the number of anticipated newly created variables, based on the results in Johannemann et al.(2019) 'Sufficient Representations for Categorical Variables', and a simple heuristic - where

Value

A new data.table X which contains the new columns and optionally the old factor(s).

Examples


design_mat <- cbind( data.frame( matrix(rnorm(5*100),ncol = 5) ),
                     sample( sample(letters, 10), 100, replace = TRUE)
                     )
colnames(design_mat)[6] <- "factor_var"

encode_categories( design_mat, method = "mean" )

[Package categoryEncodings version 1.4.3 Index]