R: Compute encoding

build_encoding {dataPreparation}

R Documentation

Compute encoding

Description

Build a list of one-hot encodings, one for each column in cols.

Usage

build_encoding(data_set, cols = "auto", verbose = TRUE, min_frequency = 0, ...)

Arguments

`data_set`	Matrix, data.frame or data.table
`cols`	List of column name(s) of data_set to encode. To encode all character columns, set it to "auto". (character, default to "auto")
`verbose`	Should the algorithm talk? (Logical, default to TRUE)
`min_frequency`	The minimum share of rows that a category must represent to get its own column (numeric, between 0 and 1, default to 0)
`...`	Other arguments such as `name_separator` to separate words in new column names (character, default to ".")

Details

To avoid creating very large sparse matrices, use the min_frequency parameter so that only the most representative values get a new column (and not outliers or data-entry mistakes).
Setting min_frequency to a value greater than 0 may make the function slower, especially on a large data_set.
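The effect of min_frequency can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the toy vector `x`, the variable `kept_values`, and the inclusive comparison are assumptions made for the illustration.

```r
# Hypothetical toy column; not taken from any packaged data set.
x <- c("a", "a", "a", "a", "a", "a", "b", "b", "b", "c")

# Share of rows represented by each category.
shares <- prop.table(table(x))

# Keep only the categories whose share reaches the threshold.
# (Whether build_encoding compares with >= or > is an assumption here.)
min_frequency <- 0.2
kept_values <- names(shares)[shares >= min_frequency]

print(shares)
print(kept_values)
```

Here "c" covers one row in ten, so with a threshold of 0.2 it would not get a column of its own.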

Value

A list in which each element is named after a column of data_set and holds two entries, new_cols and values, describing the new columns that will be built during encoding.
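To make the shape of the returned list concrete, the sketch below builds a hand-written stand-in and walks through it. The column name `colour` and its categories are hypothetical, and the new column names assume the default `name_separator` of "."; a real result comes from build_encoding().

```r
# Hand-written stand-in for a result of build_encoding(); the column name
# and the categories below are hypothetical.
encoding_sketch <- list(
  colour = list(
    new_cols = c("colour.blue", "colour.red"),
    values = c("blue", "red")
  )
)

# Each top-level name is a column of the original data set.
print(names(encoding_sketch))

# For each encoded column, pair every new column name with its category.
for (col in names(encoding_sketch)) {
  entry <- encoding_sketch[[col]]
  print(data.frame(new_col = entry$new_cols, value = entry$values))
}
```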

Examples

# Get a data set
data(adult)
encoding <- build_encoding(adult, cols = "auto", verbose = TRUE)

print(encoding)

# To limit the number of generated columns, use the min_frequency parameter:
build_encoding(adult, cols = "auto", verbose = TRUE, min_frequency = 0.1)
# With min_frequency = 0.1, columns are created only for values found in at least 10% of the rows.

[Package dataPreparation version 1.1.1 Index]