R: Filter rare categories

remove_rare_categorical {dataPreparation}

R Documentation

Filter rare categories

Description

Filter out rows whose categorical values occur rarely.

Usage

remove_rare_categorical(
  data_set,
  cols = "auto",
  threshold = 0.01,
  verbose = TRUE
)

Arguments

`data_set`	Matrix, data.frame or data.table
`cols`	Name(s) of the column(s) of data_set to transform. To transform all columns, set it to "auto". (character, defaults to "auto")
`threshold`	Share of occurrences below which rows are removed (numeric, defaults to 0.01)
`verbose`	Should the algorithm talk? (logical, default to TRUE)

Details

Filtering is performed column by column: rows with rare values in the first element of cols are removed, then rows with rare values in the second element of cols, and so on.
As a result, if filtering is performed on too many columns, there is a high risk that a lot of rows will be dropped.
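The column-by-column behaviour described above can be illustrated with a minimal sketch in plain data.table. The helper `drop_rare` is hypothetical and is not the package's implementation; it assumes that shares are computed on the rows remaining at each step and that a category is kept when its share is at least `threshold`.

```r
library(data.table)

# Hypothetical helper (not part of dataPreparation): keeps only the rows whose
# value in `col` accounts for a share of rows of at least `threshold`.
drop_rare <- function(dt, col, threshold = 0.01) {
  shares <- table(dt[[col]]) / nrow(dt)        # share of rows per category
  keep <- names(shares)[shares >= threshold]   # categories that are not rare
  dt[dt[[col]] %in% keep]
}

# 100 rows: one rare value in column a, a different rare value in column b
dt <- data.table(
  a = c("rare_a", rep("x", 99)),
  b = c(rep("p", 99), "rare_b")
)

# Apply the filter one column at a time, as described above
filtered <- dt
for (col in c("a", "b")) {
  filtered <- drop_rare(filtered, col, threshold = 0.05)
}

nrow(filtered)
```

Each column removes its own set of rows, so the losses add up as more columns are filtered.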

Value

Same dataset with fewer rows, edited by reference.
If you don't want to edit by reference, please provide data_set = copy(data_set).
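As a general illustration of what editing by reference means in data.table, the sketch below contrasts a plain assignment with `copy()`. It uses only data.table itself; the variable names and the `flag` column are illustrative assumptions and say nothing about the internals of `remove_rare_categorical`.

```r
library(data.table)

original <- data.table(cat_col = c("A", "A", "B"))
alias    <- original        # plain assignment: both names refer to the same table
snapshot <- copy(original)  # deep copy: independent of `original`

# := modifies the table in place, without creating a new object
original[, flag := TRUE]

"flag" %in% names(alias)     # the alias sees the change
"flag" %in% names(snapshot)  # the copy does not
```

In the same spirit, passing `copy(data_set)` as the `data_set` argument leaves the original table untouched.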

Examples

# Given a set with rare "C"
library(data.table)
library(dataPreparation)
data_set <- data.table(cat_col = c(sample(c("A", "B"), 1000, replace=TRUE), "C"))

# When calling function
data_set <- remove_rare_categorical(data_set, cols = "cat_col",
                                   threshold = 0.01, verbose = TRUE)

# Then there are no "C"
unique(data_set[["cat_col"]])

[Package dataPreparation version 1.1.1 Index]