categ_reducer {lares}    R Documentation

Reduce categorical values

Description

This function reduces the number of distinct categorical values in a variable by replacing the filtered values with a single label. It is tidyverse-friendly and can be used in pipelines.

Usage

categ_reducer(
  df,
  var,
  nmin = 0,
  pmin = 0,
  pcummax = 100,
  top = NA,
  pvalue_max = 1,
  cor_var = "tag",
  limit = 20,
  other_label = "other",
  ...
)

Arguments

df

Data.frame. Data containing the categorical variable to reduce.

var

Variable. Which variable do you wish to reduce?

nmin

Integer. Minimum number of times a value must be repeated

pmin

Numeric. Minimum percentage of times a value must be repeated

pcummax

Numeric. Maximum cumulative percentage covered by the most repeated values

top

Integer. Keep the n most frequently repeated values

pvalue_max

Numeric (0-1]. Maximum p-value for categories

cor_var

Character. If pvalue_max < 1, you must define the name of the column (numerical or binary) that the categories will be compared with.

limit

Integer. Limit one hot encoding to the n most frequent values of each column. Set to NA to ignore argument.

other_label

Character. Text used to replace the filtered values.

...

Additional parameters.
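
To make the frequency-based arguments more concrete, the following is a minimal base-R sketch of how nmin, pmin, top and other_label could interact on a plain vector. The helper reduce_levels_sketch is hypothetical and is not the lares implementation. In particular, combining the thresholds with a logical AND and applying top last are assumptions made for illustration.

```r
# Hypothetical helper, NOT the lares implementation: a base-R sketch of
# frequency-based reduction on a plain vector.
reduce_levels_sketch <- function(x, nmin = 0, pmin = 0, top = NA,
                                 other_label = "other") {
  x <- as.character(x)
  # Count each value, most frequent first
  counts <- sort(table(x), decreasing = TRUE)
  pct <- 100 * counts / sum(counts)
  # Assumption: a value is kept only if it meets both thresholds
  keep <- names(counts)[counts >= nmin & pct >= pmin]
  # Assumption: `top` is applied after the thresholds
  if (!is.na(top)) keep <- head(keep, top)
  # Replace every other value with the fallback label
  ifelse(x %in% keep, x, other_label)
}

x <- c("a", "a", "a", "b", "b", "c", "d")
res_nmin <- reduce_levels_sketch(x, nmin = 2)
res_top <- reduce_levels_sketch(x, top = 1, other_label = "rare")
print(table(res_nmin))
print(table(res_top))
```

Unlike this sketch, categ_reducer() takes a data.frame plus a column and returns the data.frame, which is what lets it sit inside a pipeline.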

Value

data.frame. The input df in which var has been transformed.

See Also

Other Data Wrangling: balance_data(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), textTokenizer(), vector2text(), year_month(), zerovar()

Examples

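library(lares)
data(dft) # Titanic dataset
# Keep the 2 most frequent Embarked values
categ_reducer(dft, Embarked, top = 2) %>% freqs(Embarked)
# Filter Ticket values using nmin = 7 and a custom replacement label
categ_reducer(dft, Ticket, nmin = 7, other_label = "Other Ticket") %>% freqs(Ticket)
# Filter Ticket values by p-value (max 0.05) against the Survived column
categ_reducer(dft, Ticket, pvalue_max = 0.05, cor_var = "Survived") %>% freqs(Ticket)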

[Package lares version 5.2.8 Index]