bin_cols {tidybins}    R Documentation

Bin Cols

Description

Make bins in a tidy fashion. Adds a column to your data frame containing the integer codes of the specified bins of a certain column. Specifying multiple columns is intended only for supervised binning, so that multiple columns can be binned simultaneously and optimally with respect to a target variable.

Usage

bin_cols(
  .data,
  col,
  n_bins = 10,
  bin_type = "frequency",
  ...,
  target = NULL,
  pretty_labels = FALSE,
  seed = 1,
  method = "mdlp"
)

Arguments

.data

a data frame

col

a column, a vector of columns, or a tidyselect expression

n_bins

number of bins

bin_type

method to make bins

...

parameters passed to the selected binning method

target

unquoted column for supervised binning

pretty_labels

logical. If TRUE, returns interval labels rather than integer ranks

seed

seed for stochastic binning (xgboost)

method

method to use when bin_type is "mdlp"
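
As a rough illustration of the two output styles that pretty_labels switches between, the sketch below uses base R's cut() as a stand-in. It does not call tidybins, and the vector x and the break points are made up for the example, so the exact labels bin_cols produces may differ.

```r
# Base R stand-in for the two output styles; tidybins itself is not used here.
# x and breaks are hypothetical values chosen for illustration.
x <- c(2.1, 2.9, 3.2, 3.4, 4.4)
breaks <- c(2, 3, 4, 5)

# Interval labels, the style described for pretty_labels = TRUE.
labels <- cut(x, breaks = breaks)

# Integer codes, the style described for the default pretty_labels = FALSE.
codes <- as.integer(labels)

print(data.frame(x, codes, labels))
```

Each integer code is the position of the interval that contains the value, so the two columns carry the same grouping in different forms.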

Details

Description of the values accepted by bin_type

frequency (fr)

creates bins of equal content via quantiles. Wraps bin with method "content". Similar to ntile.

width (wi)

creates bins of equal numeric width. Wraps bin with method "length".

kmeans (km)

creates bins using 1-dimensional kmeans. Wraps bin with method "clusters".

value (va)

each bin has equal sum of values

xgboost (xg)

column is binned by best predictor of a target column using step_discretize_xgb

cart (ca)

if the column does not have enough distinct values, xgboost will fail and bin_cols automatically reverts to step_discretize_cart

woe (wo)

column is binned by weight of evidence. Requires binary target.

logreg (lr)

column is binned by logistic regression. Requires binary target.

mdlp

uses the discretizeDF.supervised algorithm with a variety of methods.
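
To make the difference between three of the unsupervised bin types concrete, the following base R sketch approximates "frequency", "width", and "value" on a small made-up vector. It is not the tidybins implementation: the cumulative-sum rule used for "value" in particular is an assumption chosen for illustration, and the package may place break points differently.

```r
# Base R approximation of three bin types; not tidybins internals.
# x is a hypothetical, already sorted vector.
x <- c(1, 2, 3, 4, 5, 6, 7, 20)
n_bins <- 2

# "frequency": quantile break points, so each bin holds about the same number of rows.
fr_breaks <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
fr <- cut(x, breaks = fr_breaks, include.lowest = TRUE, labels = FALSE)

# "width": break points evenly spaced over the range of x.
wi <- cut(x, breaks = n_bins, labels = FALSE)

# "value" (assumed rule): assign bins by cumulative share of sum(x),
# so each bin carries roughly the same total.
va <- as.integer(ceiling(cumsum(x) / sum(x) * n_bins))

print(data.frame(x, fr, wi, va))
print(tapply(x, va, sum))  # per-bin totals under the assumed "value" rule
```

On this vector the three rules disagree: "frequency" splits the rows four and four, "width" isolates the single large value, and the "value" sketch puts the last two rows in the second bin so that the bin totals are close.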

Value

a data frame

Examples


library(tidybins)

iris %>%
  bin_cols(Sepal.Width, n_bins = 5, pretty_labels = TRUE) %>%
  bin_cols(Petal.Width, n_bins = 3, bin_type = c("width", "kmeans")) %>%
  bin_cols(Sepal.Width, bin_type = "xgboost", target = Species, seed = 1) -> iris1

# Binned columns are named by original name + method abbreviation + number of bins created.
# Sometimes the actual number of bins is less than n_bins if the column lacks enough variance.
iris1 %>%
  print(width = Inf)

iris1 %>%
  bin_summary() %>%
  print(width = Inf)

[Package tidybins version 0.1.1 Index]