woe.tree.binning {woeBinning} | R Documentation |
Binning via Tree-Like Segmentation
Description
woe.tree.binning generates a supervised tree-like segmentation of numeric variables and factors with respect to a dichotomous target variable. Its parameters provide flexibility in finding a binning that fits specific data characteristics and practical needs.
Usage
woe.tree.binning(df, target.var, pred.var, min.perc.total,
min.perc.class, stop.limit, abbrev.fact.levels, event.class)
Arguments
df |
Name of data frame with input data. |
target.var |
Name of dichotomous target variable in quotes. Only target variables with two distinct values (e.g. 0, 1 or “Y”, “N”) are accepted; cases with NAs in the target variable will be ignored. |
pred.var |
Name of predictor variable(s) to be binned in quotes. A single variable name can be provided, e.g. "varname1", or a character vector of variable names, e.g. c("varname1", "varname2"). Alternatively one can repeat the name of the input data frame; the function will then be applied to all its variables apart from the target variable. Numeric variables and factors are supported and may contain NAs. |
min.perc.total |
For numeric variables this parameter defines the number of initial classes before any merging or tree-like splitting is applied. For example min.perc.total=0.05 (5%) will result in 20 initial classes. For factors the original levels with a percentage below this limit are collected in a ‘miscellaneous’ level before the merging based on the min.perc.class and the tree-like splitting based on the WOE values starts. Increasing the min.perc.total parameter will avoid sparse bins. Accepted range: 0.0001-0.2; default: 0.01. |
min.perc.class |
If a column percentage of one of the target classes within a bin is below this limit (e.g. below 0.01=1%) then the respective bin will be joined with others. In case of numeric variables adjacent predictor classes are merged. For factors respective levels (including sparse NAs) are assigned to a ‘miscellaneous’ level. Setting min.perc.class>0 may provide more reliable WOE values. Accepted range: 0-0.2; default: 0, i.e. no merging with respect to sparse target classes is applied. |
stop.limit |
Stops WOE based segmentation of the predictor's classes/levels in case the resulting information value (IV) increases less than x% (e.g. 0.05 = 5%) compared to the preceding binning step. Increasing the stop.limit will simplify the binning solution and may avoid overfitting. Accepted range: 0-0.5; default: 0.1. |
abbrev.fact.levels |
Abbreviates the names of new (merged) factor levels via the base R
|
event.class |
Optional parameter for specifying the class of the target event. This class typically indicates a negative event like a loan default or a disease. Use integers (e.g. 1) or characters in quotes (e.g. “bad”). This class will be represented by negative WOE values then. |
Value
woe.tree.binning generates an object with the information necessary for studying and applying the realized binning solution. When saved, it can be used with the functions woe.binning.plot, woe.binning.table and woe.binning.deploy.
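A minimal sketch of how the saved object might be passed on to the three functions named above. It assumes that the woeBinning package is installed and that each function accepts the binning object with its remaining parameters left at their defaults; the variable names on the left-hand side are chosen for illustration only.

```r
# Sketch only: runs when the woeBinning package is available
if (requireNamespace("woeBinning", quietly = TRUE)) {
  library(woeBinning)
  data(germancredit)
  df <- germancredit[, c("creditability", "duration.in.month", "purpose")]

  # Save the binning solution
  binning <- woe.tree.binning(df, "creditability",
                              c("duration.in.month", "purpose"))

  # Tabulate and plot the saved binning solution
  tabulate.binning <- woe.binning.table(binning)
  woe.binning.plot(binning)

  # Apply the saved binning solution to a data frame
  df.with.binned.vars.added <- woe.binning.deploy(df, binning)
}
```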
Binning of Numeric Variables
Numeric variables (continuous and ordinal) are binned beginning with initial classes with similar frequencies. The number of initial bins results from the min.perc.total parameter: min.perc.total will result in trunc(1/min.perc.total) initial bins, whereby trunc is needed to guarantee bins with similar frequencies. For example min.perc.total=0.07 will cause trunc(14.3)=14 initial classes. Next, if min.perc.class>0, bins with sparse target classes will be merged with the next upper bin, and in case of the last bin with the next lower one. NAs have their own bin and will not be merged with others. Finally the actual tree-like procedure starts: binary splits iteratively assign nearby classes with similar weight of evidence (WOE) values to segments in a way that maximizes the resulting information value (IV). The procedure stops when the IV increases by less than the percentage specified via the stop.limit parameter.
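The first step described above can be sketched in base R. The helper initial.bins is hypothetical and not part of woeBinning; the use of quantile() and cut() is an assumption about how classes with similar frequencies could be formed, not the package's actual code.

```r
# Hypothetical helper for illustration; not a function of woeBinning.
initial.bins <- function(x, min.perc.total = 0.01) {
  # Number of initial classes as described above
  n.bins <- trunc(1 / min.perc.total)
  # Equal-frequency cutpoints; unique() guards against ties in the data
  probs <- seq(0, 1, length.out = n.bins + 1)
  cutpoints <- unique(quantile(x, probs = probs, na.rm = TRUE))
  cut(x, breaks = cutpoints, include.lowest = TRUE)
}

x <- seq_len(1000)
bins <- initial.bins(x, min.perc.total = 0.07)
nlevels(bins)        # number of initial classes: trunc(1/0.07)
range(table(bins))   # smallest and largest class size
```

With heavily tied data fewer distinct cutpoints remain, so the number of classes can fall below trunc(1/min.perc.total) in this sketch.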
Binning of Factors
Factors (categorical variables) are binned via factor levels. As a start sparse levels (defined via the min.perc.total and min.perc.class parameters) are merged to a ‘miscellaneous’ level: if possible, respective levels (including sparse NAs) are bundled as ‘misc. level pos.’ (associated with positive WOE values), respectively as ‘misc. level neg.’ (associated with negative WOE values). In case a misc. level contains only NAs it will be named ‘Missing’. Afterwards the actual tree-like procedure starts: binary splits iteratively assign levels with similar WOE values to segments in a way that maximizes the resulting information value (IV). The procedure stops when the IV increases by less than the percentage specified via the stop.limit parameter.
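For orientation, the WOE and IV quantities that drive the splitting of both numeric variables and factors can be sketched as follows. The sketch uses a common textbook definition (WOE as the log ratio of the column percentages, IV as the weighted sum of the WOE values); the scaling used inside woeBinning may differ. The helpers woe.iv and iv.gain.too.small, the column names and the counts are hypothetical, and the stop rule reflects one reading of the stop.limit description.

```r
# Hypothetical helpers for illustration; not functions of woeBinning.
# counts: matrix of frequencies with one row per bin and the columns
# "event" and "nonevent" (names assumed for this sketch).
woe.iv <- function(counts) {
  # Column percentages: share of each target class falling into each bin
  col.perc <- sweep(counts, 2, colSums(counts), "/")
  # Bins in which the event class is over-represented get negative WOE
  woe <- log(col.perc[, "nonevent"] / col.perc[, "event"])
  iv <- sum((col.perc[, "nonevent"] - col.perc[, "event"]) * woe)
  list(woe = woe, iv = iv)
}

# Assumed reading of stop.limit: relative IV gain over the preceding step
iv.gain.too.small <- function(iv.old, iv.new, stop.limit = 0.1) {
  (iv.new - iv.old) / iv.old < stop.limit
}

counts <- matrix(c(30,  70,
                   50, 150,
                   20, 180),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("bin1", "bin2", "bin3"),
                                 c("event", "nonevent")))
res <- woe.iv(counts)
res$woe                          # negative for bin1 and bin2, positive for bin3
res$iv
iv.gain.too.small(0.30, 0.31)    # gain of about 3% is below the 10% limit
```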
Adjustment of 0 Frequencies
In case the crosstab of the bins with the target classes contains frequencies = 0 the column percentages are adjusted to be able to compute the WOE and IV values: the offset 0.0001 (=0.01%) is added to each column percentage cell and the column percentages are recomputed then. This allows considering bins associated with one target class only, but may cause extreme WOE values for these bins. If a correction is not appropriate choose min.perc.class>0; bins with sparse target classes will be merged then before computing any WOE or IV value.
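The adjustment can be illustrated with a small base R sketch. The helper adjust.col.perc and the counts are hypothetical; in particular, interpreting "recomputed" as rescaling each column to sum to 1 again is an assumption, and the WOE formula is the common log-ratio definition rather than code taken from the package.

```r
# Hypothetical helper for illustration; not a function of woeBinning.
adjust.col.perc <- function(counts, offset = 0.0001) {
  col.perc <- sweep(counts, 2, colSums(counts), "/")
  if (any(counts == 0)) {
    # Add the offset to every cell, then rescale each column to sum to 1
    # (the rescaling is an assumption of this sketch)
    col.perc <- col.perc + offset
    col.perc <- sweep(col.perc, 2, colSums(col.perc), "/")
  }
  col.perc
}

# bin1 contains no cases of the event class
counts <- matrix(c( 0,  40,
                   60, 160,
                   40, 200),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("bin1", "bin2", "bin3"),
                                 c("event", "nonevent")))
col.perc <- adjust.col.perc(counts)
woe <- log(col.perc[, "nonevent"] / col.perc[, "event"])
woe   # finite for all bins, but extreme for bin1
```

Without the offset the WOE of bin1 would be infinite; with it the value is finite but far larger in magnitude than for the other bins, which is the effect the paragraph above warns about.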
Handling of Missing Data
Cases with NAs in the target variable will be ignored. For predictor variables the following applies: in case NAs already occurred when generating the binning solution, the code ‘Missing’ is displayed and a corresponding WOE value can be computed. (Note that factor NAs may be joined with other sparse levels to a ‘miscellaneous’ level - see above; only this ‘miscellaneous’ level will be displayed then.) In case NAs occur only in the deployment scenario, ‘Missing’ is displayed for numeric variables and ‘unknown’ for factors, and the corresponding WOE values will be NA as well.
See Also
Other binning functions: woe.binning
Examples
# Load the package and the German credit data, then create a subset
library(woeBinning)
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
'savings.account.and.bonds', 'purpose')]
# Bin a single numeric variable
binning <- woe.tree.binning(df, 'creditability', 'duration.in.month',
min.perc.total=0.01, min.perc.class=0.01,
stop.limit=0.1, event.class='bad')
# Bin a single factor
binning <- woe.tree.binning(df, 'creditability', 'purpose',
min.perc.total=0.05, min.perc.class=0, stop.limit=0.1,
abbrev.fact.levels=50, event.class='bad')
# Bin two variables (one numeric and one factor)
# with default parameter settings
binning <- woe.tree.binning(df, 'creditability', c('credit.amount','purpose'))
# Bin all variables of the data frame (apart from the target variable)
# with default parameter settings
binning <- woe.tree.binning(df, 'creditability', df)