mdlp2 {arc} | R Documentation |
Supervised Discretization
Description
Performs supervised discretization of all numeric columns of the provided data frame, except the class column. Uses the Minimum Description Length Principle (MDLP) algorithm (Fayyad and Irani, 1993) as implemented in the discretization package.
Usage
mdlp2(
df,
cl_index = NULL,
handle_missing = FALSE,
labels = FALSE,
skip_nonnumeric = FALSE,
infinite_bounds = FALSE,
min_distinct_values = 3
)
Arguments
df |
Input data frame. |
cl_index |
Index of the class variable. If not specified, the last column is used as the class variable. |
handle_missing |
If set to TRUE and the column being processed contains missing values, the input for discretization is the subset of the data consisting of this column and the target, with rows containing missing values excluded. |
labels |
A logical indicating how the bins of the discretized data are represented: as integer codes when FALSE, or in interval notation (a;b] when TRUE. |
skip_nonnumeric |
If set to TRUE, any non-numeric columns will be skipped. |
infinite_bounds |
A logical indicating what the bounds of the outermost intervals should look like. |
min_distinct_values |
If a column contains fewer than the specified number of distinct values, it is not discretized. |
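The behaviours described for handle_missing, labels and min_distinct_values can be illustrated with base R alone. The sketch below is not the package's implementation: sketch_discretize is a hypothetical helper, the cut point is supplied by hand rather than found by MDLP, the use of -Inf and Inf as outer breaks is an assumption, and leaving NA in place in the output is also an assumption. Note that base R's cut() writes intervals with a comma, whereas this page describes the (a;b] form.

```r
# Hypothetical helper, NOT part of arc: mimics the argument descriptions
# for one numeric column x and a class vector cl.
sketch_discretize <- function(x, cl, cutpoints, labels = FALSE,
                              min_distinct_values = 3) {
  observed <- !is.na(x)

  # min_distinct_values: columns with too few distinct values stay as they are
  if (length(unique(x[observed])) < min_distinct_values) {
    return(list(input = NULL, bins = x))
  }

  # handle_missing: the discretizer would see only the column and the target,
  # restricted to rows where the column is observed
  input <- data.frame(x = x[observed], cl = cl[observed])

  # Assumed outer breaks; the cut points themselves are given, not learned
  breaks <- c(-Inf, cutpoints, Inf)

  bins <- if (labels) {
    cut(x, breaks = breaks)                  # factor with interval labels
  } else {
    cut(x, breaks = breaks, labels = FALSE)  # integer codes
  }
  list(input = input, bins = bins)
}

x  <- c(4.9, 5.1, NA, 6.3, 7.0)
cl <- factor(c("a", "a", "b", "b", "b"))

res <- sketch_discretize(x, cl, cutpoints = 5.5)                 # integer codes
lab <- sketch_discretize(x, cl, cutpoints = 5.5, labels = TRUE)  # interval labels
few <- sketch_discretize(c(1, 1, 2), factor(c("a", "a", "b")),
                         cutpoints = 1.5)                        # too few values
```

Here res$input holds the four complete rows, res$bins holds integer codes with NA kept for the missing observation, lab$bins is a factor with two interval levels, and few$bins is the untouched input because it has only two distinct values.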
Value
Discretized data. Any non-numeric input columns are returned as is. All returned columns except the class column are factors.
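Since this page does not spell out the exact structure of the returned object, a cautious way to inspect it is str(). The snippet below assumes the arc package is installed and is skipped otherwise. It uses only the call already shown in the Examples section.

```r
# Runs only if arc is available; makes no assumption about the result's shape
if (requireNamespace("arc", quietly = TRUE)) {
  disc <- arc::mdlp2(datasets::iris)
  # str() prints the structure, including which columns are factors
  str(disc)
}
```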
References
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Artificial Intelligence 13, 1022–1027.
Examples
# gives the same result as mdlp(datasets::iris) from the discretization package
mdlp2(datasets::iris)
# uses Sepal.Length as target variable
mdlp2(df = datasets::iris, cl_index = 1, handle_missing = TRUE, labels = TRUE,
      skip_nonnumeric = TRUE, infinite_bounds = TRUE, min_distinct_values = 30)