mdlp2 {arc}    R Documentation

Supervised Discretization

Description

Performs supervised discretization of all numeric columns, other than the class column, in the provided data frame. Uses the Minimum Description Length Principle algorithm (Fayyad and Irani, 1993) as implemented in the discretization package.

Usage

mdlp2(
  df,
  cl_index = NULL,
  handle_missing = FALSE,
  labels = FALSE,
  skip_nonnumeric = FALSE,
  infinite_bounds = FALSE,
  min_distinct_values = 3
)

Arguments

df

Input data frame.

cl_index

Index of the class variable. If not specified, the last column is used as the class variable.
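The default described for cl_index can be sketched in base R as follows. This is only an illustration of "last column when not specified", not the package's own code.

```r
# Illustration only: resolve a missing class index to the last column.
df <- datasets::iris
cl_index <- NULL

if (is.null(cl_index)) {
  cl_index <- ncol(df)  # fall back to the last column
}

class_name <- names(df)[cl_index]  # "Species" for iris
```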

handle_missing

If set to TRUE, missing values are handled per column: when the column being processed contains missing observations, discretization is run on a subset of the data consisting of that column and the target, with rows containing missing values excluded.
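A minimal base R sketch of the subsetting described for handle_missing is given below. The data frame and the variable names (col_index, pair, pair_complete) are hypothetical, and the code only mirrors the description; it is not taken from the package.

```r
# Hypothetical data: one numeric column with a missing value, plus a class column.
df <- data.frame(
  x = c(1.2, NA, 3.4, 5.6),
  class = factor(c("a", "b", "a", "b"))
)

col_index <- 1  # column being processed
cl_index <- 2   # class (target) column

# Keep only the processed column and the target,
# then drop every row that has a missing value in either of them.
pair <- df[, c(col_index, cl_index)]
pair_complete <- pair[stats::complete.cases(pair), ]
```

Other columns are not affected by this step, so a missing value in one column does not remove rows from the input used for another column.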

labels

A logical indicating how the bins of the discretized data are represented. If TRUE, bins are labelled in interval notation, (a;b]; if FALSE, they are represented by integer codes.
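The two representations can be illustrated with base R's cut(). This is a sketch under assumptions: the cut points are made up, and cut() stands in for whatever the package does internally. Base R writes intervals as (a,b], so the comma is replaced to mimic the (a;b] notation.

```r
# Illustration only: hypothetical values and cut points.
x <- c(4.5, 5.0, 5.8, 6.4, 7.1)
breaks <- c(4, 5.5, 6.5, 8)

# Integer codes: each value is replaced by the index of its bin.
codes <- cut(x, breaks = breaks, labels = FALSE)

# Interval labels: base R produces "(a,b]"; swap the comma for a semicolon.
intervals <- cut(x, breaks = breaks)
levels(intervals) <- gsub(",", ";", levels(intervals), fixed = TRUE)
```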

skip_nonnumeric

If set to TRUE, any non-numeric columns will be skipped.

infinite_bounds

A logical indicating how the bounds of the outermost intervals should be represented.

min_distinct_values

If a column contains fewer than the specified number of distinct values, it is not discretized.
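The check described for min_distinct_values might look like the following base R sketch. The data frame and the names n_distinct and would_discretize are hypothetical. Whether the package counts missing values as a distinct value is not stated here, so the example avoids them.

```r
# Illustration only: hypothetical data with a low-cardinality column.
df <- data.frame(
  few = c(0, 1, 0, 1, 0, 1),
  many = c(0.1, 0.7, 1.3, 2.9, 3.5, 4.2)
)
min_distinct_values <- 3

# Count distinct values per column and compare against the threshold.
n_distinct <- vapply(df, function(col) length(unique(col)), integer(1))
would_discretize <- n_distinct >= min_distinct_values
```

Here the column few has only two distinct values, so it falls below the threshold, while many has six and passes it.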

Value

Discretized data frame. Any non-numeric input columns are returned as is. All returned columns except the class column are factors.

References

Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pp. 1022–1027.

Examples

  # Gives the same result as mdlp(datasets::iris) from the discretization package
  mdlp2(datasets::iris)
  # Uses Sepal.Length (column 1) as the target variable
  mdlp2(df = datasets::iris, cl_index = 1, handle_missing = TRUE, labels = TRUE,
        skip_nonnumeric = TRUE, infinite_bounds = TRUE, min_distinct_values = 30)


[Package arc version 1.4 Index]