ForestDisc {ForestDisc}    R Documentation

Multivariate discretization for supervised learning using Random Forest and moment matching optimization

Description

ForestDisc is a supervised, multivariate and non-parametric discretization algorithm based on tree ensemble learning and moment matching optimization. This version of the algorithm relies on the random forest algorithm to learn a large set of split points that preserves the relationship between the attributes and the target class. It then uses moment matching optimization to reduce this set to a small number of cut points whose statistical properties match those of the initial set of split points as closely as possible. For each attribute to be discretized, the set S of its split points extracted from the random forest is mapped to a reduced set C of cut points of size k.
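The mapping from S to C can be pictured with a minimal base R sketch. It is not the package's implementation: the helper name match_moments, the choice of the first four raw moments, the squared-error objective and the use of stats::optim in place of NLopt are all assumptions made for illustration.

```r
# Hypothetical helper, not part of ForestDisc: pick k cut points whose first
# raw moments are close to those of the split points S.
match_moments <- function(S, k, n_moments = 4) {
  # Raw moments of the split points (the targets to match).
  target <- sapply(seq_len(n_moments), function(p) mean(S^p))
  # Assumed objective: squared distance between the moments of C and of S.
  objective <- function(C) {
    current <- sapply(seq_len(n_moments), function(p) mean(C^p))
    sum((current - target)^2)
  }
  # Start from evenly spaced quantiles of S.
  start <- as.numeric(quantile(S, probs = seq_len(k) / (k + 1)))
  fit <- optim(start, objective, method = "Nelder-Mead")
  sort(fit$par)
}

# Simulated split points for one attribute (illustrative data only).
set.seed(1)
S <- c(rnorm(200, mean = 2, sd = 0.3), rnorm(200, mean = 5, sd = 0.5))
C <- match_moments(S, k = 3)
C
```

In this sketch k is fixed by the caller. The opt_results component described under Value suggests that the package evaluates several sizes, from 2 up to 'max_splits', and keeps the best one.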

Usage

ForestDisc(data,id_target,ntree=50,max_splits=10,opt_meth="NelderMead")

Arguments

data

Data frame to be discretized.

id_target

Column id of the target class.

ntree

Number of trees to grow with the random forest algorithm in order to learn the split points. The default value is 50.

max_splits

Maximum number of cut points used to discretize each continuous attribute in the data. Possible values range from 2 to 10. The default value is 10.

opt_meth

The non-linear optimization algorithm used to find the set of cut points that matches the set of split points as closely as possible. The possible values are "directL" (DIviding RECTangles algorithm), "NelderMead" (Nelder-Mead simplex method) and "SLSQP" (Sequential Least-Squares Quadratic Programming). More details about these algorithms can be found in the documentation of the "NLopt" library. The default value is "NelderMead".

Value

List with components:

Data_disc

Discretized data.

cont_variables

Column ids of the continuous attributes.

Listcutp

List of cut points used to discretize continuous attributes.

cut_points

Data frame summarizing the best solution returned.

opt_results

Data frame summarizing all the solutions returned for the different realizations. Each realization corresponds to one size of the set of cut points, ranging from 2 to 'max_splits'.
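To make the role of a set of cut points concrete, the following base R sketch bins one continuous column with cut(). The cut point values are illustrative placeholders, not output of ForestDisc, and the interval labels the package produces in Data_disc may differ from those generated by cut().

```r
# Illustrative cut points for one continuous column (placeholder values; in
# practice they would be taken from the 'Listcutp' component).
cutp <- c(2.45, 4.75, 5.15)

x <- iris$Petal.Length

# Pad with -Inf and Inf so that every value falls into an interval;
# k cut points yield k + 1 intervals.
x_disc <- cut(x, breaks = c(-Inf, cutp, Inf))

# Cross-tabulate the intervals against the target class.
table(x_disc, iris$Species)
```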

Author(s)

Haddouchi Maïssae

Examples

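data(iris)
Mydata <- iris
id_target <- 5  # Species, the target class, is the fifth column
set.seed(1234)  # the random forest is stochastic, so fix the seed
Mydata_Disc <- ForestDisc(Mydata, id_target)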
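
A further sketch, assuming the package is installed, calls the function with non-default 'ntree' and 'max_splits' and then inspects the components listed under Value. The object name res is arbitrary, and the exact layout of each component is not shown here because it depends on the package version.

```r
library(ForestDisc)

data(iris)
set.seed(1234)

# Grow more trees and allow at most 5 cut points per continuous attribute.
res <- ForestDisc(iris, id_target = 5, ntree = 100, max_splits = 5)

str(res$Data_disc)     # discretized data
res$cont_variables     # column ids of the continuous attributes
res$Listcutp           # cut points used for each continuous attribute
head(res$cut_points)   # summary of the best solution
head(res$opt_results)  # summary of all candidate solutions
```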

[Package ForestDisc version 0.1.0 Index]