complexity {ECoL}

R Documentation

Extract the complexity measures from datasets

Description

This function extracts the complexity measures for classification and regression tasks. The measures take into account the overlap between classes imposed by feature values, the separability and distribution of the data points, and structural properties obtained by representing the dataset as a graph. To set specific parameters for a group, use that group's own characterization function (for example, overlapping or neighborhood).

Usage

complexity(...)

## Default S3 method:
complexity(x, y, groups = "all", summary = c("mean",
  "sd"), ...)

## S3 method for class 'formula'
complexity(formula, data, groups = "all",
  summary = c("mean", "sd"), ...)

Arguments

`...`	Not used.
`x`	A data.frame containing only the input attributes.
`y`	A response vector with one value for each row/component of x.
`groups`	A list of complexity measure groups, or `"all"` to include all of them.
`summary`	A list of summarization functions, or empty to return all values. See the summarization method for more information. (Default: `c("mean", "sd")`)
`formula`	A formula to define the output column.
`data`	A data.frame containing the input and output attributes.
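
The two calling conventions above can be sketched as follows. This is a minimal illustration, not taken from the package sources: it assumes ECoL is installed (the `requireNamespace` guard skips the calls otherwise), and the variable names `x`, `y`, `res_default` and `res_formula` are chosen only for this example.

```r
# Guard: run the ECoL calls only if the package is available.
has_ecol <- requireNamespace("ECoL", quietly = TRUE)

data(iris)
x <- iris[, 1:4]     # data.frame holding the input attributes only
y <- iris$Species    # response vector, one value per row of x

if (has_ecol) {
  # Default S3 method: inputs and response are passed separately.
  res_default <- ECoL::complexity(x, y, groups = "overlapping",
                                  summary = "mean")

  # Formula method: the left-hand side names the output column in `data`.
  res_formula <- ECoL::complexity(Species ~ ., iris, groups = "overlapping",
                                  summary = "mean")

  print(res_default)
  print(res_formula)
}
```

Both calls describe the same dataset, so they would be expected to report the same measures; the formula form is simply more convenient when inputs and output live in one data.frame.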

Details

The following groups are allowed for this method:

"overlapping": The feature overlapping measures characterize how informative the available features are to separate the classes. See overlapping for more details.
"neighborhood": Neighborhood measures characterize the presence and density of same or different classes in local neighborhoods. See neighborhood for more details.
"linearity": Linearity measures try to quantify whether the labels can be linearly separated. See linearity for more details.
"dimensionality": The dimensionality measures compute information on how smoothly the examples are distributed within the attributes. See dimensionality for more details.
"balance": Class balance measures take into account the numbers of examples per class in the dataset. See balance for more details.
"network": Network measures represent the dataset as a graph and extract structural information from it. See network for more details.
"correlation": Correlation measures capture the relationship of the feature values with the outputs. See correlation for more details.
"smoothness": Smoothness measures estimate the smoothness of the function that must be fitted to the data. See smoothness for more details.
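
Restricting the extraction to a few of the groups listed above might look like the sketch below. Two assumptions are made here that this page does not state: that `groups` accepts a character vector of group names, and that "overlapping" and "balance" apply to classification while "correlation" and "smoothness" apply to regression (inferred from the cited references). The names `class_groups`, `regr_groups`, `cls` and `reg` are illustrative only.

```r
# Guard: run the ECoL calls only if the package is available.
has_ecol <- requireNamespace("ECoL", quietly = TRUE)

# Assumed split of groups by task type (not stated on this page).
class_groups <- c("overlapping", "balance")
regr_groups  <- c("correlation", "smoothness")

if (has_ecol) {
  # Classification task: the response (Species) is a factor.
  cls <- ECoL::complexity(Species ~ ., iris, groups = class_groups)

  # Regression task: the response (speed) is numeric.
  reg <- ECoL::complexity(speed ~ ., cars, groups = regr_groups)

  print(cls)
  print(reg)
}
```

Requesting only the groups of interest keeps the output short and avoids computing measures that will not be used.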

Value

A numeric vector named by the requested complexity measures.
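
Since the result is a named numeric vector, it can be reshaped with base R for filtering or joining. The sketch below assumes ECoL is installed; `res` and `tab` are illustrative names, and the exact naming scheme of the returned elements is not documented on this page, so the code only reads the names rather than relying on a particular pattern.

```r
# Guard: run the ECoL call only if the package is available.
has_ecol <- requireNamespace("ECoL", quietly = TRUE)

if (has_ecol) {
  res <- ECoL::complexity(Species ~ ., iris, groups = "overlapping")

  # One row per returned value: the element name and its numeric value.
  tab <- data.frame(measure = names(res), value = unname(res),
                    stringsAsFactors = FALSE)
  print(tab)
}
```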

References

Tin K Ho and Mitra Basu. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 3, 289–300.

Albert Orriols-Puig, Nuria Macia and Tin K Ho. (2010). Documentation for the data complexity library in C++. Technical Report. La Salle - Universitat Ramon Llull.

Ana C Lorena, Aron I Maciel, Pericles B C Miranda, Ivan G Costa and Ricardo B C Prudencio. (2018). Data complexity meta-features for regression problems. Machine Learning, 107, 1, 209–246.

Examples

## Extract all complexity measures for a classification task
data(iris)
complexity(Species ~ ., iris)

## Extract all complexity measures for a regression task
data(cars)
complexity(speed ~ ., cars)

[Package ECoL version 0.3.0 Index]