select.fast.filter {Biocomb}    R Documentation

Select the subset of features

Description

This function selects a subset of features using the fast correlation-based filter (FCBF) method. It handles both numerical and nominal values. It first discretizes the numerical feature values with one of several optional discretization methods, using the function ProcessData. A fast filter can identify relevant features, as well as redundancy among relevant features, without pairwise correlation analysis. The overall complexity of FCBF is O(MN log N), where M is the number of samples and N is the number of features. The result is a “data.frame” containing the feature (Biomarker) names, the information gain values, and the positions of the features in the dataset. The information gain value measures the correlation between a feature and the class. This function is used internally to perform classification with feature selection in the function “classifier.loop”, where the argument “FastFilter” selects this method. The variable “NumberFeature” of the data.frame is passed to the classification function.

Usage

select.fast.filter(matrix,disc.method,threshold,attrs.nominal)

Arguments

matrix

a dataset: a matrix of feature values for several cases, with the class labels in the last column. Class labels can be numerical or character values. The maximal number of classes is ten.

disc.method

a method used for feature discretization. The discretization options include the minimal description length (MDL), equal frequency and equal interval width methods.

threshold

a numeric threshold on the correlation between a feature and the class, used to decide whether the feature is included in the final subset.

attrs.nominal

a numerical vector, containing the column numbers of the nominal features, selected for the analysis.
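
The two unsupervised options named under disc.method can be illustrated in base R. This is only a sketch of the general idea, assuming three bins and a toy vector; it does not call ProcessData and may differ from how Biocomb chooses its cut points.

```r
# Toy numeric feature with one outlier (values are illustrative only)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

# Equal interval width: 3 bins that each span the same range of x
equal_width <- cut(x, breaks = 3, labels = FALSE)

# Equal frequency: 3 bins that each hold about the same number of cases
cut_points <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_freq <- cut(x, breaks = cut_points, include.lowest = TRUE, labels = FALSE)

table(equal_width)  # the outlier leaves most cases in the first bin
table(equal_freq)   # three cases per bin
```

Equal width is sensitive to outliers, while equal frequency keeps the bins balanced; MDL differs from both in that it uses the class labels to place the cut points.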

Details

The main job of this function is to select a subset of informative features according to the correlation between features and class, and between the features themselves. See the “Value” section of this page for more details. Before the selection starts, the function calls ProcessData to discretize the numerical features.
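
The selection idea can be sketched in base R as follows. This is a minimal illustration of the FCBF procedure described by Yu and Liu, which measures correlation with symmetrical uncertainty; whether Biocomb uses exactly this measure and these tie rules internally is an assumption. The helper names (entropy, sym_uncertainty, fcbf_sketch) are hypothetical, and the features are assumed to be already discrete.

```r
# Shannon entropy in bits; with two arguments it is the joint entropy
entropy <- function(...) {
  p <- as.vector(table(...)) / length(..1)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Symmetrical uncertainty between two discrete vectors, in [0, 1]
sym_uncertainty <- function(x, y) {
  hx <- entropy(x)
  hy <- entropy(y)
  if (hx + hy == 0) return(0)
  gain <- hx + hy - entropy(x, y)  # information gain (mutual information)
  2 * gain / (hx + hy)
}

# Hypothetical helper: returns the column indices of the selected features
fcbf_sketch <- function(features, class, threshold) {
  su_class <- vapply(features, sym_uncertainty, numeric(1), y = class)
  # Step 1: keep features relevant to the class, strongest first
  candidates <- which(su_class >= threshold)
  candidates <- candidates[order(su_class[candidates], decreasing = TRUE)]
  selected <- integer(0)
  # Step 2: accept the strongest remaining feature, then drop every other
  # feature that is tied to it at least as strongly as to the class
  while (length(candidates) > 0) {
    best <- candidates[[1]]
    selected <- c(selected, best)
    rest <- candidates[-1]
    redundant <- vapply(rest, function(j) {
      sym_uncertainty(features[[best]], features[[j]]) >= su_class[[j]]
    }, logical(1))
    candidates <- rest[!redundant]
  }
  unname(selected)
}

# Toy data: f2 duplicates f1, f3 is unrelated to the class
toy <- data.frame(
  f1 = factor(c("a", "a", "b", "b", "a", "a", "b", "b")),
  f2 = factor(c("a", "a", "b", "b", "a", "a", "b", "b")),
  f3 = factor(c("x", "y", "x", "y", "x", "y", "x", "y"))
)
class <- factor(c(0, 0, 1, 1, 0, 0, 1, 1))

picked <- fcbf_sketch(toy, class, threshold = 0.2)
names(toy)[picked]  # "f1": f2 is redundant, f3 falls below the threshold
```

Each accepted feature is compared only with the candidates that are still alive, which is how the filter avoids a full pairwise correlation analysis.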

Data can be provided in matrix form, where the rows correspond to cases with feature values and class label. The columns contain the values of individual features and the last column must contain class labels. The maximal number of class labels equals 10. The class label features and all the nominal features must be defined as factors.
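
The layout requirements above can be checked before calling the function. The helper below is hypothetical (check_input is not part of Biocomb) and assumes the dataset is held in a data.frame, as in the examples on this page.

```r
# Hypothetical helper, not part of Biocomb: checks the documented layout
check_input <- function(data, attrs.nominal = numeric()) {
  class_col <- data[[ncol(data)]]
  if (!is.factor(class_col)) {
    stop("the class label (last column) must be a factor")
  }
  if (nlevels(droplevels(class_col)) > 10) {
    stop("at most 10 class labels are supported")
  }
  for (j in attrs.nominal) {
    if (!is.factor(data[[j]])) {
      stop("nominal column ", j, " must be a factor")
    }
  }
  invisible(TRUE)
}

# Toy dataset: one numeric feature, one nominal feature, class label last
toy <- data.frame(
  x1 = c(0.1, 0.5, 0.9),
  x2 = c("lo", "hi", "lo"),
  label = c("A", "B", "A")
)
toy$x2 <- as.factor(toy$x2)
toy$label <- as.factor(toy$label)

check_input(toy, attrs.nominal = 2)
```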

The data may contain a reasonable number of missing values, which must first be preprocessed with one of the imputation methods in the function input_miss.

Value

A returned data.frame consists of the following fields:

Biomarker

a character vector of feature names

Information.Gain

a numeric vector of information gain values for the features according to class

NumberFeature

a numerical vector of the positions of the features in the dataset
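
A typical next step is to reduce the dataset to the selected columns. The sketch below uses a mock result with the three documented fields and a random toy dataset; all names and values in it are made up for illustration and are not the output of a real run.

```r
# Mock result with the documented fields (values are made up)
out <- data.frame(
  Biomarker = c("gene7", "gene2"),
  Information.Gain = c(0.61, 0.34),
  NumberFeature = c(7, 2)
)

# Toy dataset: 8 numeric features, class label in the last column
set.seed(1)
dataset <- as.data.frame(matrix(rnorm(5 * 8), nrow = 5))
names(dataset) <- paste0("gene", 1:8)
dataset$class <- factor(c("A", "B", "A", "B", "A"))

# Keep the selected features, in ranked order, plus the class column
reduced <- dataset[, c(out$NumberFeature, ncol(dataset))]
names(reduced)
```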

References

L. Yu and H. Liu. "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution". In Proceedings of The Twentieth International Conference on Machine Learning (ICML-03), Washington, D.C. pp. 856-863. August 21-24, 2003.

See Also

ProcessData, input_miss, select.process

Examples

# example for dataset without missing values
data(data_test)

# class label must be factor
data_test[, ncol(data_test)] <- as.factor(data_test[, ncol(data_test)])
disc <- "MDL"
threshold <- 0.2
attrs.nominal <- numeric()
out <- select.fast.filter(data_test, disc.method = disc, threshold = threshold,
                          attrs.nominal = attrs.nominal)

# example for dataset with missing values
data(leukemia_miss)
xdata=leukemia_miss

# class label must be factor
xdata[,ncol(xdata)]<-as.factor(xdata[,ncol(xdata)])

# nominal features must be factors
attrs.nominal <- 101
xdata[, attrs.nominal] <- as.factor(xdata[, attrs.nominal])

# impute missing values before feature selection
delThre <- 0.2
out <- input_miss(xdata, "mean.value", attrs.nominal, delThre)
if (out$flag.miss) {
  xdata <- out$data
}
disc <- "MDL"
threshold <- 0.2
out <- select.fast.filter(xdata, disc.method = disc, threshold = threshold,
                          attrs.nominal = attrs.nominal)

[Package Biocomb version 0.4 Index]