select.inf.chi2 {Biocomb}    R Documentation

Ranks the features

Description

This function calculates feature weights using the chi-squared (\chi^2) statistic and ranks the features accordingly. It can handle both numerical and nominal values. It first discretizes the numerical feature values with one of several optional discretization methods, using the function ProcessData. It then measures the worth of each feature by computing the value of the \chi^2 statistic with respect to the class. The result is a “data.frame” consisting of the following fields: feature (Biomarker) names, values of the chi-squared statistic and the positions of the features in the dataset. The rows of the data.frame are sorted by the chi-squared statistic values. This function is used internally to perform classification with feature selection by the function “classifier.loop” with the argument “Chi-square” for feature selection. The variable “NumberFeature” of the data.frame is passed to the classification function.

Usage

select.inf.chi2(matrix,disc.method,attrs.nominal)

Arguments

matrix

a dataset: a matrix of feature values for several cases, with the class labels in the last column. Class labels can be numerical or character values. The maximum number of classes is ten.

disc.method

a method used for feature discretization. The discretization options include minimal description length (MDL), equal frequency and equal interval width methods.

attrs.nominal

a numerical vector containing the column numbers of the nominal features selected for the analysis.
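
To make the two binning-based options of disc.method concrete, the following is a minimal sketch in base R of equal interval width and equal frequency discretization. It is only an illustration and not the code used by ProcessData; the vector x and the bin count n_bins are hypothetical, and the MDL method is not shown.

```r
# Illustrative only: base-R binning, not the ProcessData implementation.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # one numerical feature with an outlier
n_bins <- 3                               # hypothetical number of bins

# Equal interval width: every bin spans the same range of values.
equal_width <- cut(x, breaks = n_bins, include.lowest = TRUE)

# Equal frequency: bin edges are quantiles, so bins hold similar counts.
edges <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1))
equal_freq <- cut(x, breaks = edges, include.lowest = TRUE)

table(equal_width)
table(equal_freq)
```

With an outlier present, equal-width bins tend to leave most values in a single bin, while equal-frequency bins spread the cases more evenly. This is one reason the choice of discretization method can change the resulting ranking.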

Details

This function's main job is to rank the features according to the chi-squared statistic. See the “Value” section of this page for more details. Before ranking, it calls the ProcessData function to discretize the numerical features.
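
As a rough illustration of the ranking idea, the sketch below computes the Pearson chi-squared statistic of each already discretized feature against the class and sorts the features by it. It is not the package's internal code: the toy data, the helper chi2_stat and the decreasing sort order are assumptions made for this example, and the output merely mimics the documented field names.

```r
# Illustrative only: toy data and base-R computation, not the Biocomb internals.
toy <- data.frame(
  f1 = factor(c("x", "y", "x", "y", "x", "y", "x", "y")),  # weakly associated with the class
  f2 = factor(c("a", "a", "a", "b", "b", "b", "a", "b")),  # matches the class exactly
  class = factor(c("c1", "c1", "c1", "c2", "c2", "c2", "c1", "c2"))
)

# Pearson chi-squared statistic of one discretized feature against the class.
chi2_stat <- function(feature, class) {
  observed <- table(feature, class)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

n_features <- ncol(toy) - 1                 # last column holds the class labels
scores <- sapply(toy[seq_len(n_features)], chi2_stat, class = toy$class)

# Sort by decreasing statistic (assumed order), keeping names and column positions.
ord <- order(scores, decreasing = TRUE)
ranking <- data.frame(
  Biomarker = names(scores)[ord],
  ChiSquare = unname(scores)[ord],
  NumberFeature = ord
)
ranking
```

Here f2 separates the two classes perfectly and receives the larger statistic, so it is listed first even though it is the second column of the data.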

Data can be provided in matrix form, where each row corresponds to a case with its feature values and class label. The columns contain the values of individual features, and the last column must contain the class labels. The maximum number of class labels is 10. The class label column and all nominal features must be defined as factors.

The data may contain a reasonable number of missing values, which must first be preprocessed with one of the imputation methods in the function input_miss.

Value

A data.frame consisting of the following fields:

Biomarker

a character vector of feature names

ChiSquare

a numeric vector of chi-squared values of the features with respect to the class

NumberFeature

a numerical vector of the positions of the features in the dataset
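
One possible way to use these fields is to keep only the top-ranked columns of the dataset. The sketch below assumes a result that is already sorted with the best feature first; the data.frame ranking is a mock with made-up values, and dataset and top_k are hypothetical names introduced for this example.

```r
# Mock result with the documented field names; the values are made up.
ranking <- data.frame(
  Biomarker = c("gene7", "gene2", "gene5", "gene1"),
  ChiSquare = c(12.4, 9.1, 3.3, 0.8),
  NumberFeature = c(7, 2, 5, 1)
)

# Hypothetical dataset: 8 feature columns followed by a class column.
set.seed(1)
dataset <- as.data.frame(matrix(rnorm(10 * 8), nrow = 10))
names(dataset) <- paste0("gene", 1:8)
dataset$class <- factor(rep(c("c1", "c2"), each = 5))

top_k <- 2                                      # hypothetical cutoff
keep <- head(ranking$NumberFeature, top_k)      # positions of the best features
reduced <- dataset[, c(keep, ncol(dataset))]    # keep the class column last
names(reduced)
```

Selecting by NumberFeature rather than by name avoids problems with duplicated or missing column names, at the cost of requiring that the column order of the dataset has not changed since the ranking was computed.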

References

Y. Wang, I.V. Tetko, M.A. Hall, E. Frank, A. Facius, K.F.X. Mayer, and H.W. Mewes, "Gene Selection from Microarray Data for Cancer Classification—A Machine Learning Approach," Computational Biology and Chemistry, vol. 29, no. 1, pp. 37-46, 2005.

See Also

ProcessData, input_miss, select.process

Examples

library(Biocomb)

# example for a dataset without missing values
data(data_test)

# the class label (last column) must be a factor
data_test[, ncol(data_test)] <- as.factor(data_test[, ncol(data_test)])
disc <- "equal interval width"
attrs.nominal <- numeric()  # no nominal features
out <- select.inf.chi2(data_test, disc.method = disc, attrs.nominal = attrs.nominal)

# example for a dataset with missing values
data(leukemia_miss)
xdata <- leukemia_miss

# the class label (last column) must be a factor
xdata[, ncol(xdata)] <- as.factor(xdata[, ncol(xdata)])

# nominal features must be factors
attrs.nominal <- 101
xdata[, attrs.nominal] <- as.factor(xdata[, attrs.nominal])

# impute the missing values before ranking
delThre <- 0.2
imputed <- input_miss(xdata, "mean.value", attrs.nominal, delThre)
if (imputed$flag.miss) {
  xdata <- imputed$data
}
disc <- "equal interval width"
out <- select.inf.chi2(xdata, disc.method = disc, attrs.nominal = attrs.nominal)

[Package Biocomb version 0.4 Index]