Biocomb-package {Biocomb}    R Documentation

Tools for Data Mining

Description

Functions for data analysis with an emphasis on biological data. They handle both numerical and nominal features. Biocomb includes several feature ranking and feature selection algorithms. Feature ranking is based on several criteria, including information gain, symmetrical uncertainty and the chi-squared statistic. The feature selection algorithms are: the Chi2 algorithm, based on the chi-squared test; the fast correlation-based filter algorithm; a feature weighting algorithm (RelieF); a sequential forward search algorithm (CorrSF); and the correlation-based feature selection algorithm (CFS). The package includes several classification algorithms with embedded feature selection and validation schemes. It also includes functions for calculating feature AUC (Area Under the ROC Curve) values with statistical significance analysis and for calculating Area Above the RCC (AAC) values. For two- and multi-class problems, functions are available for HUM (hypervolume under manifold) calculation and for constructing 2D and 3D ROC curves. Relative Cost Curves (RCC) are provided to estimate classifier performance under unequal misclassification costs.
Biocomb also provides a function for handling missing values, which offers different imputation schemes.
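
The ranking criteria named above can be illustrated with a few lines of base R. This is only a sketch of the textbook definitions for nominal features, not the code used inside Biocomb. The helper names (entropy, info_gain, symm_uncertainty) and the toy data are hypothetical.

```r
# Shannon entropy (in bits) of a discrete vector
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a nominal feature x about the class y: H(y) - H(y | x)
info_gain <- function(x, y) {
  h_cond <- sum(sapply(split(y, x), function(ys) {
    length(ys) / length(y) * entropy(ys)
  }))
  entropy(y) - h_cond
}

# Symmetrical uncertainty: information gain normalised to the range [0, 1]
symm_uncertainty <- function(x, y) {
  2 * info_gain(x, y) / (entropy(x) + entropy(y))
}

# Hypothetical toy data: x1 determines the class, x2 is independent of it
y  <- factor(c("a", "a", "a", "a", "b", "b", "b", "b"))
x1 <- factor(c("lo", "lo", "lo", "lo", "hi", "hi", "hi", "hi"))
x2 <- factor(c("lo", "hi", "lo", "hi", "lo", "hi", "lo", "hi"))

ig <- c(x1 = info_gain(x1, y), x2 = info_gain(x2, y))
su <- c(x1 = symm_uncertainty(x1, y), x2 = symm_uncertainty(x2, y))

# Chi-squared statistic from the feature-by-class contingency table
# (warnings about small expected counts are suppressed for the toy data)
chi2_stat <- function(x, y) {
  unname(suppressWarnings(chisq.test(table(x, y), correct = FALSE))$statistic)
}
chi2 <- c(x1 = chi2_stat(x1, y), x2 = chi2_stat(x2, y))

# Rank the features by information gain, best first
ranking <- names(ig)[order(ig, decreasing = TRUE)]
```

All three criteria place x1 ahead of x2 on this data. Numerical features would need to be discretized first before these definitions apply.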

Details

Package: Biocomb
Type: Package
Version: 0.3
Date: 2016-08-14
License: GPL (>= 3)

The Biocomb package provides functions for two stages of the data mining process: feature selection and classification. One of the main functions of Biocomb is select.process, which provides the infrastructure for feature ranking or feature selection on a data set with two or more class labels. The functions compute.aucs, select.inf.gain, select.inf.symm and select.inf.chi2 calculate different criterion measures for each feature in the dataset. The function select.fast.filter implements the fast correlation-based filter method. The function chi2.algorithm performs the Chi2 discretization algorithm with feature selection. The function select.forward.Corr performs a sequential forward feature search according to the correlation measure. The function select.forward.wrapper implements the wrapper feature selection method with a sequential forward search strategy. The auxiliary function ProcessData discretizes the numerical features and is called from several of the feature selection functions.

The second main function of Biocomb is classifier.loop, which provides the infrastructure for constructing classifiers with embedded feature selection, using different schemes for performance validation. The functions compute.aucs, compute.auc.permutation, pauc, pauclog and compute.auc.random calculate feature AUC (Area Under the ROC Curve) values with statistical significance analysis. The function plotRoc.curves constructs the ROC curve in 2D space. The function cost.curve plots the RCC and calculates the corresponding AAC to estimate classifier performance under unequal misclassification costs. The function input_miss handles missing values and implements two methods of missing value imputation.
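
To make the idea of a feature AUC with a significance test concrete, here is a minimal base-R sketch. It uses the rank-sum (Mann-Whitney) identity for the AUC and a simple label-permutation test. It is an assumption that this resembles what compute.aucs and compute.auc.permutation do; the helper names auc_rank and auc_perm_pvalue and the simulated data are hypothetical.

```r
# AUC of a numeric feature for a two-class problem, via the rank-sum identity
auc_rank <- function(x, y, pos) {
  is_pos <- y == pos
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(x)  # average ranks, so ties contribute 0.5
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Permutation p-value: share of label shufflings whose AUC is at least
# as far from 0.5 as the observed one (with the usual +1 correction)
auc_perm_pvalue <- function(x, y, pos, n_perm = 1000) {
  observed <- auc_rank(x, y, pos)
  permuted <- replicate(n_perm, auc_rank(x, sample(y), pos))
  (sum(abs(permuted - 0.5) >= abs(observed - 0.5)) + 1) / (n_perm + 1)
}

# Hypothetical simulated feature that tends to be higher in the "case" class
set.seed(1)
y <- factor(rep(c("control", "case"), each = 20))
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 1.5))

auc <- auc_rank(x, y, pos = "case")
p_value <- auc_perm_pvalue(x, y, pos = "case")
```

An AUC near 0.5 with a large p-value suggests the feature does not separate the two classes; the number of permutations bounds the smallest p-value that can be reported.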
The function generate.data.miss generates a dataset with missing values from the input dataset, so that algorithms designed to handle missing values can be tested. The functions CalculateHUM_seq, CalculateHUM_ROC and CalculateHUM_Plot calculate the HUM and construct 2D and 3D ROC curves.
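
The missing-value workflow mentioned above (generate a dataset with missing values, then impute them) can be sketched in base R as follows. This is not the implementation of generate.data.miss or input_miss: the helpers add_missing and impute_mean are hypothetical, and mean imputation is chosen purely for illustration, since this page does not say which two imputation methods input_miss implements.

```r
# Replace a given fraction of the feature cells with NA, completely at random.
# The class column (by default the last one) is left untouched.
add_missing <- function(data, frac = 0.1, class_col = ncol(data)) {
  feature_cols <- setdiff(seq_len(ncol(data)), class_col)
  n_cells <- nrow(data) * length(feature_cols)
  hit <- sample(n_cells, size = round(frac * n_cells))
  rows <- (hit - 1) %% nrow(data) + 1
  cols <- feature_cols[(hit - 1) %/% nrow(data) + 1]
  for (i in seq_along(hit)) {
    data[[cols[i]]][rows[i]] <- NA
  }
  data
}

# Fill the missing values of each numeric feature with the column mean
impute_mean <- function(data, class_col = ncol(data)) {
  for (j in setdiff(seq_len(ncol(data)), class_col)) {
    if (is.numeric(data[[j]])) {
      miss <- is.na(data[[j]])
      data[[j]][miss] <- mean(data[[j]], na.rm = TRUE)
    }
  }
  data
}

# Hypothetical toy data with the class label in the last column
set.seed(42)
toy <- data.frame(f1 = rnorm(10), f2 = runif(10),
                  class = factor(rep(c("a", "b"), 5)))

toy_miss <- add_missing(toy, frac = 0.2)  # 4 of the 20 feature cells become NA
toy_imp  <- impute_mean(toy_miss)         # no NA values remain
```

Comparing results on toy and toy_imp gives a rough idea of how sensitive a downstream method is to the imputation step.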

Function

select.process           Perform feature ranking or feature selection
compute.aucs             Calculate the AUC values
select.inf.gain          Calculate the information gain criterion
select.inf.symm          Calculate the symmetrical uncertainty criterion
select.inf.chi2          Calculate the chi-squared statistic
select.fast.filter       Select the feature subset with the fast correlation-based filter method
chi2.algorithm           Select the feature subset with the Chi2 discretization algorithm
select.forward.Corr      Select the feature subset with a forward search strategy and correlation measure
select.forward.wrapper   Select the feature subset with a wrapper method
ProcessData              Perform the discretization of the numerical features
classifier.loop          Perform the classification with embedded feature selection
pauc                     Calculate the p-values of the statistical significance of the two-class difference
pauclog                  Calculate the logarithm of the p-values of the statistical significance
compute.auc.permutation  Compute the p-value of the significance of the AUC using the permutation test
compute.auc.random       Compute the p-value of the significance of the AUC using random sample generation
plotRoc.curves           Plot the ROC curve in 2D space
CalculateHUM_seq         Calculate the maximal HUM value and the corresponding permutation of class labels
CalculateHUM_Ex          Calculate the HUM values with exhaustive search for a specified number of class labels
CalculateHUM_ROC         Construct and plot the 2D or 3D ROC curve
CalcGene                 Compute the HUM value for one feature
CalcROC                  Compute the point coordinates to plot the 2D or 3D ROC curve
CalculateHUM_Plot        Plot the 2D ROC curve
Calculate3D              Plot the 3D ROC curve
cost.curve               Plot the RCC and calculate the AAC for unequal misclassification costs
input_miss               Perform the missing value imputation
generate.data.miss       Generate the dataset with missing values

Dataset

This package comes with two simulated datasets and a real dataset of leukemia patients with 72 cases and 101 features. The last feature is the class (disease labels).

Installing and using

To install this package, make sure you are connected to the internet and run the following command at the R prompt:

    install.packages("Biocomb")
  

To load the package in R:

    library(Biocomb)
  

Author(s)

Natalia Novoselova, Junxi Wang, Frank Pessler, Frank Klawonn

Maintainer: Natalia Novoselova <novos65@mail.ru>

References

H. Liu and L. Yu. "Toward Integrating Feature Selection Algorithms for Classification and Clustering", IEEE Trans. on Knowledge and Data Engineering, 17(4), 491-502, 2005.
L. Yu and H. Liu. "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution". In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), Washington, D.C., pp. 856-863, August 21-24, 2003.
Y. Wang, I.V. Tetko, M.A. Hall, E. Frank, A. Facius, K.F.X. Mayer, and H.W. Mewes. "Gene Selection from Microarray Data for Cancer Classification - A Machine Learning Approach", Computational Biology and Chemistry, 29(1), 37-46, 2005.
O. Montvida and F. Klawonn. "Relative Cost Curves: An Alternative to AUC and an Extension to 3-Class Problems", Kybernetika, 50(5), 647-660, 2014.

See Also

The CRAN packages arules and discretization provide feature discretization. The CRAN package pROC provides ROC curves. The CRAN package FSelector provides the chi-squared test and the forward search strategy. The CRAN package pamr provides the nearest shrunken centroid classifier. The CRAN packages MASS, e1071, randomForest, class, nnet and rpart are used in this package.

Examples

data(data_test)
# class label must be factor
data_test[,ncol(data_test)]<-as.factor(data_test[,ncol(data_test)])

# Perform the feature selection using the fast correlation-based filter algorithm
disc="MDL"
threshold=0.2
attrs.nominal=numeric()
out=select.fast.filter(data_test,disc.method=disc,threshold=threshold,
attrs.nominal=attrs.nominal)

# Perform the classification with cross-validation of results
out=classifier.loop(data_test,classifiers=c("svm","lda","rf"),
 feature.selection="auc", flag.feature=FALSE,method.cross="fold-crossval")

# Plot the Relative Cost Curve (RCC) and calculate the AAC for the first feature
## Not run: 
data(data_test)
xllim <- -4
xulim <- 4
yllim <- 30
yulim <- 110

attrs.no=1
pos.Class<-levels(data_test[,ncol(data_test)])[1]
add.legend<-TRUE

aacs<-rep(0,length(attrs.no))
color<-c(1:length(attrs.no))

out <- cost.curve(data_test, attrs.no, pos.Class, col=color[1], add=FALSE,
 xlim=c(xllim,xulim), ylim=c(yllim,yulim))

## End(Not run)

[Package Biocomb version 0.4 Index]