SelectionVar {MSclassifR}

R Documentation

Variable selection using random forests, logistic regression methods or sparse partial least squares discriminant analysis (sPLS-DA).

Description

This function performs variable selection (i.e. selection of discriminant mass-over-charge values) using one of the following: a recursive feature elimination (RFE) algorithm coupled with random forests or with logistic regression, sparse partial least squares discriminant analysis (sPLS-DA), or methods based on the distribution of random forest variable importances.

Usage

SelectionVar(X,
             Y,
             MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda", "cvp"),
             MethodValidation = c("cv", "repeatedcv", "LOOCV"),
             PreProcessing = c("center", "scale", "nzv", "corr"),
             Metric = c("Kappa", "Accuracy"),
             Sampling = c("no", "up", "down", "smote"),
             NumberCV = NULL,
             RepeatsCV = NULL,
             Sizes,
             Ntree = 1000,
             ncores = 2,
             threshold = 0.01,
             ncomp.max = 10,
             nbf = 0)

Arguments

`X`	a numeric `matrix` corresponding to a library of mass spectra. Each row of `X` is the intensities of a mass spectrum measured on mass-over-charge values. The columns are assumed to be mass-over-charge values.
`Y`	a `factor` with a length equal to the number of rows in `X` and containing the categories of each mass spectrum in `X`.
`MethodSelection`	a `character` indicating the method used for variable selection. Six methods are available: (1) `"RFERF"` for recursive feature elimination (RFE) coupled with random forests (see `rfe` in the `caret` R package); (2) `"RFEGlmnet"` for RFE coupled with logistic regression; (3) `"VSURF"` for a method using random forests (see `VSURF` in the `VSURF` R package); (4) `"sPLSDA"` for a method based on sparse partial least squares discriminant analysis (see `splsda` in the `mixOmics` R package); (5) `"mda"` for a method selecting variables from the distribution of the "mean decrease in accuracy" variable importances of a random forest (see the `importance` function in the `randomForest` R package); (6) `"cvp"` for a method selecting variables from the distribution of the cross-validated permutation variable importances of a random forest (see the `CVPVI` function in the `vita` R package). Additional explanations are available in the Details section.
`MethodValidation`	a `character` indicating the resampling method: `"cv"` for cross-validation; `"repeatedcv"` for repeated cross-validation; and `"LOOCV"` for leave-one-out cross-validation. Only used for the `"RFERF"`, `"RFEGlmnet"` and `"sPLSDA"` methods.
`NumberCV`	a `numeric` value indicating the number of folds (K) for cross-validation. Only used for the `"RFERF"`, `"RFEGlmnet"`, `"sPLSDA"` and `"cvp"` methods.
`RepeatsCV`	a `numeric` value indicating the number of repeats for K-fold cross-validation or repeated cross-validation. Only used for the `"RFERF"`, `"RFEGlmnet"` and `"sPLSDA"` methods.
`PreProcessing`	a `vector` indicating the method(s) used to pre-process the mass spectra in `X`: centering (`"center"`), scaling (`"scale"`), eliminating near zero variance predictors (`"nzv"`), or correlated predictors (`"corr"`). Only used for the `"RFERF"`, `"RFEGlmnet"` and `"sPLSDA"` methods.
`Metric`	a `character` indicating the metric used to select the optimal model for the RFE algorithms. Possible metrics are the `"Kappa"` coefficient or the `"Accuracy"`. This argument is not used for the `"VSURF"`, `"cvp"`, `"mda"` and `"sPLSDA"` methods of `MethodSelection`. See the Details section.
`Sampling`	a `character` indicating an optional subsampling method to handle imbalanced datasets: subsampling methods are either `"no"` (no subsampling), `"up"`, `"down"` or `"smote"`. `"no"` by default.
`Sizes`	a numeric `vector` indicating the number of variables to select. Only used for the `"RFERF"`, `"RFEGlmnet"` and `"sPLSDA"` methods. For the `"RFERF"` and `"RFEGlmnet"` methods, the final number of selected variables is the one giving the highest average `"Metric"` (`"Accuracy"` or `"Kappa"`) on the folds used for cross-validation. It is thus bounded by `NumberCV*max(Sizes)`. For the `"sPLSDA"` method, `Sizes` corresponds to the number of variables to test from the `X` dataset when estimating the sparse PLS-DA model (see `test.keepX` argument in the `mixOmics` R package).
`Ntree`	a `numeric` value indicating the number of trees in the random forests. Only used if `MethodSelection` is `"VSURF"`, `"mda"` or `"cvp"`. We advise choosing a number much larger than the total number of variables for a robust selection (so that no feature is missed in the subspaces used to build the trees). 1000 by default.
`ncores`	a `positive integer` indicating the number of cores to use, i.e. the maximum number of child processes run simultaneously. Only used for the `"cvp"` method. Must be at least one; parallelization requires at least two cores. If `ncores=0`, half of the CPU cores on the current host are used.
`ncomp.max`	a `positive integer` indicating the maximum number of components that can be included in the sPLS-DA model (10 by default).
`threshold`	a `numeric` value corresponding to a threshold used for the optimal selection of the number of components included in the sPLS-DA model (0.01 by default). When the balanced classification error rate (BER) no longer decreases as the number of components increases, we keep the smallest number of components at which the BER reaches a plateau (i.e. when `BER(N)-BER(N+1)<threshold`, we keep `N`). If no plateau is reached, `ncomp.max` components are selected.
`nbf`	a `numeric` value corresponding to a number of simulated non-discriminant features. This is used to improve the robustness of the estimation of the distribution of the variable importances for non-discriminant features. Only used for the `"mda"` and `"cvp"` methods. 0 by default: no additional non-discriminant feature is created.
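
The plateau rule described for `threshold` can be illustrated with a minimal base-R sketch. The helper `choose_ncomp` and the BER values below are hypothetical and only mirror the rule as stated above; they are not the package's internal code.

```r
# Hypothetical helper: pick the number of components from a vector of BER
# values, where ber[N] is the BER of a model with N components.
choose_ncomp <- function(ber, threshold = 0.01) {
  gains <- ber[-length(ber)] - ber[-1]      # BER(N) - BER(N+1) for each N
  plateau <- which(gains < threshold)       # N values where the gain is too small
  if (length(plateau) > 0) plateau[1] else length(ber)  # else: maximum (ncomp.max)
}

# Made-up BER values for models with 1 to 5 components
ber <- c(0.40, 0.25, 0.15, 0.145, 0.144)
n_keep <- choose_ncomp(ber, threshold = 0.01)

# A curve that keeps improving never reaches a plateau
n_keep_no_plateau <- choose_ncomp(c(0.5, 0.4, 0.3, 0.2), threshold = 0.01)
```

With these made-up values, the first gain below the threshold occurs between 3 and 4 components, so 3 components are kept; in the second call the maximum number of components is returned.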

Details

The selection of variables can be carried out with two different objectives: either to find a minimal number of variables giving the highest possible accuracy (or Kappa coefficient), which may involve eliminating variables that are correlated with one another (i.e. that bring no additional predictive power with respect to other variables); or to find all the variables in the dataset with potential predictive power ("discriminant" variables).

The VSURF method attempts to accomplish only the first objective. The mda and cvp methods attempt to accomplish the second objective, as do the methods available in the SelectionVarStat function of our MSclassifR R package. The RFERF, RFEGlmnet and sPLSDA methods take as input a number of variables to be selected (Sizes argument), and can therefore be used with both objectives.

Within the framework of the second objective, either the mda or cvp methods can be used to estimate a number of discriminant variables from the importances of variables. The SelectionVarStat function can also be used to estimate this number from distributions of p-values. Note that the Ntree argument must be high enough to get a robust estimation with the mda or cvp methods.

The "RFEGlmnet" and "RFERF" methods are based on recursive feature elimination and can either optimize the kappa coefficient or the accuracy as metrics when selecting variables.
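
As a reminder of what the two metrics measure, here is a small base-R sketch computing the accuracy and Cohen's Kappa coefficient from a confusion matrix. The labels are made up for illustration, and this is not the code used internally by `caret`.

```r
# Made-up observed and predicted categories for 10 spectra
obs  <- factor(c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"))
pred <- factor(c("A", "A", "A", "B", "B", "B", "B", "B", "B", "A"))

cm <- table(pred, obs)                             # confusion matrix
n  <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # observed agreement
p_chance <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)
```

Kappa corrects the accuracy for the agreement expected by chance, which makes it more informative than the accuracy when categories are imbalanced.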

The "sPLSDA" method selects variables from the ones kept in latent components of the sparse PLS-DA model using an automatic choice of the number of components (when the balanced classification error rate (BER) reaches a plateau - see argument threshold).

The "mda" and "cvp" methods use the distribution of variable importances to estimate the number of discriminant features (mass-over-charge values). Briefly, the distribution of variable importances for useless (non-discriminant) features is first estimated from the variables with negative importances, using the method proposed in section 2.6 of Janitza et al. (2018). Next, the following mixture model is assumed: F(x)=\pi\times F_u(x)+(1-\pi)\times F_d(x), where F is the empirical cumulative distribution function of the variable importances of all the features, F_u that of the useless features, F_d that of the discriminant features, and \pi the proportion of useless features in the dataset. From the estimated distribution of useless features, we estimate quantile values x_q and compute \epsilon_q=\min(F(x_q)/q, 1) for each quantile level q. Since F_u(x_q)=q, the mixture model gives F(x_q)\geq \pi\times q, so each \epsilon_q is an upper bound on \pi. The minimum of the \epsilon_q is taken as the estimated proportion of useless features in the dataset, which gives the estimated number of discriminant features N_d=floor(N\times (1 - \pi)), where N is the total number of features. Finally, the N_d features with the highest variable importances are selected.
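
The estimation steps above can be sketched in base R on simulated importances. This is an illustration under stated assumptions, not the package's implementation: the importances are simulated, the quantile grid is arbitrary, and the null distribution is approximated by mirroring the negative importances around zero, which is our reading of the approach of Janitza et al. (2018).

```r
set.seed(1)
# Simulated importances (hypothetical data): 80 useless features centred on
# zero and 20 discriminant features with clearly positive importances.
vi <- c(rnorm(80, mean = 0, sd = 1), rnorm(20, mean = 6, sd = 1))
N  <- length(vi)

# Assumption: null sample built from negative importances, their mirror
# images, and exact zeros.
neg     <- vi[vi < 0]
null_vi <- c(neg, -neg, vi[vi == 0])

F_all  <- ecdf(vi)                        # F: all features
q      <- seq(0.05, 0.95, by = 0.05)      # quantile levels (arbitrary grid)
x_q    <- quantile(null_vi, probs = q)    # quantiles of the null distribution F_u
eps_q  <- pmin(F_all(x_q) / q, 1)         # epsilon_q
pi_hat <- min(eps_q)                      # estimated proportion of useless features
N_d    <- floor(N * (1 - pi_hat))         # estimated number of discriminant features

# Indices of the N_d features with the highest importances
sel <- order(vi, decreasing = TRUE)[seq_len(N_d)]
```

Because each \epsilon_q overestimates \pi, the resulting N_d tends to be a conservative estimate of the number of discriminant features.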

The "VSURF" and "sPLSDA" methods select variables by minimizing the mean out-of-bag (OOB) error rate and the balanced classification error rate (BER), respectively.
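
For readers unfamiliar with the BER, the following base-R sketch contrasts it with the overall error rate on made-up labels. The BER is the average of the per-category error rates, so a small category weighs as much as a large one.

```r
# Made-up labels: 8 spectra of category A and 2 of category B
obs  <- factor(c(rep("A", 8), rep("B", 2)))
pred <- factor(c(rep("A", 8), "A", "B"), levels = levels(obs))

cm <- table(obs, pred)                         # rows: observed, columns: predicted
per_class_err <- 1 - diag(cm) / rowSums(cm)    # error rate within each category
ber <- mean(per_class_err)                     # balanced classification error rate
overall_err <- 1 - sum(diag(cm)) / sum(cm)     # usual (unbalanced) error rate
```

Here one of the two B spectra is misclassified, so the BER (0.25) is much higher than the overall error rate (0.10).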

The Sampling methods available for imbalanced data are: "up", the up-sampling method, which randomly samples the minority class (with replacement) so that it reaches the same size as the majority class; "down", the down-sampling method, which randomly samples the majority class (without replacement) so that its frequency matches that of the minority class; and "smote", the Synthetic Minority Over-sampling Technique (SMOTE), a data augmentation algorithm that creates new data from the minority class using the K-nearest-neighbor algorithm.
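
The "up" and "down" strategies can be sketched with base R on row indices. This is a minimal illustration with made-up categories, not the code used by the package; in this sketch, up-sampling keeps all original rows and adds randomly drawn duplicates of the minority class.

```r
set.seed(42)
# Made-up imbalanced categories: 8 spectra of A, 3 of B
Y <- factor(c(rep("A", 8), rep("B", 3)))

idx_by_class <- split(seq_along(Y), Y)     # row indices per category
n_max <- max(lengths(idx_by_class))        # size of the majority class
n_min <- min(lengths(idx_by_class))        # size of the minority class

# Up-sampling: add duplicates (drawn with replacement) to smaller classes
up_idx <- unlist(lapply(idx_by_class, function(i) {
  extra <- i[sample.int(length(i), n_max - length(i), replace = TRUE)]
  c(i, extra)
}))

# Down-sampling: draw n_min rows (without replacement) from each class
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))

table(Y[up_idx])    # both categories at the majority size
table(Y[down_idx])  # both categories at the minority size
```

The resampled matrix would then be obtained by indexing the rows of `X` with the same indices, e.g. `X[up_idx, ]`.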

See rfe in the caret R package, VSURF in the VSURF R package, splsda in the mixOmics R package, importance function in the randomForest R package, and CVPVI function in the vita R package for more details.

Value

A list composed of:

sel_moz

a vector with discriminant mass-over-charge values.

For the "RFERF" and "RFEGlmnet" methods, it also returns the results of the rfe function of the caret R package.

For the "VSURF" method, it also returns the results of the VSURF function of the VSURF R package.

For the "sPLSDA" method, it also returns the following items:

`Raw_data`	a horizontal bar plot showing the contribution of features on each component.
`selected_variables`	a `data frame` with the unique selected features (variables to keep) and their contributions to the classification of samples. See `plotLoadings` in the `mixOmics` R package for details.

For the "mda" and "cvp" methods, it also returns the following items:

`nb_to_sel`	a numeric value corresponding to an estimated number of mass-over-charge values where the intensities are significantly different between categories (see details).
`imp_sel`	a vector containing the variable importances for the selected features.

References

Kuhn, Max. (2012). The caret Package. Journal of Statistical Software. 28.

Genuer, R., Poggi, J.-M., Tuleau-Malot, C. (2015). VSURF: An R Package for Variable Selection Using Random Forests. The R Journal, 7, 19.

Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.

Kim-Anh Le Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1 (January 2002), 321–357.

Branco P, Ribeiro R, Torgo L (2016). “UBL: an R Package for Utility-Based Learning.” CoRR, abs/1604.08079.

Janitza, S., Celik, E., Boulesteix, A. L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12, 885-915.

Examples




library("MSclassifR")
library("MALDIquant")

###############################################################################
## 1. Pre-processing of mass spectra

# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)

###############################################################################
## 2. Perform variable selection using SelectionVar with RFE and random forest
# with 5 to 10 variables,
# the up-sampling method and the Kappa coefficient as training metric
a <- SelectionVar(X,
                  Y,
                  MethodSelection = c("RFERF"),
                  MethodValidation = c("cv"),
                  PreProcessing = c("center","scale","nzv","corr"),
                  NumberCV = 2,
                  Metric = "Kappa",
                  Sizes = c(5:10),
                  Sampling = "up")

# Plotting peaks on the first pre-processed mass spectrum and highlighting the 
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=a$sel_moz,col_spec="blue",col_peak="black")

###############################################################################
## 3. Perform variable selection using SelectionVar with VSURF
# This call can take a few minutes
b <- SelectionVar(X, Y, MethodSelection = c("VSURF"))
summary(b$result)

###############################################################################
## 4. Perform variable selection using SelectionVar with "mda" or "cvp"
# option 1: Using mean decrease in accuracy  
# with no sampling method
c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X)) 

# Estimation of the number of peaks to discriminate species
c$nb_to_sel

# Discriminant mass-over-charge values 
c$sel_moz

# Plotting peaks on the first pre-processed mass spectrum and highlighting the 
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=c$sel_moz,col_spec="blue",col_peak="black")

# option 2: Using cross-validated permutation variable importance measures (more time-consuming)
# with no sampling method
d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000)

# Estimation of the number of peaks to discriminate species
d$nb_to_sel

# Discriminant mass-over-charge values 
d$sel_moz

# Plotting peaks on the first pre-processed mass spectrum and highlighting the 
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
            Peaks2=d$sel_moz,col_spec="blue",col_peak="black")

# Mass-over-charge values found with both methods ("mda" and "cvp")
intersect(c$sel_moz,d$sel_moz)

[Package MSclassifR version 0.3.3 Index]