SelectionVar {MSclassifR} | R Documentation |
Variable selection using random forests, logistic regression methods or sparse partial least squares discriminant analysis (sPLS-DA).
Description
This function performs variable selection (i.e. selection of discriminant mass-over-charge values) using one of the following: the recursive feature elimination (RFE) algorithm with a random forest or a logistic regression model, sparse partial least squares discriminant analysis (sPLS-DA), or methods based on the distribution of random forest variable importances.
Usage
SelectionVar(X,
Y,
MethodSelection = c("RFERF", "RFEGlmnet", "VSURF", "sPLSDA", "mda", "cvp"),
MethodValidation = c("cv", "repeatedcv", "LOOCV"),
PreProcessing = c("center", "scale", "nzv", "corr"),
Metric = c("Kappa", "Accuracy"), Sampling = c("no", "up","down", "smote"),
NumberCV = NULL,
RepeatsCV = NULL,
Sizes,
Ntree = 1000,
ncores = 2,
threshold = 0.01,
ncomp.max = 10,
nbf=0)
Arguments
X |
a numeric |
Y |
a |
MethodSelection |
a |
MethodValidation |
a |
NumberCV |
a |
RepeatsCV |
a |
PreProcessing |
a |
Metric |
a |
Sampling |
a |
Sizes |
a numeric |
Ntree |
a |
ncores |
a |
ncomp.max |
a |
threshold |
a |
nbf |
a |
Details
The selection of variables can be carried out with two different objectives: either to find the minimum number of variables that yields the highest possible accuracy (or Kappa coefficient), which may involve eliminating variables that are correlated with one another (i.e. that bring no additional predictive power with respect to other variables); or to find all the variables in the dataset with potential predictive power ("discriminant" variables).
The VSURF method attempts to accomplish only the first objective.
The mda and cvp methods attempt to accomplish the second objective, as do the methods available in the SelectionVarStat function of our MSclassifR R package.
The RFERF, RFEGlmnet and sPLSDA methods take as input a number of variables to be selected (Sizes argument), and can therefore be used with both objectives.
Within the framework of the second objective, either the mda or cvp methods can be used to estimate a number of discriminant variables from the importances of variables. The SelectionVarStat function can also be used to estimate this number from distributions of p-values. Of note, be sure that the Ntree argument is high enough to get a robust estimation with the mda or cvp methods.
The "RFEGlmnet" and "RFERF" methods are based on recursive feature elimination and can optimize either the Kappa coefficient or the accuracy as the metric when selecting variables.
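To make the difference between the two metrics concrete, the following is a minimal base R sketch that computes the accuracy and Cohen's Kappa coefficient from observed and predicted classes. The helper name accuracy_and_kappa and the toy labels are assumptions made for illustration; they are not part of MSclassifR or caret, and this is not how those packages compute the metrics internally.

```r
# Hypothetical helper for illustration only; not part of MSclassifR or caret.
accuracy_and_kappa <- function(observed, predicted) {
  # Use a common set of levels so that the confusion matrix is square
  lev <- union(levels(factor(observed)), levels(factor(predicted)))
  tab <- table(factor(observed, levels = lev),
               factor(predicted, levels = lev))
  n <- sum(tab)
  # Observed agreement: proportion of correctly classified spectra
  p_obs <- sum(diag(tab)) / n
  # Agreement expected by chance, from the marginal class frequencies
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
  # Kappa is undefined (NaN) when p_exp equals 1
  c(Accuracy = p_obs, Kappa = (p_obs - p_exp) / (1 - p_exp))
}

# Toy labels (assumed data): 4 spectra of class "A", 2 of class "B"
observed  <- c("A", "A", "A", "A", "B", "B")
predicted <- c("A", "A", "A", "B", "B", "A")
metrics <- accuracy_and_kappa(observed, predicted)
print(metrics)
```

Because Kappa subtracts the agreement expected by chance, it is usually the more informative of the two metrics when class frequencies are unbalanced.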
The "sPLSDA" method selects variables from those kept in the latent components of the sparse PLS-DA model, with the number of components chosen automatically (when the balanced classification error rate (BER) reaches a plateau; see the threshold argument).
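As an illustration of what "reaching a plateau" can mean, here is a minimal sketch of one possible plateau rule: keep adding components until the decrease in BER brought by the next component falls below a threshold. The function choose_ncomp, the toy BER values and the exact stopping rule are assumptions made for this example; the rule actually applied by SelectionVar may differ.

```r
# Hypothetical plateau rule for illustration; the rule implemented in
# MSclassifR may differ.
choose_ncomp <- function(ber, threshold = 0.01) {
  # gain[k] is the decrease in BER when going from k to k + 1 components
  gain <- -diff(ber)
  plateau <- which(gain < threshold)
  # If the BER never plateaus, keep all the components that were evaluated
  if (length(plateau) == 0) length(ber) else plateau[1]
}

# Toy BER values (assumed data) for models with 1 to 5 components
ber_by_ncomp <- c(0.30, 0.20, 0.15, 0.148, 0.147)
ncomp <- choose_ncomp(ber_by_ncomp, threshold = 0.01)
print(ncomp)
```

With these toy values, the fourth component lowers the BER by only 0.002, so the sketch keeps three components.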
The "mda" and "cvp" methods use the distribution of variable importances to estimate the number of discriminant features (mass-over-charge values). Briefly, the distribution of variable importances for useless (non-discriminant) features is first estimated from the variables with negative importance, using the method proposed in section 2.6 of Janitza et al. (2018). Next, the following mixture model is assumed:
F(x)=\pi\times F_u(x)+(1-\pi)\times F_d(x)
where F is the empirical cumulative distribution of the variable importances of all the features, F_u that of the useless features, F_d that of the discriminant features, and \pi is the proportion of useless features in the dataset.
From the estimated distribution of useless features, we can estimate quantile values x_q and compute \epsilon_q=\min(F(x_q)/q, 1) for each quantile q. The minimum of the \epsilon_q values is the estimated proportion \pi of useless features in the dataset, which allows the number of discriminant features to be estimated as N_d=floor(N\times (1 - \pi)), where N is the total number of features. Finally, the N_d features with the highest variable importances are selected.
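The estimation steps described above can be sketched in base R as follows. This is an illustrative sketch, not the code used by SelectionVar: the function estimate_n_discriminant, the grid of quantiles and the simulated importances are assumptions, and the null distribution is built by mirroring the negative importances around zero, which is our reading of Janitza et al. (2018).

```r
# Hypothetical helper for illustration; not the MSclassifR implementation.
estimate_n_discriminant <- function(imp, probs = seq(0.05, 0.95, by = 0.05)) {
  neg <- imp[imp < 0]
  if (length(neg) == 0) stop("no negative importances to build the null")
  # Null distribution of useless features: negative importances, zeros,
  # and the negative importances mirrored around zero (assumption)
  null_imp <- c(neg, imp[imp == 0], -neg)
  F_all <- ecdf(imp)                               # F in the mixture model
  x_q <- quantile(null_imp, probs = probs, names = FALSE)
  eps_q <- pmin(F_all(x_q) / probs, 1)             # epsilon_q for each q
  pi_hat <- min(eps_q)                             # proportion of useless features
  n_d <- floor(length(imp) * (1 - pi_hat))         # N_d
  # Indices of the N_d features with the highest importances
  selected <- order(imp, decreasing = TRUE)[seq_len(n_d)]
  list(pi = pi_hat, nb_to_sel = n_d, selected = selected)
}

# Simulated importances (assumed data): 80 useless features centred on zero
# followed by 20 clearly discriminant features
set.seed(42)
imp <- c(rnorm(80, mean = 0, sd = 0.01), runif(20, min = 0.5, max = 1))
res <- estimate_n_discriminant(imp)
print(res$pi)
print(res$nb_to_sel)
```

The estimate depends on the grid of quantiles and on how many features have negative importance, which is one reason why a sufficiently high Ntree value matters for these methods.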
The "VSURF" and "sPLSDA" methods select variables by minimizing the mean out-of-bag (OOB) error rate and the balanced classification error rate (BER), respectively.
The Sampling argument offers the following methods for unbalanced data: "up" corresponds to up-sampling, which randomly samples the minority class (with replacement) until it is the same size as the majority class; "down" corresponds to down-sampling, which randomly samples the majority class (without replacement) so that its frequency matches that of the minority class; "smote" corresponds to the Synthetic Minority Over-sampling Technique (SMOTE), a data augmentation algorithm that creates new data from the minority class using the K-nearest neighbors algorithm.
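The "up" and "down" strategies can be sketched in base R as below. The helper resample_classes and the toy data are assumptions made for illustration; SelectionVar does not necessarily proceed this way internally, and SMOTE is left out because it requires a dedicated implementation.

```r
# Hypothetical helper for illustration; not part of MSclassifR.
resample_classes <- function(x, y, method = c("up", "down")) {
  method <- match.arg(method)
  idx_by_class <- split(seq_along(y), y)
  sizes <- lengths(idx_by_class)
  # Every class is brought to the majority size ("up") or minority size ("down")
  target <- if (method == "up") max(sizes) else min(sizes)
  idx <- unlist(lapply(idx_by_class, function(i) {
    if (method == "up") {
      # Keep all original rows and add rows drawn with replacement
      extra <- i[sample.int(length(i), target - length(i), replace = TRUE)]
      c(i, extra)
    } else {
      # Keep a random subset of rows drawn without replacement
      i[sample.int(length(i), target)]
    }
  }), use.names = FALSE)
  list(x = x[idx, , drop = FALSE], y = y[idx])
}

# Toy unbalanced data (assumed): 8 spectra of class "A", 3 of class "B"
set.seed(1)
y <- factor(c(rep("A", 8), rep("B", 3)))
x <- matrix(rnorm(length(y) * 2), ncol = 2)
up <- resample_classes(x, y, "up")
down <- resample_classes(x, y, "down")
print(table(up$y))
print(table(down$y))
```

Up-sampling keeps every original spectrum but duplicates some of the minority class, whereas down-sampling discards spectra from the majority class.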
See rfe in the caret R package, VSURF in the VSURF R package, splsda in the mixOmics R package, the importance function in the randomForest R package, and the CVPVI function in the vita R package for more details.
Value
A list composed of:
sel_moz |
a |
For the "RFERF" and "RFEGlmnet" methods, it also returns the results of the rfe function of the caret R package.
For the "VSURF" method, it also returns the results of the VSURF function of the VSURF R package.
For the "sPLSDA" method, it also returns the following items:
Raw_data |
a horizontal bar plot and containing the contribution of features on each component. |
selected_variables |
|
For the "mda" and "cvp" methods, it also returns the following items:
nb_to_sel |
a numeric value corresponding to an estimated number of mass-over-charge values where the intensities are significantly different between categories (see details). |
imp_sel |
a vector containing the variable importances for the selected features. |
References
Kuhn, Max. (2012). The caret Package. Journal of Statistical Software. 28.
Genuer, Robin, Jean-Michel Poggi and Christine Tuleau-Malot. VSURF : An R Package for Variable Selection Using Random Forests. R J. 7 (2015): 19.
Friedman J, Hastie T, Tibshirani R (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.
Kim-Anh Le Cao, Florian Rohart, Ignacio Gonzalez, Sebastien Dejean with key contributors Benoit Gautier, Francois Bartolo, contributions from Pierre Monget, Jeff Coquery, FangZou Yao and Benoit Liquet. (2016). mixOmics: Omics Data Integration Project. R package version 6.1.1. https://CRAN.R-project.org/package=mixOmics
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1 (January 2002), 321–357.
Branco P, Ribeiro R, Torgo L (2016). “UBL: an R Package for Utility-Based Learning.” CoRR, abs/1604.08079.
Janitza, S., Celik, E., Boulesteix, A. L. (2018). A computationally fast variable importance test for random forests for high-dimensional data. Advances in Data Analysis and Classification, 12, 885-915.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Perform variable selection using SelectionVar with RFE and random forest
# with 5 to 10 variables,
# up sampling method and trained with the Kappa coefficient metric
a <- SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10),
Sampling = "up")
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=a$sel_moz,col_spec="blue",col_peak="black")
###############################################################################
## 3. Perform variable selection using SelectionVar with VSURF
# This call can take a few minutes
b <- SelectionVar(X, Y, MethodSelection = c("VSURF"))
summary(b$result)
###############################################################################
## 4. Perform variable selection using SelectionVar with "mda" or "cvp"
# option 1: Using mean decrease in accuracy
# with no sampling method
c <- SelectionVar(X,Y,MethodSelection="mda",Ntree=10*ncol(X))
# Estimation of the number of peaks to discriminate species
c$nb_to_sel
# Discriminant mass-over-charge values
c$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=c$sel_moz,col_spec="blue",col_peak="black")
# option 2: Using cross-validated permutation variable importance measures (more time-consuming)
# with no sampling method
d <- SelectionVar(X,Y,MethodSelection="cvp",NumberCV=2,ncores=2,Ntree=1000)
# Estimation of the number of peaks to discriminate species
d$nb_to_sel
# Discriminant mass-over-charge values
d$sel_moz
# Plotting peaks on the first pre-processed mass spectrum and highlighting the
# discriminant mass-over-charge values with red lines
PlotSpectra(SpectralData=spectra[[1]],Peaks=peaks[[1]],
Peaks2=d$sel_moz,col_spec="blue",col_peak="black")
# Mass-over-charge values found with both methods ("mda" and "cvp")
intersect(c$sel_moz,d$sel_moz)