SelectionVarStat {MSclassifR}    R Documentation

Variable selection using multiple statistical tests.

Description

This function performs a statistical test for each mass-over-charge value to determine which ones discriminate between categories. From the distribution of the resulting p-values, it estimates the expected number of discriminant features and computes adjusted p-values that can be used to control the false discovery rate at a chosen threshold.
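As a rough illustration of the idea (not the package's internal code), the per-feature testing step can be sketched in base R. The simulated matrix `X_sim`, the factor `Y_sim` and the vector `p_per_feature` are hypothetical names introduced here. The Kruskal-Wallis test stands in for any of the supported tests, and a plain Benjamini-Hochberg adjustment is used rather than the adaptive version.

```r
set.seed(1)
n_per_class <- 10
Y_sim <- factor(rep(c("A", "B", "C"), each = n_per_class))

# 30 simulated spectra (rows) x 20 features (columns) of pure noise
X_sim <- matrix(rnorm(length(Y_sim) * 20), nrow = length(Y_sim))
colnames(X_sim) <- paste0("mz", seq_len(ncol(X_sim)))

# make the first three features discriminant by shifting class "B"
X_sim[Y_sim == "B", 1:3] <- X_sim[Y_sim == "B", 1:3] + 4

# one Kruskal-Wallis test per column (i.e. per mass-over-charge value)
p_per_feature <- apply(X_sim, 2, function(v) kruskal.test(x = v, g = Y_sim)$p.value)

# Benjamini-Hochberg adjusted p-values and features kept at a 5 per cent FDR
adj <- p.adjust(p_per_feature, method = "BH")
selected <- names(adj)[adj <= 0.05]
selected
```

Each column is tested independently, which is why features whose intensities vary together can all end up selected.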

Usage


SelectionVarStat(X,
                 Y,
                 stat.test = "Limma",
                 pi0.method="abh",
                 fdr=0.05,
                 Sampling = c("no", "up","down", "smote"))

Arguments

X

a numeric matrix corresponding to a library of mass spectra. Each row of X contains the intensities of one mass spectrum, and each column corresponds to a mass-over-charge value.

Y

a factor with a length equal to the number of rows in X and containing the categories of each mass spectrum in X.

stat.test

a character string among "anova", "kruskal", or "Limma" (default). It specifies the test used to determine whether the intensity measured at a mass-over-charge value differs significantly between categories. "anova" is for a classical ANOVA Fisher test, "kruskal" is for the Kruskal-Wallis test, "Limma" is for an ANOVA Fisher test using the limma R package.

pi0.method

a character string among "abh", "st.spline", "st.boot", "langaas", "histo", "pounds", "jiang", "slim". It specifies the statistical method used to estimate the proportion of true null hypotheses among the set of tested mass-over-charge values. See the estim.pi0 function of the R package cp4p for details.

fdr

a numeric value giving the false discovery rate threshold used to determine the differential mass-over-charge values. 0.05 by default.

Sampling

a character string indicating an optional subsampling method to handle imbalanced datasets: either "no" (no subsampling), "up", "down" or "smote". "no" by default.
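To make the Sampling options more concrete, here is a minimal base R sketch of what down-sampling an imbalanced dataset can look like. The objects `X_imb`, `Y_imb`, `X_down` and `Y_down` are hypothetical, and how the package itself implements its "down" option is not stated here, so this is an illustration of the general technique only.

```r
set.seed(42)
# hypothetical imbalanced labels: 12 spectra of class "A", 4 of class "B"
Y_imb <- factor(c(rep("A", 12), rep("B", 4)))
X_imb <- matrix(runif(length(Y_imb) * 5), nrow = length(Y_imb))

# down-sampling: keep as many rows per class as the smallest class contains
n_min <- min(table(Y_imb))
keep <- unlist(lapply(split(seq_along(Y_imb), Y_imb),
                      function(idx) idx[sample.int(length(idx), n_min)]))

X_down <- X_imb[keep, , drop = FALSE]
Y_down <- Y_imb[keep]
table(Y_down)
```

Up-sampling works the other way round (rows of the smaller classes are drawn with replacement until all classes match the largest one), while SMOTE generates synthetic minority-class observations instead of duplicating existing ones.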

Details

The SelectionVarStat function performs a "quick" classification of mass-over-charge values. It tries to find all the mass-over-charge values (or the number of mass-over-charge values) that discriminate between categories. This can lead to the selection of "correlated" mass-over-charge values (i.e. values whose intensities vary similarly between categories). By default, multiple moderated t-tests using the limma R package (Bayesian regularization of variances) are performed, and the p-values are corrected using an adaptive Benjamini-Hochberg procedure to control the false discovery rate. Different methods for estimating the proportion of true null hypotheses (object pi0 returned by the function; see the cp4p R package for details) can be used for the adaptive Benjamini-Hochberg procedure ("abh" by default).
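The adaptive Benjamini-Hochberg correction can be sketched in base R as follows. This is a sketch under stated assumptions, not the package's or cp4p's actual code: the p-values are simulated, the pi0 estimate uses a simple Storey-type formula with a fixed `lambda` as a stand-in for the available pi0.method options, and the names `p_raw`, `pi0_hat`, `p_abh` and `nb_est` are hypothetical. Whether nb_to_sel is computed exactly as (1 - pi0) times the number of tested features is also an assumption.

```r
set.seed(7)
m <- 200
# hypothetical raw p-values: 150 true nulls (uniform) and 50 discriminant features (small)
p_raw <- c(runif(150), rbeta(50, 0.1, 10))

# stand-in pi0 estimate (Storey-type, fixed lambda); not necessarily what
# any of the pi0.method options computes
lambda <- 0.5
pi0_hat <- min(1, mean(p_raw > lambda) / (1 - lambda))

# adaptive BH: scale the BH-adjusted p-values by the estimated pi0
p_bh <- p.adjust(p_raw, method = "BH")
p_abh <- pmin(1, pi0_hat * p_bh)

# number of features retained at a 5 per cent FDR, without and with adaptation
n_bh <- sum(p_bh <= 0.05)
n_abh <- sum(p_abh <= 0.05)

# assumed estimate of the number of discriminant features
nb_est <- round((1 - pi0_hat) * m)
```

Because the estimated pi0 is at most 1, the adaptive adjusted p-values are never larger than the plain Benjamini-Hochberg ones, so the adaptive procedure retains at least as many features at the same threshold.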

Value

A list composed of:

nb_to_sel

a numeric value giving an estimate of the optimal number of mass-over-charge values needed to discriminate between the different groups.

sel_moz

a vector with selected discriminant mass-over-charge values.

ap

a list composed of pi0, the proportion of non-discriminant mass-over-charge values, and adjp, a matrix of raw p-values and corresponding adjusted p-values for all the mass-over-charge values that have been tested.

References

Gianetto, Q., Combes, F., Ramus, C., Bruley, C., Coute, Y., Burger, T. (2015). Technical Brief Calibration Plot for Proteomics (CP4P): A graphical tool to visually check the assumptions underlying FDR control in quantitative experiments. Proteomics, 16. doi:10.1002/pmic.201500189.

Examples


library("MSclassifR")
library("MALDIquant")

###############################################################################
## 1. Pre-processing of mass spectra

# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# apply() over rows returns the spectra in columns, so transpose back (spectra in rows) for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)

###############################################################################
## 2. Estimate the optimal number of peaks to discriminate the different species

OptiPeaks <- SelectionVarStat(X,
                              Y,
                              stat.test = "Limma",
                              pi0.method="abh",
                              fdr=0.05,
                              Sampling="smote")
             
## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter)
OptiPeaks$nb_to_sel

## discriminant mass-over-charge values estimated using a 5 per cent false discovery rate
OptiPeaks$sel_moz

## p-values and adjusted p-values estimated for all the tested mass-over-charge values
OptiPeaks$ap$adjp



[Package MSclassifR version 0.3.3 Index]