SelectionVarStat {MSclassifR} | R Documentation |
Variable selection using multiple statistical tests.
Description
This function performs a statistical test for each mass-over-charge value to determine which ones are discriminant between categories. From the distribution of the resulting p-values, it estimates an expected number of discriminant features and computes adjusted p-values that can be used to control the false discovery rate at a chosen threshold.
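The last step can be illustrated with a minimal base-R sketch, which shows how an adaptive Benjamini-Hochberg adjustment and an expected number of discriminant features follow from a vector of p-values once the proportion of true null hypotheses (pi0) is available. The simulated p-values, the fixed pi0 value and all variable names are assumptions made for this illustration; it is not the code used inside SelectionVarStat, which estimates pi0 from the data.

```r
# Simulated p-values (assumption): 80 true nulls (uniform) and 20 signals (near 0)
set.seed(1)
p <- c(runif(80), rbeta(20, 0.5, 50))
m <- length(p)

# Hypothetical proportion of true null hypotheses; in practice it is estimated
pi0 <- 0.8

# Classical Benjamini-Hochberg adjusted p-values
adjp_bh <- p.adjust(p, method = "BH")

# Adaptive version: BH adjusted p-values scaled by pi0 (capped at 1)
adjp_abh <- pmin(1, pi0 * adjp_bh)

# Expected number of discriminant features implied by pi0
nb_expected <- round(m * (1 - pi0))

# Indices of the features retained at a 5 per cent false discovery rate
selected <- which(adjp_abh <= 0.05)
```

Because pi0 is at most 1, the adaptive adjusted p-values are never larger than the classical ones, so the adaptive procedure retains at least as many features at the same threshold.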
Usage
SelectionVarStat(X,
                 Y,
                 stat.test = "Limma",
                 pi0.method = "abh",
                 fdr = 0.05,
                 Sampling = c("no", "up", "down", "smote"))
Arguments
X |
a |
Y |
a |
stat.test |
a |
pi0.method |
a |
fdr |
a |
Sampling |
a |
Details
The SelectionVarStat function allows a "quick" classification of mass-over-charge values. It tries to find all the mass-over-charge values (or the number of mass-over-charge values) that are discriminant between categories. This can lead to the selection of "correlated" mass-over-charge values (i.e. values associated with intensities that evolve similarly between categories). By default, multiple moderated t-tests are performed using the limma R package (Bayesian regularization of variances), and the p-values are corrected using an adaptive Benjamini-Hochberg procedure to control the false discovery rate. Different methods of estimating the proportion of true null hypotheses (object pi0 returned by the function; see the cp4p R package for details) can be used for the adaptive Benjamini-Hochberg procedure ("abh" by default).
Value
A list composed of:
nb_to_sel |
a |
sel_moz |
a |
ap |
a |
References
Gianetto, Q., Combes, F., Ramus, C., Bruley, C., Coute, Y., & Burger, T. (2015). Technical Brief Calibration Plot for Proteomics (CP4P): A graphical tool to visually check the assumptions underlying FDR control in quantitative experiments. Proteomics, 16. doi:10.1002/pmic.201500189.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- MSclassifR::SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, labels = CitrobacterRKImetadata$Strain_name_spot)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
# (apply() over rows returns the spectra in columns)
IntMat <- apply(IntMat, 1, function(x) x/(max(x)))
# transpose the matrix back (spectra in rows) for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Estimate the optimal number of peaks to discriminate the different species
OptiPeaks <- SelectionVarStat(X,
                              Y,
                              stat.test = "Limma",
                              pi0.method = "abh",
                              fdr = 0.05,
                              Sampling = "smote")
## Estimation of the optimal number of peaks to discriminate species (from the pi0 parameter)
OptiPeaks$nb_to_sel
## discriminant mass-over-charge values estimated using a 5 per cent false discovery rate
OptiPeaks$sel_moz
## p-values and adjusted p-values estimated for all the tested mass-over-charge values
OptiPeaks$ap$adjp