PredictLogReg {MSclassifR} | R Documentation
Prediction of the category to which a mass spectrum belongs from a multinomial logistic regression model
Description
This function predicts the category (species, phenotype, ...) to which a mass spectrum belongs, from a set of shortlisted mass-over-charge values of interest and an estimated multinomial logistic regression model (see LogReg).
Usage
PredictLogReg(peaks,
model,
moz,
tolerance = 6,
toleranceStep = 2,
normalizeFun = TRUE,
noMatch = 0,
Reference = NULL)
Arguments
peaks | a list of MassPeaks objects (from the MALDIquant package) corresponding to the mass spectra to classify.
model | a model or a list of models estimated from a set of shortlisted mass-over-charge values (output of the LogReg function).
moz | a vector with the shortlisted mass-over-charge values used to estimate the model(s).
tolerance | a numeric value corresponding to the accepted tolerance (in Da) to match peaks with the mass-over-charge values in moz. The default value is 6 Da.
toleranceStep | a numeric value added to the tolerance parameter when no m/z value of the peaks object matches the m/z values in moz. The default value is 2 Da.
normalizeFun | a logical value; if TRUE, the maximum intensity of each matched spectrum is fixed to 1 and the other intensities are scaled accordingly before prediction. The default value is TRUE.
noMatch | a numeric value used to replace intensity values when an m/z value in moz has no matching peak. The default value is 0.
Reference | a factor with the true categories of each mass spectrum in the peaks object; if not NULL, classification performance statistics are also returned. The default value is NULL.
Details
The PredictLogReg function predicts the category to which a mass spectrum belongs from a multinomial regression model. The mass spectra in the peaks object are matched to the discriminant mass-over-charge (m/z) values (the sel_moz object returned by the SelectionVar or SelectionVarStat functions) with a tolerance between two m/z values defined by the tolerance parameter (by default, 6 Da). If a same m/z value occurs several times in the selection, only the closest m/z value among the mass peaks (moz) is used. When there is no match, intensity values are replaced by the noMatch argument. If no m/z value of the peaks object matches any m/z value in moz, the tolerance is increased by the numeric value of the toleranceStep parameter and a warning is issued. Note that the SelectionVar function does not have to be performed prior to the PredictLogReg function: the moz argument can be replaced by all the m/z values present in a mass spectrum.
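The matching step described above can be sketched as follows. This is a simplified illustration, not the package's internal code: match_moz and its arguments are hypothetical, and the m/z and intensity values are made up.

```r
# Match a spectrum's peaks to selected m/z values within a tolerance (in Da).
# For each selected m/z, the closest peak within tolerance is kept;
# otherwise the intensity is replaced by noMatch.
match_moz <- function(peak_moz, peak_int, sel_moz, tolerance = 6, noMatch = 0) {
  sapply(sel_moz, function(m) {
    d <- abs(peak_moz - m)
    if (min(d) <= tolerance) peak_int[which.min(d)] else noMatch
  })
}

peak_moz <- c(3004.2, 4518.7, 6250.1)  # m/z of detected peaks
peak_int <- c(0.8, 0.3, 1.0)           # their intensities
# 3000 matches the peak at 3004.2 (distance 4.2 <= 6); 5000 has no match
match_moz(peak_moz, peak_int, sel_moz = c(3000, 5000), tolerance = 6)
```

With a larger tolerance (as obtained after adding toleranceStep), the selected value 5000 could eventually match a peak too.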
Value
Returns a data.frame containing the probabilities of membership by category for each mass spectrum in peaks. The method used is provided in the method column. The comb_fisher method merges the probabilities of membership of the used prediction models with Fisher's method. The max_vote method results from maximum voting across the used prediction models.
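The two combination rules can be sketched as follows. This is an illustrative interpretation, not the package's internal code: fisher_combine is a hypothetical helper applying the usual Fisher chi-square combination to the per-model membership probabilities, and the probability values are made up.

```r
# Combine per-model membership probabilities for one spectrum.
fisher_combine <- function(p) {
  # Fisher's method: -2 * sum(log p) follows a chi-square with 2k df
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# probabilities of one spectrum for 3 categories, from 2 models
probs <- rbind(model1 = c(A = 0.7, B = 0.2, C = 0.1),
               model2 = c(A = 0.6, B = 0.3, C = 0.1))

# comb_fisher-like rule: combine each category's probabilities across models
comb_fisher <- apply(probs, 2, fisher_combine)

# max_vote-like rule: each model votes for its most probable category
votes <- colnames(probs)[apply(probs, 1, which.max)]
max_vote <- names(which.max(table(votes)))
```

Here both rules favor category A: it keeps the largest combined value and receives both votes.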
If the Reference parameter is not NULL, the function also returns:
Confusion.Matrix | a list of confusion matrices comparing predicted and true categories (see the confusionMatrix function of the caret package) for each prediction method.
Gobal.stat | a data.frame with the overall statistics (such as accuracy and Kappa) of each prediction method.
Details.stat | a data.frame with the detailed statistics by category (such as sensitivity and specificity) of each prediction method.
Correct.ClassificationFreq | a data.frame with the frequencies of correctly classified mass spectra by prediction method.
Incorrect.ClassificationFreq | a data.frame with the frequencies of incorrectly classified mass spectra by prediction method.
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Selection of discriminant mass-over-charge values using RFERF
# with 5 to 10 variables,
# without sampling method and trained
# with the Kappa coefficient metric
a <- MSclassifR::SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10))
sel_moz=a$sel_moz
###############################################################################
## 3. Perform LogReg from shortlisted discriminant mass-over-charge values
# linear multinomial regression
# without sampling method and
# trained with the Kappa coefficient metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
Metric = "Kappa")
# Estimated model:
model_lm
# nonlinear multinomial regression using neural networks
# with up-sampling method and
# trained with the Kappa coefficient metric
model_nn=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="nnet",
Metric = "Kappa",
Sampling = "up")
# Estimated model:
model_nn
# nonlinear multinomial regression using random forests
# with down-sampling method and
# trained with the Kappa coefficient metric
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="rf",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_rf
# nonlinear multinomial regression using xgboost
# with down-sampling method and
# trained with the Kappa coefficient metric
model_xgb=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="xgb",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_xgb
# nonlinear multinomial regression using svm
# with down-sampling method and
# trained with the Kappa coefficient metric
model_svm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="svm",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_svm
# Of note, you can also load a model already saved
# (see example in LogReg function) for the next step
###############################################################################
## 4. Probabilities of belonging to each category for the mass spectra
## and associated statistics
# Collect all the estimated models in a list
Models <- list(model_lm$train_mod,
model_nn$train_mod,
model_rf$train_mod,
model_xgb$train_mod,
model_svm$train_mod)
# Predict classes of mass spectra with 6 Da of tolerance for matching peaks.
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[c(1:5)],
model = Models,
moz = sel_moz,
tolerance = 6,
Reference = Y[c(1:5)])
################################################################################
## 5. Example of meta-classifiers based on several random forest models
## to optimize a Kappa value using the SMOTE method for imbalanced datasets.
## -> merging the prediction probabilities using Fisher's method
## generally leads to robust prediction models.
#Selecting peaks with mda method
a=SelectionVar(X,Y,MethodSelection="mda",Ntree=5*ncol(X))
sel_moz=a$sel_moz
#Building 4 Random Forest models
models=NULL;nbm=4;
for (i in 1:nbm){
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=5,
repeats=5,
kind="rf",
Metric = "Kappa",
Sampling = "smote")
models <- c(models,list(model_rf$train_mod))
}
#Combining their prediction probabilities
prob_cat=MSclassifR::PredictLogReg(peaks = peaks,model = models,moz = sel_moz,
tolerance = 6,Reference = Y)