PredictLogReg {MSclassifR} | R Documentation
Prediction of the category to which a mass spectrum belongs from a multinomial logistic regression model
Description
This function predicts the category (species, phenotype, ...) to which a mass spectrum belongs, from a set of shortlisted mass-over-charge values of interest and an estimated multinomial logistic regression model (see LogReg).
Usage
PredictLogReg(peaks,
model,
moz,
tolerance = 6,
toleranceStep = 2,
normalizeFun = TRUE,
noMatch = 0,
Reference = NULL)
Arguments
peaks | a list of MassPeaks objects (from the MALDIquant package) corresponding to the mass spectra to classify.
model | a model or a list of models estimated from a set of shortlisted mass-over-charge values (output of the LogReg function).
moz | a vector with the shortlisted mass-over-charge values used to estimate the model(s).
tolerance | a numeric value corresponding to the accepted tolerance (in Da) to match peaks with the mass-over-charge values in moz. The default value is 6 Da.
toleranceStep | a numeric value added to the tolerance parameter when no m/z value of the peaks object matches the m/z values in moz. The default value is 2 Da.
normalizeFun | a logical value; if TRUE, the maximum intensity of each matched spectrum is fixed to 1 and the other intensities are scaled accordingly before prediction. The default value is TRUE.
noMatch | a numeric value used to replace intensity values when an m/z value in moz has no matching peak. The default value is 0.
Reference | a factor with the true categories of each mass spectrum in the peaks object; if not NULL, classification performance statistics are also returned. The default value is NULL.
Details
The PredictLogReg function predicts the category to which a mass spectrum belongs from a multinomial regression model. The mass spectra in the peaks object are matched to the discriminant mass-over-charge (m/z) values (the sel_moz object returned by the SelectionVar or SelectionVarStat functions) with a tolerance between two m/z values defined by the tolerance parameter (by default, 6 Da). If a same m/z value occurs several times in the selection, only the closest m/z value among the mass peaks (moz) is used. When there is no match, intensity values are replaced by the noMatch argument. If no m/z value of the peaks object matches any m/z value in moz, the tolerance is increased by the numeric value of the toleranceStep parameter and a warning is issued. Note that the SelectionVar function does not have to be performed prior to the PredictLogReg function: the moz argument can be replaced by all the m/z values present in a mass spectrum.
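The matching step described above can be sketched as follows. This is a simplified illustration, not the package's internal code: match_moz and its arguments are hypothetical, and the m/z and intensity values are made up.

```r
# Match a spectrum's peaks to selected m/z values within a tolerance (in Da).
# For each selected m/z, the closest peak within tolerance is kept;
# otherwise the intensity is replaced by noMatch.
match_moz <- function(peak_moz, peak_int, sel_moz, tolerance = 6, noMatch = 0) {
  sapply(sel_moz, function(m) {
    d <- abs(peak_moz - m)
    if (min(d) <= tolerance) peak_int[which.min(d)] else noMatch
  })
}

peak_moz <- c(3004.2, 4518.7, 6250.1)  # m/z of detected peaks
peak_int <- c(0.8, 0.3, 1.0)           # their intensities
# 3000 matches the peak at 3004.2 (distance 4.2 <= 6); 5000 has no match
match_moz(peak_moz, peak_int, sel_moz = c(3000, 5000), tolerance = 6)
```

With a larger tolerance (as obtained after adding toleranceStep), the selected value 5000 could eventually match a peak too.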
Value
Returns a data.frame containing the probabilities of membership by category for each mass spectrum in peaks. The method used is provided in the method column. The comb_fisher method merges the probabilities of membership of the used prediction models with Fisher's method. The max_vote method results from maximum voting across the used prediction models.
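The two combination rules can be sketched as follows. This is an illustrative interpretation, not the package's internal code: fisher_combine is a hypothetical helper applying the usual Fisher chi-square combination to the per-model membership probabilities, and the probability values are made up.

```r
# Combine per-model membership probabilities for one spectrum.
fisher_combine <- function(p) {
  # Fisher's method: -2 * sum(log p) follows a chi-square with 2k df
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}

# probabilities of one spectrum for 3 categories, from 2 models
probs <- rbind(model1 = c(A = 0.7, B = 0.2, C = 0.1),
               model2 = c(A = 0.6, B = 0.3, C = 0.1))

# comb_fisher-like rule: combine each category's probabilities across models
comb_fisher <- apply(probs, 2, fisher_combine)

# max_vote-like rule: each model votes for its most probable category
votes <- colnames(probs)[apply(probs, 1, which.max)]
max_vote <- names(which.max(table(votes)))
```

Here both rules favor category A: it keeps the largest combined value and receives both votes.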
If the Reference parameter is not NULL, the function also returns:
Confusion.Matrix | a list of confusion matrices comparing predicted and true categories (see the confusionMatrix function of the caret package) for each prediction method.
Gobal.stat | a data.frame with the overall statistics (such as accuracy and Kappa) of each prediction method.
Details.stat | a data.frame with the detailed statistics by category (such as sensitivity and specificity) of each prediction method.
Correct.ClassificationFreq | a data.frame with the frequencies of correctly classified mass spectra by prediction method.
Incorrect.ClassificationFreq | a data.frame with the frequencies of incorrectly classified mass spectra by prediction method.
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Selection of discriminant mass-over-charge values using RFERF
# with 5 to 10 variables,
# without sampling method and trained
# with the Kappa coefficient metric
a <- MSclassifR::SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Kappa",
Sizes = c(5:10))
sel_moz=a$sel_moz
###############################################################################
## 3. Perform LogReg from shortlisted discriminant mass-over-charge values
# linear multinomial regression
# without sampling method and
# trained with the Kappa coefficient metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
Metric = "Kappa")
# Estimated model:
model_lm
# nonlinear multinomial regression using neural networks
# with up-sampling method and
# trained with the Kappa coefficient metric
model_nn=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="nnet",
Metric = "Kappa",
Sampling = "up")
# Estimated model:
model_nn
# nonlinear multinomial regression using random forests
# with down-sampling method and
# trained with the Kappa coefficient metric
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="rf",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_rf
# nonlinear multinomial regression using xgboost
# with down-sampling method and
# trained with the Kappa coefficient metric
model_xgb=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="xgb",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_xgb
# nonlinear multinomial regression using svm
# with down-sampling method and
# trained with the Kappa coefficient metric
model_svm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="svm",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_svm
# Of note, you can also load a model already saved
# (see example in LogReg function) for the next step
###############################################################################
## 4. Probabilities of belonging to each category for the mass spectra
## and associated statistics
# Collect all the estimated models in a list
Models <- list(model_lm$train_mod,
model_nn$train_mod,
model_rf$train_mod,
model_xgb$train_mod,
model_svm$train_mod)
# Predict classes of mass spectra with 6 Da of tolerance for matching peaks.
prob_cat=MSclassifR::PredictLogReg(peaks = peaks[c(1:5)],
model = Models,
moz = sel_moz,
tolerance = 6,
Reference = Y[c(1:5)])
################################################################################
## 5. Example of meta-classifiers based on several random forest models
## to optimize a Kappa value using the SMOTE method for imbalanced datasets.
## -> merging the prediction probabilities using Fisher's method
## generally leads to robust prediction models.
#Selecting peaks with mda method
a=SelectionVar(X,Y,MethodSelection="mda",Ntree=5*ncol(X))
sel_moz=a$sel_moz
#Building 4 Random Forest models
models=NULL;nbm=4;
for (i in 1:nbm){
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=5,
repeats=5,
kind="rf",
Metric = "Kappa",
Sampling = "smote")
models <- c(models,list(model_rf$train_mod))
}
#Combining their prediction probabilities
prob_cat=MSclassifR::PredictLogReg(peaks = peaks,model = models,moz = sel_moz,
tolerance = 6,Reference = Y)