LogReg {MSclassifR} | R Documentation |
Estimation of a multinomial regression to predict the category to which a mass spectrum belongs
Description
This function estimates a multinomial regression using cross-validation to predict the category (species, phenotype, etc.) to which a mass spectrum belongs from a set of shortlisted mass-over-charge values corresponding to discriminant peaks. Two main kinds of models can be estimated: linear or nonlinear (with neural networks, random forests, support vector machines with linear kernel, or eXtreme Gradient Boosting). Hyperparameters are tuned by random search, except for eXtreme Gradient Boosting, for which a grid search is performed.
Usage
LogReg(X,
moz,
Y,
number = 2,
repeats = 2,
Metric = c("Kappa", "Accuracy", "F1", "AdjRankIndex", "MatthewsCorrelation"),
kind="linear",
Sampling = c("no", "up", "down", "smote"))
Arguments
X |
|
moz |
|
Y |
|
number |
|
Metric |
a |
repeats |
|
kind |
If |
Sampling |
a |
Details
This function estimates a model from a library of mass spectra whose categories are already known (e.g., species). The model can then be used to predict the category of a new spectrum whose category is unknown (see PredictLogReg).
The estimation is performed using the train function of the caret R package. For each kind of model, random hyperparameter values are tested to find the best model according to the chosen Metric. The formulas for the metrics are as follows:
Accuracy = Number Of Correct Predictions/Total Number Of Predictions
Kappa coefficient = (Observed Agreement-Chance Agreement)/(1-Chance Agreement)
F1 = True Positive/(True Positive + 1/2 (False Positive + False Negative))
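The three formulas above can be checked by hand on a small confusion matrix. The following base R sketch uses made-up labels and is only an illustration of the formulas; it is not the code that LogReg or caret runs internally, and the way F1 is aggregated over several categories is not specified here, so F1 is shown for a single category treated as "positive".

```r
# Toy example: actual and predicted categories for 10 spectra (made-up labels)
actual    <- factor(c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"))
predicted <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C", "A"),
                    levels = levels(actual))

# Confusion matrix: rows = predicted categories, columns = actual categories
cm <- table(Predicted = predicted, Actual = actual)
n  <- sum(cm)

# Accuracy = number of correct predictions / total number of predictions
accuracy <- sum(diag(cm)) / n

# Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
observed   <- accuracy
chance     <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_coef <- (observed - chance) / (1 - chance)

# F1 for category "A" treated as the positive class
tp   <- cm["A", "A"]
fp   <- sum(cm["A", ]) - tp   # predicted "A" but actually another category
fn   <- sum(cm[, "A"]) - tp   # actually "A" but predicted as another category
f1_A <- tp / (tp + 0.5 * (fp + fn))

c(Accuracy = accuracy, Kappa = kappa_coef, F1_A = f1_A)
```

Here 7 of the 10 predictions are correct, so the accuracy is 0.7, while Kappa is lower because part of that agreement is expected by chance.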
The adjusted Rand index ("AdjRankIndex") is defined as the corrected-for-chance version of the Rand index, which measures the agreement between two groupings (see the adjustedRandIndex() function of the mclust package for more details). The Matthews correlation coefficient ("MatthewsCorrelation") is estimated using the mcc function of the mltools R package.
The Sampling methods available for imbalanced data are: "up" for up-sampling, which randomly samples the minority class (with replacement) so that it reaches the size of the majority class; "down" for down-sampling, which randomly samples the majority class (without replacement) so that its frequency matches that of the minority class; "smote" for the Synthetic Minority Over-sampling Technique (SMOTE), a data augmentation algorithm that creates new data points for the minority class using the K-nearest-neighbor algorithm.
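The effect of the "up" and "down" options on class sizes can be sketched in base R as follows. This is an illustration of the sampling idea on made-up labels, under the assumption that original observations are kept when up-sampling; it is not the code used internally by LogReg, and SMOTE is not shown because it creates synthetic observations rather than resampling existing ones.

```r
set.seed(1)  # sampling is random; fix the seed for reproducibility

# Toy imbalanced labels: 6 spectra of category "A", 2 of category "B"
y <- factor(c(rep("A", 6), rep("B", 2)))
idx_by_class <- split(seq_along(y), y)   # row indices of each category
sizes        <- lengths(idx_by_class)

# Up-sampling: add rows drawn with replacement until each category
# reaches the size of the majority category
up_idx <- unlist(lapply(idx_by_class, function(i) {
  c(i, i[sample.int(length(i), max(sizes) - length(i), replace = TRUE)])
}))

# Down-sampling: keep rows drawn without replacement so that each category
# has the size of the minority category
down_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), min(sizes))]
}))

table(y[up_idx])    # both categories now have 6 rows
table(y[down_idx])  # both categories now have 2 rows
```

The same index vectors would be used to subset the rows of the intensity matrix X so that spectra and labels stay aligned.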
Value
Returns a list with four items:
train_mod |
a |
conf_mat |
a confusion matrix containing the percentages of predicted categories as a function of actual categories, resulting from repeated cross-validation. |
stats_global |
a |
boxplot |
a |
References
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of statistical software, 28(1), 1-26.
Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193-218.
Scrucca L, Fop M, Murphy TB, Raftery AE (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal.
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res. 16, 1.
Matthews, B. W. (1975). "Comparison of the predicted and observed secondary structure of T4 phage lysozyme". Biochimica et Biophysica Acta (BBA) - Protein Structure. PMID 1180967.
Examples
library("MSclassifR")
library("MALDIquant")
###############################################################################
## 1. Pre-processing of mass spectra
# load mass spectra and their metadata
data("CitrobacterRKIspectra","CitrobacterRKImetadata", package = "MSclassifR")
# standard pre-processing of mass spectra
spectra <- SignalProcessing(CitrobacterRKIspectra)
# detection of peaks in pre-processed mass spectra
peaks <- MSclassifR::PeakDetection(x = spectra, averageMassSpec=FALSE)
# matrix with intensities of peaks arranged in rows (each column is a mass-over-charge value)
IntMat <- MALDIquant::intensityMatrix(peaks)
rownames(IntMat) <- paste(CitrobacterRKImetadata$Strain_name_spot)
# remove missing values in the matrix
IntMat[is.na(IntMat)] <- 0
# normalize peaks according to the maximum intensity value for each mass spectrum
IntMat <- apply(IntMat,1,function(x) x/(max(x)))
# transpose the matrix for statistical analysis
X <- t(IntMat)
# define the known categories of mass spectra for the classification
Y <- factor(CitrobacterRKImetadata$Species)
###############################################################################
## 2. Selection of discriminant mass-over-charge values using RFERF
# with 2 to 5 variables (see Sizes below),
# up sampling method and
# trained with the Accuracy coefficient metric
a <- MSclassifR::SelectionVar(X,
Y,
MethodSelection = c("RFERF"),
MethodValidation = c("cv"),
PreProcessing = c("center","scale","nzv","corr"),
NumberCV = 2,
Metric = "Accuracy",
Sizes = c(2:5),
Sampling = "up")
sel_moz=a$sel_moz
###############################################################################
## 3. Perform LogReg from shortlisted discriminant mass-over-charge values
# linear multinomial regression
# without sampling method
# and trained with the Kappa coefficient metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
Metric = "Kappa")
# Estimated model:
model_lm
# nonlinear multinomial regression using neural networks
# with up-sampling method and
# trained with the Kappa coefficient metric
model_nn=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="nnet",
Metric = "Kappa",
Sampling = "up")
# Estimated model:
model_nn
# nonlinear multinomial regression using random forests
# with down-sampling method and
# trained with the Kappa coefficient metric
model_rf=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="rf",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_rf
# nonlinear multinomial regression using xgboost
# with down-sampling method and
# trained with the Kappa coefficient metric
model_xgb=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="xgb",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_xgb
# nonlinear multinomial regression using svm
# with down-sampling method and
# trained with the Kappa coefficient metric
model_svm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=2,
kind="svm",
Metric = "Kappa",
Sampling = "down")
# Estimated model:
model_svm
##########
# Of note, step 3 can be performed several times
# to find optimal models
# because of random hyperparameter search
###############################################################################
## 4. Select the best models in terms of average Kappa and save them for reuse
Kappa_model=c(model_lm$stats_global[1,2],model_nn$stats_global[1,2],
model_rf$stats_global[1,2],model_xgb$stats_global[1,2],model_svm$stats_global[1,2])
names(Kappa_model)=c("lm","nn","rf","xgb","svm")
#Best models in terms of Kappa
Kappa_model[which(Kappa_model==max(Kappa_model))]
#save best models for reuse
#models=list(model_lm$train_mod,model_nn$train_mod,model_rf$train_mod,
#model_xgb$train_mod,model_svm$train_mod)
#models_best=models[which(Kappa_model==max(Kappa_model))]
#for (i in seq_along(models_best)){
#model_best <- models_best[[i]] # save() needs a named object, not an expression
#save(model_best, file = paste0("model_best_",i,".rda"))
#}
#load a saved model
#load("model_best_1.rda")
###############################################################################
## 5. Try other metrics to select the best model
# linear multinomial regression
# with up-sampling method and
# trained with the adjusted Rand index metric
model_lm=MSclassifR::LogReg(X=X,
moz=sel_moz,
Y=factor(Y),
number=2,
repeats=3,
Metric = "AdjRankIndex",
Sampling = "up")