R: Cross model validation

MVA.cmv {RVAideMemoire}

R Documentation

Cross model validation

Description

Performs cross model validation (2CV) with different PLS analyses.

Usage

MVA.cmv(X, Y, repet = 10, kout = 7, kinn = 6, ncomp = 8, scale = TRUE,
  model = c("PLSR", "CPPLS", "PLS-DA", "PPLS-DA", "PLS-DA/LDA", "PLS-DA/QDA",
  "PPLS-DA/LDA", "PPLS-DA/QDA"), crit.inn = c("RMSEP", "Q2", "NMC"),
  Q2diff = 0.05, lower = 0.5, upper = 0.5, Y.add = NULL, weights = rep(1, nrow(X)),
  set.prior = FALSE, crit.DA = c("plug-in", "predictive", "debiased"), ...)

Arguments

`X`	a data frame of independent variables.
`Y`	the dependent variable(s): numeric vector, data frame of quantitative variables or factor.
`repet`	an integer giving the number of times the whole 2CV procedure has to be repeated.
`kout`	an integer giving the number of folds in the outer loop (can be re-set internally if needed).
`kinn`	an integer giving the number of folds in the inner loop (can be re-set internally if needed). Cannot be `> kout`.
`ncomp`	an integer giving the maximal number of components to be tested in the inner loop (can be re-set depending on the size of the train sets).
`scale`	logical indicating if data should be scaled (see Details).
`model`	the model to be fitted (see Details).
`crit.inn`	the criterion to be used to choose the number of components in the inner loop. Root Mean Square Error of Prediction (`"RMSEP"`, default) and Q2 (`"Q2"`) are only used for PLSR and CPPLS, whereas the Number of MisClassifications (`"NMC"`) is only used for discriminant analyses.
`Q2diff`	the threshold to be used if the number of components is chosen according to Q2. The next component is added only if it makes the Q2 increase more than `Q2diff` (5% by default).
`lower`	a vector of lower limits for power optimisation in CPPLS or PPLS-DA (see `cppls.fit`).
`upper`	a vector of upper limits for power optimisation in CPPLS or PPLS-DA (see `cppls.fit`).
`Y.add`	a vector or matrix of additional responses containing relevant information about the observations, in CPPLS or PPLS-DA (see `cppls.fit`).
`weights`	a vector of individual weights for the observations, in CPPLS or PPLS-DA (see `cppls.fit`).
`set.prior`	only used when a second analysis (LDA or QDA) is performed. If `TRUE`, the prior probabilities of class membership are defined according to the mean weight of individuals belonging to each class. If `FALSE`, prior probabilities are obtained from the data sets on which LDA/QDA models are built.
`crit.DA`	criterion used to predict class membership when a second analysis (LDA or QDA) is used. See `predict.lda`.
`...`	other arguments to pass to `plsr` (PLSR, PLS-DA) or `cppls` (CPPLS, PPLS-DA).
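
The `Q2diff` rule described above can be illustrated with a small sketch. The helper `choose_ncomp_q2` is hypothetical and is not part of RVAideMemoire; it assumes that the gain is measured as an absolute difference between the Q2 values of successive components, which may differ from the internal implementation.

```r
# Hypothetical helper (not part of RVAideMemoire): choose the number of
# components from a vector of Q2 values, one per number of components.
# Assumption: the gain is the absolute difference between successive Q2.
choose_ncomp_q2 <- function(q2, q2diff = 0.05) {
  n <- 1
  # Add the next component only while it raises Q2 by more than q2diff
  while (n < length(q2) && (q2[n + 1] - q2[n]) > q2diff) {
    n <- n + 1
  }
  n
}

# Made-up Q2 values for 1 to 5 components
q2 <- c(0.40, 0.55, 0.62, 0.63, 0.70)
ncomp_chosen <- choose_ncomp_q2(q2)
```

With these made-up values the fourth component adds only 0.01 to Q2, so the search stops at three components even though the fifth would raise Q2 again.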

Details

Cross model validation is detailed in Szymanska et al. (2012). Some more details about how this function works:

- when a discriminant analysis is used ("PLS-DA", "PPLS-DA", "PLS-DA/LDA", "PLS-DA/QDA", "PPLS-DA/LDA" or "PPLS-DA/QDA"), the training sets (test set itself in the inner loop, test+validation sets in the outer loop) are generated so as to respect the relative proportions of the levels of Y in the original data set (see splitf).

- "PLS-DA" is considered as PLS2 on a dummy-coded response. For a PLS-DA based on the CPPLS algorithm, use "PPLS-DA" with lower and upper limits of the power parameters set to 0.5.

- if a second analysis is used ("PLS-DA/LDA", "PLS-DA/QDA", "PPLS-DA/LDA" or "PPLS-DA/QDA"), an LDA or QDA is built on the scores of the first analysis (PLS-DA or PPLS-DA), including in the inner loop. The classification error rate, based on this second analysis, is used to choose the number of components.
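
The first two points above (folds that respect the proportions of the levels of Y, and a dummy-coded response for PLS-DA) can be sketched in base R. This is only an illustration under stated assumptions: `stratified_folds` is a hypothetical helper, not the `splitf` function of the package, and the data are made up.

```r
set.seed(1)

# Made-up factor with unbalanced levels (12 / 6 / 6 observations)
Y <- factor(rep(c("a", "b", "c"), times = c(12, 6, 6)))

# Hypothetical helper: give each observation a fold label from 1 to k,
# level by level, so that every fold keeps roughly the proportions of y.
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (lev in levels(y)) {
    idx <- which(y == lev)
    # Recycle the labels 1..k over this level, then shuffle them
    folds[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }
  folds
}

folds <- stratified_folds(Y, k = 3)
fold_table <- table(Y, folds)

# Dummy coding of the response: one 0/1 indicator column per level of Y
Ydummy <- model.matrix(~ Y - 1)
```

Each row of `Ydummy` contains a single 1 marking the level of the observation; a matrix of this kind is what a PLS2 regression is fitted on when PLS-DA is treated as PLS2 on a dummy-coded response.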

If scale = TRUE, scaling is done as follows:

- for each step of the outer loop (i.e. kout steps), the rest set is pre-processed by centering and unit-variance scaling. Means and standard deviations of variables in the rest set are then used to scale the test set.

- for each step of the inner loop (i.e. kinn steps), the training set is pre-processed by centering and unit-variance scaling. Means and standard deviations of variables in the training set are then used to scale the validation set.
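
The scaling scheme above can be sketched with base R's `scale`, here for one step of the outer loop. The data and the split are made up, and the sketch only mirrors the description; it is not taken from the package source.

```r
set.seed(42)

# Made-up data: 10 observations, 4 variables
X <- matrix(rnorm(40), nrow = 10, ncol = 4)

# Hypothetical split for one step of the outer loop
test_idx <- 1:3
rest <- X[-test_idx, , drop = FALSE]
test <- X[test_idx, , drop = FALSE]

# Center and scale the rest set to unit variance
rest_scaled <- scale(rest, center = TRUE, scale = TRUE)

# Reuse the means and standard deviations of the rest set on the test set
mu <- attr(rest_scaled, "scaled:center")
sdev <- attr(rest_scaled, "scaled:scale")
test_scaled <- scale(test, center = mu, scale = sdev)
```

Because the test set is scaled with the parameters of the rest set, no information from the test observations enters the pre-processing. The same pattern applies to the training and validation sets in the inner loop.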

Value

`model`	model used.
`type`	type of model used.
`repet`	number of times the whole 2CV procedure was repeated.
`kout`	number of folds in the outer loop.
`kinn`	number of folds in the inner loop.
`crit.inn`	criterion used to choose the number of components in the inner loop.
`crit.DA`	criterion used to classify individuals of the test and validation sets.
`Q2diff`	threshold used if the number of components is chosen according to Q2.
`groups`	levels of `Y` if it is a factor.
`models.list`	list of models generated (`repet*kout` models), for PLSR, CPPLS, PLS-DA and PPLS-DA.
`models1.list`	list of (P)PLS-DA models generated (`repet*kout` models), for PLS-DA/LDA, PLS-DA/QDA, PPLS-DA/LDA and PPLS-DA/QDA.
`models2.list`	list of LDA/QDA models generated (`repet*kout` models), for PLS-DA/LDA, PLS-DA/QDA, PPLS-DA/LDA and PPLS-DA/QDA.
`RMSEP`	RMSEP computed from the models used in the outer loops (`repet` values).
`Q2`	Q2 computed from the models used in the outer loops (`repet` values).
`NMC`	Classification error rate computed from the models used in the outer loops (`repet` values).
`confusion`	Confusion matrices computed from the models used in the outer loops (`repet` values).
`pred.prob`	Probability that each individual belongs to each level of `Y`.

Author(s)

Maxime HERVE <maxime.herve@univ-rennes1.fr>

References

Szymanska E, Saccenti E, Smilde AK and Westerhuis J (2012) Double-check: validation of diagnostic statistics for PLS-DA models in metabolomics studies. Metabolomics 8:S3-S16.

Examples

require(pls)
require(MASS)

# PLSR
data(yarn)
## Not run: MVA.cmv(yarn$NIR, yarn$density, model = "PLSR")

# PPLS-DA coupled to LDA
data(mayonnaise)
## Not run: MVA.cmv(mayonnaise$NIR, factor(mayonnaise$oil.type), model = "PPLS-DA/LDA", crit.inn = "NMC")

[Package RVAideMemoire version 0.9-83-7 Index]