multinom.spls.cv {plsgenomics}

R Documentation

Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1, lambda.ridge) for the multinomial-SPLS method

Description

The function multinom.spls.cv chooses the optimal values of the hyper-parameters of the multinom.spls procedure by minimizing the prediction error averaged over the cross-validation folds, across the hyper-parameter grid. It relies on the multinomial-SPLS algorithm of Durif et al. (2018).

Usage

multinom.spls.cv(
  X,
  Y,
  lambda.ridge.range,
  lambda.l1.range,
  ncomp.range,
  adapt = TRUE,
  maxIter = 100,
  svd.decompose = TRUE,
  return.grid = FALSE,
  ncores = 1,
  nfolds = 10,
  nrun = 1,
  center.X = TRUE,
  scale.X = FALSE,
  weighted.center = TRUE,
  seed = NULL,
  verbose = TRUE
)

Arguments

`X`	a (n x p) data matrix of predictors. `X` must be a matrix. Each row corresponds to an observation and each column to a predictor variable.
`Y`	a vector of length n of class labels. `Y` must be a vector or a one-column matrix. It contains the response variable for each observation. `Y` should take values in {0,...,nclass-1}, where nclass is the number of classes.
`lambda.ridge.range`	a vector of positive real values. `lambda.ridge` is the Ridge regularization parameter for the RIRLS algorithm (see details). The optimal value will be chosen among `lambda.ridge.range`.
`lambda.l1.range`	a vector of positive real values in [0,1]. `lambda.l1` is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details). The optimal value will be chosen among `lambda.l1.range`.
`ncomp.range`	a vector of positive integers. `ncomp` is the number of PLS components. The optimal value will be chosen among `ncomp.range`.
`adapt`	a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details).
`maxIter`	a positive integer, the maximal number of iterations in the RIRLS algorithm (see details).
`svd.decompose`	a boolean parameter. `svd.decompose` indicates whether or not the predictor matrix `Xtrain` should be decomposed by SVD (singular value decomposition) for the RIRLS step (see details).
`return.grid`	a boolean value indicating whether the grid of hyper-parameter values, with the corresponding mean prediction error rate over the folds, should be returned or not.
`ncores`	a positive integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).
`nfolds`	a positive integer indicating the number of folds in the K-folds cross-validation procedure, `nfolds=n` corresponds to the leave-one-out cross-validation, default is 10.
`nrun`	a positive integer indicating how many times the K-folds cross-validation procedure should be repeated, default is 1.
`center.X`	a boolean value indicating whether the data matrices `Xtrain` and `Xtest` (if provided) should be centered or not.
`scale.X`	a boolean value indicating whether the data matrices `Xtrain` and `Xtest` (if provided) should be scaled or not in the SPLS step (`scale.X=TRUE` implies `center.X=TRUE`).
`weighted.center`	a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step.
`seed`	a positive integer value (default is NULL). If non-NULL, the seed for pseudo-random number generation is set accordingly.
`verbose`	a boolean parameter indicating the verbosity.
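
The three range arguments together define the grid explored by the cross-validation. As a rough sketch of the computational cost (not taken from the package internals), the number of model fits can be bounded by the grid size times `nfolds` times `nrun`, assuming one fit per grid point, per fold and per run. The names `grid` and `n.fits` are illustrative only.

```r
# Illustrative ranges, with the same shapes as in the Examples section
lambda.ridge.range <- signif(10^seq(-2, 3, length.out = 21), digits = 3)
lambda.l1.range <- seq(0.05, 0.95, by = 0.1)
ncomp.range <- 1:10

# Every combination of the three hyper-parameters
grid <- expand.grid(lambda.ridge = lambda.ridge.range,
                    lambda.l1 = lambda.l1.range,
                    ncomp = ncomp.range)

nfolds <- 10
nrun <- 1

# Assumption: one fit per grid point, per fold, per run
n.fits <- nrow(grid) * nfolds * nrun
print(c(grid.size = nrow(grid), n.fits = n.fits))
```

A coarser grid, or fewer folds, reduces this count proportionally, which can help for a first exploratory run.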

Details

The columns of the data matrix X need not be standardized beforehand, since standardization is performed by the function multinom.spls.cv as a preliminary step.

The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarized as follows: the training set is partitioned into K folds; for each value of the hyper-parameters, the model is fit K times, each time holding out one fold to compute the prediction error rate and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameter values, meaning the ones that minimize the prediction error averaged over all the folds.
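
The K-fold scheme described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the helper `fit.predict` is a hypothetical nearest-centroid classifier standing in for multinom.spls, and its single hyper-parameter `k` (the number of leading columns used) stands in for the (ncomp, lambda.l1, lambda.ridge) grid.

```r
set.seed(1)
n <- 60; p <- 5; nclass <- 3
Y <- rep(0:(nclass - 1), each = n / nclass)
X <- matrix(rnorm(n * p), n, p)
X[, 1] <- X[, 1] + 3 * Y  # only the first column carries class information

# Hypothetical stand-in model: nearest class centroid on the first k columns
fit.predict <- function(Xtr, Ytr, Xte, k) {
  classes <- sort(unique(Ytr))
  centroids <- sapply(classes, function(cl) {
    colMeans(Xtr[Ytr == cl, 1:k, drop = FALSE])
  })
  centroids <- matrix(centroids, nrow = k)  # k x nclass
  d <- sapply(seq_along(classes), function(j) {
    ctr <- matrix(centroids[, j], nrow(Xte), k, byrow = TRUE)
    rowSums((Xte[, 1:k, drop = FALSE] - ctr)^2)
  })
  d <- matrix(d, nrow = nrow(Xte))  # squared distances to each centroid
  classes[apply(d, 1, which.min)]
}

nfolds <- 5
# Random partition of the observations into folds of equal size
fold.id <- sample(rep(seq_len(nfolds), length.out = n))
k.range <- 1:3

# For each hyper-parameter value: hold out each fold in turn,
# fit on the rest, and average the error rates over the folds
cv.error <- sapply(k.range, function(k) {
  fold.err <- sapply(seq_len(nfolds), function(f) {
    test <- fold.id == f
    pred <- fit.predict(X[!test, , drop = FALSE], Y[!test],
                        X[test, , drop = FALSE], k)
    mean(pred != Y[test])
  })
  mean(fold.err)
})

# Optimal value: the one with the smallest averaged error
k.opt <- k.range[which.min(cv.error)]
print(data.frame(k = k.range, cv.error = cv.error))
```

Setting `nfolds` equal to `n` in this sketch gives leave-one-out cross-validation, since every fold then contains a single observation.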

This procedure uses mclapply from the parallel package, available on GNU/Linux and macOS. Users of Microsoft Windows can refer to the README file in the source to be able to use an mclapply-type function.
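
As a minimal sketch of how mclapply distributes independent tasks (not the package's internal code), the example below runs a toy function over several inputs. The function `fold.task` is a hypothetical placeholder for "fit the model on one fold". Because mclapply relies on forking, the number of cores is set to 1 on Windows, where values above 1 are not supported.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Toy task standing in for one cross-validation fit (hypothetical)
fold.task <- function(f) f^2

# Each element of 1:4 is processed independently, possibly in parallel
res <- mclapply(1:4, fold.task, mc.cores = ncores)
print(unlist(res))
```

Results come back as a list in the same order as the inputs, regardless of which worker finished first.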

Value

An object of class multinom.spls with the following attributes

`lambda.ridge.opt`	the optimal value in `lambda.ridge.range`.
`lambda.l1.opt`	the optimal value in `lambda.l1.range`.
`ncomp.opt`	the optimal value in `ncomp.range`.
`conv.per`	the overall percentage of models that converge during the cross-validation procedure.
`cv.grid`	the grid of hyper-parameters and corresponding prediction error rate averaged over the folds. `cv.grid` is NULL if `return.grid` is set to FALSE.
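
When `return.grid=TRUE`, the grid can be inspected to see how strongly the error depends on each hyper-parameter. The sketch below uses a small made-up data frame: its values and its column names (`lambda.ridge`, `lambda.l1`, `ncomp`, `error`) are assumptions for illustration, and the actual layout of `cv.grid` should be checked with str() on the returned object.

```r
# Made-up grid; values and column names are illustrative only
toy.grid <- data.frame(
  lambda.ridge = c(0.1, 0.1, 1, 1),
  lambda.l1    = c(0.2, 0.8, 0.2, 0.8),
  ncomp        = c(1, 2, 1, 2),
  error        = c(0.30, 0.12, 0.25, 0.18)
)

# Row with the smallest averaged prediction error
best <- toy.grid[which.min(toy.grid$error), ]

# All rows, sorted from lowest to highest error
ranked <- toy.grid[order(toy.grid$error), ]
print(ranked)
```

Looking at the ranked rows rather than only the minimum shows whether several hyper-parameter settings perform almost equally well.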

Author(s)

Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).

References

Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.

Examples

## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
nclass <- 3
sample1 <- sample.multinom(n=n, p=p, nb.class=nclass, kstar=10, lstar=2, 
                           beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                           sigma.H=10, sigma.F=5)

X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 and 1000 for lambda.ridge.range
logspace <- function(d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1=-2, d2=3, n=21), digits=3)

### tuning the hyper-parameters
cv1 <- multinom.spls.cv(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range, 
                        lambda.l1.range=lambda.l1.range, 
                        ncomp.range=ncomp.range, 
                        adapt=TRUE, maxIter=100, svd.decompose=TRUE, 
                        return.grid=TRUE, ncores=1, nfolds=10)
                       
str(cv1)

## End(Not run)

[Package plsgenomics version 1.5-3 Index]