spls.cv {plsgenomics}

R Documentation

Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1) of the Adaptive Sparse PLS regression

Description

The function spls.cv chooses the optimal values of the hyper-parameters (ncomp, lambda.l1) of the spls procedure by minimizing the mean squared error of prediction over the hyper-parameter grid, using the adaptive SPLS algorithm of Durif et al. (2018).

Usage

spls.cv(
  X,
  Y,
  lambda.l1.range,
  ncomp.range,
  weight.mat = NULL,
  adapt = TRUE,
  center.X = TRUE,
  center.Y = TRUE,
  scale.X = TRUE,
  scale.Y = TRUE,
  weighted.center = FALSE,
  return.grid = FALSE,
  ncores = 1,
  nfolds = 10,
  nrun = 1,
  verbose = FALSE
)

Arguments

`X`	a (n x p) data matrix of predictors. `X` must be a matrix. Each row corresponds to an observation and each column to a predictor variable.
`Y`	a (n) vector of (continuous) responses. `Y` must be a vector or a one column matrix. It contains the response variable for each observation.
`lambda.l1.range`	a vector of positive real values, in [0,1]. `lambda.l1` is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details). The optimal value will be chosen among `lambda.l1.range`.
`ncomp.range`	a vector of positive integers. `ncomp` is the number of PLS components. The optimal value will be chosen among `ncomp.range`.
`weight.mat`	a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space. It can be the inverse covariance matrix of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to a homoskedastic model (`weight.mat` is the identity matrix).
`adapt`	a boolean value indicating whether the sparse PLS selection step should be adaptive or not (see details).
`center.X`	a boolean value indicating whether the data matrix `X` should be centered or not.
`center.Y`	a boolean value indicating whether the response values `Y` should be centered or not.
`scale.X`	a boolean value indicating whether the data matrix `X` should be scaled or not (`scale.X=TRUE` implies `center.X=TRUE`).
`scale.Y`	a boolean value indicating whether the response values `Y` should be scaled or not (`scale.Y=TRUE` implies `center.Y=TRUE`).
`weighted.center`	a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that `weight.mat` is not NULL).
`return.grid`	a boolean value indicating whether the grid of hyper-parameter values, with the corresponding mean prediction error rate over the folds, should be returned or not.
`ncores`	a positive integer indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).
`nfolds`	a positive integer indicating the number of folds in the K-fold cross-validation procedure. `nfolds=n` corresponds to leave-one-out cross-validation. Default is 10.
`nrun`	a positive integer indicating how many times the K-fold cross-validation procedure should be repeated. Default is 1.
`verbose`	a boolean value indicating verbosity.

Details

The columns of the data matrix `X` need not be standardized beforehand, since standardization can be performed by the function spls.cv as a preliminary step.
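As an illustration of what centering and scaling the columns of a predictor matrix means, here is a minimal sketch using base R's `scale()`. The matrix `X_demo` is made up for the example, and `scale()` is only a stand-in: this page does not state which routine or variance estimator spls.cv uses internally.

```r
# Illustrative sketch only: base R's scale() stands in for the
# preprocessing step; it is not taken from the plsgenomics source.
set.seed(1)
X_demo <- matrix(rnorm(20 * 3, mean = 5, sd = 2), nrow = 20, ncol = 3)

# Centering subtracts each column mean; scaling divides by each column sd.
X_std <- scale(X_demo, center = TRUE, scale = TRUE)

# After standardization, column means are numerically zero
# and column standard deviations are one.
round(colMeans(X_std), 10)
apply(X_std, 2, sd)
```

With `scale.X=TRUE` and `center.X=TRUE` (the defaults), there should be no need to apply such a step to `X` before calling spls.cv.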

The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarized as follows: the training set is partitioned into K folds; for each value of the hyper-parameters, the model is fit K times, each time holding out one fold to compute the prediction error and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameter values, meaning those that minimize the mean squared error of prediction averaged over all the folds.
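The K-fold scheme described above can be sketched in base R as follows. This is a generic illustration under stated assumptions, not the plsgenomics implementation: a closed-form ridge regression (`ridge_fit`, a hypothetical helper) stands in for the sparse PLS fit, the grid has a single hyper-parameter `lambda`, and all names (`kfold_cv`, `X_demo`, `y_demo`) are invented for the example.

```r
set.seed(42)
n <- 60
p <- 5
X_demo <- matrix(rnorm(n * p), n, p)
y_demo <- drop(X_demo %*% c(1, -1, 0.5, 0, 0)) + rnorm(n, sd = 0.5)

# Hypothetical stand-in model: ridge regression fitted in closed form.
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

kfold_cv <- function(X, y, lambda_grid, nfolds = 10) {
  # Randomly assign each observation to one of nfolds folds
  fold_id <- sample(rep(seq_len(nfolds), length.out = nrow(X)))
  cv_error <- sapply(lambda_grid, function(lambda) {
    fold_mse <- sapply(seq_len(nfolds), function(k) {
      test <- fold_id == k
      # Fit on the remaining observations, predict on the held-out fold
      beta <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda)
      pred <- X[test, , drop = FALSE] %*% beta
      mean((y[test] - pred)^2)
    })
    # Average the prediction error over all folds
    mean(fold_mse)
  })
  list(lambda_opt = lambda_grid[which.min(cv_error)],
       cv_grid = data.frame(lambda = lambda_grid, error = cv_error))
}

cv_demo <- kfold_cv(X_demo, y_demo,
                    lambda_grid = c(0.01, 0.1, 1, 10, 100), nfolds = 10)
cv_demo$lambda_opt
cv_demo$cv_grid
```

In spls.cv the grid is two-dimensional (every combination of `ncomp.range` and `lambda.l1.range`), and `nrun` repeats the whole partitioning step; the sketch keeps a single hyper-parameter and a single run for brevity.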

This procedure uses the mclapply function from the parallel package, available on GNU/Linux and macOS. Users of Microsoft Windows can refer to the README file in the source to be able to use an mclapply-type function.
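For readers unfamiliar with `mclapply`, the following sketch shows the general calling pattern with `mc.cores`. The task applied to each grid value is a placeholder, not the computation that spls.cv runs, and the single-core fallback on Windows reflects the fact that `mclapply` relies on forking; it is not the workaround described in the package README.

```r
library(parallel)

# mclapply relies on forking, so more than one core is only usable on
# Unix-alikes; fall back to a single core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Placeholder per-grid-point task standing in for one cross-validation fit
grid_values <- seq(0.05, 0.95, by = 0.1)
res <- mclapply(grid_values, function(v) v^2, mc.cores = n_cores)
unlist(res)
```

In spls.cv, the number of cores is controlled through the `ncores` argument rather than by calling `mclapply` directly.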

Value

An object with the following attributes

`lambda.l1.opt`	the optimal value in `lambda.l1.range`.
`ncomp.opt`	the optimal value in `ncomp.range`.
`cv.grid`	the grid of hyper-parameters and corresponding prediction error rate over the folds. `cv.grid` is NULL if `return.grid` is set to FALSE.

Author(s)

Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).

References

Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.

Examples

## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, 
                       beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                       sigma.H=10, sigma.F=5, sigma.E=5)
                       
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
cv1 <- spls.cv(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
               ncomp.range=ncomp.range, weight.mat=NULL, adapt=TRUE, 
               center.X=TRUE, center.Y=TRUE, 
               scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
               return.grid=TRUE, ncores=1, nfolds=10, nrun=1)
str(cv1)

### optimal values
cv1$lambda.l1.opt
cv1$ncomp.opt

## End(Not run)

[Package plsgenomics version 1.5-3 Index]