spls.cv {plsgenomics}R Documentation

Cross-validation procedure to calibrate the parameters (ncomp, lambda.l1) of the Adaptive Sparse PLS regression

Description

The function spls.cv chooses the optimal values for the hyper-parameter of the spls procedure, by minimizing the mean squared error of prediction over the hyper-parameter grid, using Durif et al. (2018) adaptive SPLS algorithm.

Usage

spls.cv(
  X,
  Y,
  lambda.l1.range,
  ncomp.range,
  weight.mat = NULL,
  adapt = TRUE,
  center.X = TRUE,
  center.Y = TRUE,
  scale.X = TRUE,
  scale.Y = TRUE,
  weighted.center = FALSE,
  return.grid = FALSE,
  ncores = 1,
  nfolds = 10,
  nrun = 1,
  verbose = FALSE
)

Arguments

X

a (n x p) data matrix of predictors. X must be a matrix. Each row corresponds to an observation and each column to a predictor variable.

Y

a (n) vector of (continuous) responses. Y must be a vector or a one column matrix. It contains the response variable for each observation.

lambda.l1.range

a vecor of positive real values, in [0,1]. lambda.l1 is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details), the optimal value will be chosen among lambda.l1.range.

ncomp.range

a vector of positive integers. ncomp is the number of PLS components. The optimal value will be chosen among ncomp.range.

weight.mat

a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space, it can be the covariance inverse of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to homoskedastic model (weight.mat is the identity matrix).

adapt

a boolean value, indicating whether the sparse PLS selection step sould be adaptive or not (see details).

center.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not.

center.Y

a boolean value indicating whether the response values Ytrain set should be centered or not.

scale.X

a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not (scale.X=TRUE implies center.X=TRUE).

scale.Y

a boolean value indicating whether the response values Ytrain should be scaled or not (scale.Y=TRUE implies center.Y=TRUE).

weighted.center

a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weighted.mat is non NULL).

return.grid

a boolean values indicating whether the grid of hyper-parameters values with corresponding mean prediction error rate over the folds should be returned or not.

ncores

a positve integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details).

nfolds

a positive integer indicating the number of folds in the K-folds cross-validation procedure, nfolds=n corresponds to the leave-one-out cross-validation, default is 10.

nrun

a positive integer indicating how many times the K-folds cross- validation procedure should be repeated, default is 1.

verbose

a boolean value indicating verbosity.

Details

The columns of the data matrices Xtrain and Xtest may not be standardized, since standardizing can be performed by the function spls.cv as a preliminary step.

The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarize as follow: the train set is partitioned into K folds, for each value of hyper-parameters the model is fit K times, using each fold to compute the prediction error rate, and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameters values, meaning the one that minimize the mean squared error of prediction averaged over all the folds.

This procedures uses the mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use a mclapply type function.

Value

An object with the following attributes

lambda.l1.opt

the optimal value in lambda.l1.range.

ncomp.opt

the optimal value in ncomp.range.

cv.grid

the grid of hyper-parameters and corresponding prediction error rate over the folds. cv.grid is NULL if return.grid is set to FALSE.

Author(s)

Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).

References

Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.

See Also

spls

Examples

## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, 
                       beta.min=0.25, beta.max=0.75, mean.H=0.2, 
                       sigma.H=10, sigma.F=5, sigma.E=5)
                       
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
cv1 <- spls.cv(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
               ncomp.range=ncomp.range, weight.mat=NULL, adapt=TRUE, 
               center.X=TRUE, center.Y=TRUE, 
               scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
               return.grid=TRUE, ncores=1, nfolds=10, nrun=1)
str(cv1)

### otpimal values
cv1$lambda.l1.opt
cv1$ncomp.opt

## End(Not run)


[Package plsgenomics version 1.5-3 Index]