spls {plsgenomics}

R Documentation

Adaptive Sparse Partial Least Squares (SPLS) regression

Description

The function spls performs compression and variable selection in the context of linear regression (with optional prediction on a test set) using the adaptive SPLS algorithm of Durif et al. (2018).

Usage

spls(
  Xtrain,
  Ytrain,
  lambda.l1,
  ncomp,
  weight.mat = NULL,
  Xtest = NULL,
  adapt = TRUE,
  center.X = TRUE,
  center.Y = TRUE,
  scale.X = TRUE,
  scale.Y = TRUE,
  weighted.center = FALSE
)

Arguments

`Xtrain`	a (ntrain x p) data matrix of predictor values. `Xtrain` must be a matrix. Each row corresponds to an observation and each column to a predictor variable.
`Ytrain`	a (ntrain) vector of (continuous) responses. `Ytrain` must be a vector or a one column matrix, and contains the response variable for each observation.
`lambda.l1`	a positive real value, in [0,1]. `lambda.l1` is the sparse penalty parameter for the dimension reduction step by sparse PLS (see details).
`ncomp`	a positive integer. `ncomp` is the number of PLS components.
`weight.mat`	a (ntrain x ntrain) matrix used to weight the l2 metric in the observation space. It can be the inverse of the covariance matrix of the Ytrain observations in a heteroskedastic context. If NULL, the l2 metric is the standard one, corresponding to a homoskedastic model (`weight.mat` is the identity matrix).
`Xtest`	a (ntest x p) matrix containing the predictor values for the test data set. `Xtest` may also be a vector of length p (corresponding to only one test observation). Default value is NULL, meaning that no prediction is performed.
`adapt`	a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details).
`center.X`	a boolean value indicating whether the data matrices `Xtrain` and `Xtest` (if provided) should be centered or not.
`center.Y`	a boolean value indicating whether the response values `Ytrain` set should be centered or not.
`scale.X`	a boolean value indicating whether the data matrices `Xtrain` and `Xtest` (if provided) should be scaled or not (`scale.X=TRUE` implies `center.X=TRUE`).
`scale.Y`	a boolean value indicating whether the response values `Ytrain` should be scaled or not (`scale.Y=TRUE` implies `center.Y=TRUE`).
`weighted.center`	a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that `weight.mat` is not NULL).
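
As an illustration of the `weight.mat` argument, the sketch below builds a diagonal weight matrix as the inverse covariance of the responses. It assumes independent observations with known, observation-specific noise variances. The values in `noise.var` are hypothetical and made up for the example; in practice they would have to be known or estimated.

```r
ntrain <- 5
# hypothetical noise variances, one per training observation (assumption)
noise.var <- c(0.5, 1, 2, 4, 8)
# covariance of Ytrain under independence: diagonal with the noise variances
cov.Y <- diag(noise.var)
# inverse covariance, usable as a (ntrain x ntrain) weight matrix
W <- solve(cov.Y)
# noisier observations receive smaller weights
diag(W)
```

Such a matrix could then be passed as `weight.mat = W`, together with `weighted.center = TRUE` if the centering should use the same metric.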

Details

The columns of the data matrices Xtrain and Xtest need not be standardized beforehand, since standardization can be performed by the function spls as a preliminary step.
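
The preliminary standardization can be sketched in base R as follows. This is an illustration, not the package's internal code: it assumes that the test set is centered and scaled with the training means and standard deviations, which is what the returned `meanXtrain` and `sigmaXtrain` values suggest. The toy matrices are made up for the example.

```r
set.seed(42)
Xtrain <- matrix(rnorm(20 * 3, mean = 5, sd = 2), nrow = 20, ncol = 3)
Xtest  <- matrix(rnorm(5 * 3, mean = 5, sd = 2), nrow = 5, ncol = 3)

# column means and standard deviations computed on the training set only
meanXtrain  <- colMeans(Xtrain)
sigmaXtrain <- apply(Xtrain, 2, sd)

# center and scale both sets with the training statistics (assumption)
sXtrain <- scale(Xtrain, center = meanXtrain, scale = sigmaXtrain)
sXtest  <- scale(Xtest,  center = meanXtrain, scale = sigmaXtrain)

# the standardized training columns have mean 0 and standard deviation 1
round(colMeans(sXtrain), 10)
apply(sXtrain, 2, sd)
```

Reusing the training statistics keeps the test observations on the same scale as the coefficients estimated on `sXtrain`.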

The procedure described in Durif et al. (2018) is used to compute latent sparse components that are used in a regression model. In addition, when a matrix Xtest is supplied, the procedure predicts the response associated to these new values of the predictors.
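
To give an intuition of what a sparse latent component is, the following base R sketch computes a single sparse PLS direction by soft-thresholding the covariance between predictors and response, in the spirit of Chun and Keles (2010). It is a simplified illustration under stated assumptions: one component, identity weight matrix, non-adaptive penalty, and a threshold expressed as a fraction `lambda.l1` of the largest absolute covariance. It is neither the adaptive algorithm of Durif et al. (2018) nor the code used by `spls`. The helper `soft.threshold` is a hypothetical name.

```r
set.seed(123)
n <- 50; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))
# toy data: only the first two predictors drive the response
y <- as.numeric(scale(2 * X[, 1] - 1.5 * X[, 2] + rnorm(n, sd = 0.5)))

# hypothetical helper: shrink towards zero and set small entries to exactly zero
soft.threshold <- function(z, thr) sign(z) * pmax(abs(z) - thr, 0)

lambda.l1 <- 0.5
z <- as.numeric(crossprod(X, y))                 # covariance direction X'y
w <- soft.threshold(z, lambda.l1 * max(abs(z)))  # sparse weight vector (assumed threshold)
w <- w / sqrt(sum(w^2))                          # normalize to unit length

A  <- which(w != 0)              # active set of selected predictors
t1 <- X %*% w                    # score: observation coordinates on the component
q1 <- sum(t1 * y) / sum(t1^2)    # regression coefficient of y on the component
betahat <- w * q1                # coefficients back in the predictor space
```

In this sketch, larger values of `lambda.l1` set more entries of `w` to zero, so fewer predictors enter the active set `A`.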

Value

An object of class spls with the following attributes

`Xtrain`	the ntrain x p predictor matrix.
`Ytrain`	the response observations.
`sXtrain`	the predictor matrix, centered and/or scaled if requested.
`sYtrain`	the response, centered and/or scaled if requested.
`betahat`	the linear coefficients in model `sYtrain = sXtrain %*% betahat + residuals`.
`betahat.nc`	the (p+1) vector containing the coefficients and intercept for the non centered and non scaled model `Ytrain = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc + residuals.nc`.
`meanXtrain`	the (p) vector of Xtrain column means, used for centering if requested.
`sigmaXtrain`	the (p) vector of Xtrain column standard deviations, used for scaling if requested.
`meanYtrain`	the mean of Ytrain, used for centering if requested.
`sigmaYtrain`	the standard deviation of Ytrain, used for scaling if requested.
`X.score`	a (ntrain x ncomp) matrix of observation coordinates (scores) in the new component basis produced by the compression step (sparse PLS). Each column t.k of `X.score` is an SPLS component.
`X.score.low`	a (ntrain x ncomp) matrix of the PLS components computed only with the selected predictors.
`X.loading`	the (ncomp x p) matrix of coefficients in regression of Xtrain over the new components `X.score`.
`Y.loading`	the (ncomp) vector of coefficients in regression of Ytrain over the SPLS components `X.score`.
`X.weight`	a (p x ncomp) matrix of the coefficients of the predictors in each component produced by sparse PLS. Each column w.k of `X.weight` satisfies t.k = Xtrain x w.k (as a matrix product).
`residuals`	the (ntrain) vector of residuals in the model `sYtrain = sXtrain %*% betahat + residuals`.
`residuals.nc`	the (ntrain) vector of residuals in the non centered and non scaled model `Ytrain = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc + residuals.nc`.
`hatY`	the (ntrain) vector containing the estimated response values on the train set of centered and scaled (if requested) predictors `sXtrain`, `hatY = sXtrain %*% betahat`.
`hatY.nc`	the (ntrain) vector containing the estimated response values on the train set of non centered and non scaled predictors `Xtrain`, `hatY.nc = cbind(rep(1,ntrain),Xtrain) %*% betahat.nc`.
`hatYtest`	the (ntest) vector containing the predicted values for the response on the centered and scaled test set `sXtest` (if provided), `hatYtest = sXtest %*% betahat`.
`hatYtest.nc`	the (ntest) vector containing the predicted values for the response on the non centered and non scaled test set `Xtest` (if provided), `hatYtest.nc = cbind(rep(1,ntest),Xtest) %*% betahat.nc`.
`A`	the active set of predictors selected by the procedure. `A` is a subset of `1:p`.
`betamat`	a (ncomp) list of coefficient vectors betahat in the model with `k` components, for `k=1,...,ncomp`.
`new2As`	a (ncomp) list of subsets of `(1:p)` indicating the variables that are selected when constructing component `k`, for `k=1,...,ncomp`.
`lambda.l1`	the sparse hyper-parameter used to fit the model.
`ncomp`	the number of components used to fit the model.
`V`	the (ntrain x ntrain) matrix used to weight the metric in the sparse PLS step.
`adapt`	a boolean value, indicating whether the sparse PLS selection step was adaptive or not.
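
The relation between `betahat` (standardized model) and `betahat.nc` (original scale) can be checked on a toy example. The sketch below uses ordinary least squares as a stand-in for the sparse PLS fit and assumes that both predictors and response were centered and scaled. The conversion formula is standard algebra, not taken from the package source.

```r
set.seed(7)
ntrain <- 30; p <- 3
Xtrain <- matrix(rnorm(ntrain * p, mean = 10, sd = 3), ntrain, p)
Ytrain <- as.numeric(5 + Xtrain %*% c(1, -2, 0.5) + rnorm(ntrain))

meanXtrain <- colMeans(Xtrain); sigmaXtrain <- apply(Xtrain, 2, sd)
meanYtrain <- mean(Ytrain);     sigmaYtrain <- sd(Ytrain)
sXtrain <- scale(Xtrain, center = meanXtrain, scale = sigmaXtrain)
sYtrain <- (Ytrain - meanYtrain) / sigmaYtrain

# stand-in estimator (ordinary least squares, no intercept) on standardized data
betahat <- as.numeric(solve(crossprod(sXtrain), crossprod(sXtrain, sYtrain)))

# back-transform: slopes are rescaled, the intercept absorbs the means
slopes <- betahat * sigmaYtrain / sigmaXtrain
intercept <- meanYtrain - sum(slopes * meanXtrain)
betahat.nc <- c(intercept, slopes)

# fitted values on both scales agree after undoing the response scaling
hatY    <- as.numeric(sXtrain %*% betahat)
hatY.nc <- as.numeric(cbind(rep(1, ntrain), Xtrain) %*% betahat.nc)
all.equal(hatY.nc, meanYtrain + sigmaYtrain * hatY)
```

The same identity explains why `hatYtest` has to be compared with a centered and scaled response, while `hatYtest.nc` can be compared with the raw response.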

Author(s)

Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).

Adapted in part from spls code by H. Chun, D. Chung and S. Keles (https://CRAN.R-project.org/package=spls).

References

Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.

Chun, H., Keles, S., 2010. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 3–25. doi:10.1111/j.1467-9868.2009.00723.x

Examples

### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, beta.min=0.25, 
                       beta.max=0.75, mean.H=0.2, sigma.H=10, 
                       sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y
### splitting between learning and testing set
index.train <- sort(sample(1:n, size=round(0.7*n)))
index.test <- (1:n)[-index.train]
Xtrain <- X[index.train,]
Ytrain <- Y[index.train,]
Xtest <- X[index.test,]
Ytest <- Y[index.test,]

### fitting the model, and predicting new observations
model1 <- spls(Xtrain=Xtrain, Ytrain=Ytrain, lambda.l1=0.5, ncomp=2, 
               weight.mat=NULL, Xtest=Xtest, adapt=TRUE, center.X=TRUE, 
               center.Y=TRUE, scale.X=TRUE, scale.Y=TRUE, 
               weighted.center=FALSE)

str(model1)

### plotting the estimation versus real values for the non centered response
plot(model1$Ytrain, model1$hatY.nc, 
     xlab="real Ytrain", ylab="Ytrain estimates")
points(-1000:1000,-1000:1000, type="l")

### plotting residuals versus centered response values
plot(model1$sYtrain, model1$residuals, xlab="sYtrain", ylab="residuals")

### plotting the predictor coefficients
plot(model1$betahat.nc, xlab="variable index", ylab="coeff")

### mean squared error of prediction on test sample
sYtest <- as.matrix(scale(Ytest, center=model1$meanYtrain, scale=model1$sigmaYtrain))
sum((model1$hatYtest - sYtest)^2) / length(index.test)

### plotting predicted values versus centered and scaled real response values 
## on the test set
plot(sYtest, model1$hatYtest, xlab="real sYtest", ylab="predicted values")
points(-1000:1000,-1000:1000, type="l")

[Package plsgenomics version 1.5-3 Index]