spls.stab {plsgenomics} | R Documentation |
Stability selection procedure to estimate probabilities of selection of covariates for the sparse PLS method
Description
The function spls.stab trains a sparse PLS model for each candidate pair
(ncomp, lambda.l1) of hyper-parameters on multiple sub-samplings of the data.
The stability selection procedure then retains the covariates that are
selected by most of the models across the grid of hyper-parameters, following
the procedure described in Durif et al. (2018). Candidate values for ncomp and
lambda.l1 are given by the input arguments ncomp.range and lambda.l1.range,
respectively.
Usage
spls.stab(
X,
Y,
lambda.l1.range,
ncomp.range,
weight.mat = NULL,
adapt = TRUE,
center.X = TRUE,
center.Y = TRUE,
scale.X = TRUE,
scale.Y = TRUE,
weighted.center = FALSE,
ncores = 1,
nresamp = 100,
seed = NULL,
verbose = TRUE
)
Arguments
X |
a (n x p) data matrix of predictors. |
Y |
a (n) vector of (continuous) responses. |
lambda.l1.range |
a vector of positive real values, in [0,1].
|
ncomp.range |
a vector of positive integers. |
weight.mat |
a (n x n) matrix used to weight the l2 metric
in the observation space. It can be the inverse covariance matrix of the Y
observations in a heteroskedastic context. If NULL, the l2 metric is the
standard one, corresponding to a homoskedastic model. |
adapt |
a boolean value indicating whether the sparse PLS selection step should be adaptive or not (see details). |
center.X |
a boolean value indicating whether the data matrix X should be centered. |
center.Y |
a boolean value indicating whether the response values Y should be centered. |
scale.X |
a boolean value indicating whether the data matrix X should be scaled. |
scale.Y |
a boolean value indicating whether the response values Y should be scaled. |
weighted.center |
a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, weight.mat must be non-NULL). |
ncores |
a positive integer indicating the number of cores that the resampling procedure is allowed to use for parallel computation (see details). |
nresamp |
number of resamplings of the data used to estimate the probability of selection for each covariate, default is 100. |
seed |
a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose |
a boolean parameter indicating the verbosity. |
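As an illustration of the weight.mat argument, the sketch below builds a diagonal inverse covariance matrix in base R. It assumes that a noise variance is known (or has been estimated beforehand) for each observation; the names sigma2 and W are hypothetical and not part of plsgenomics.

```r
# Hypothetical per-observation noise variances (assumption: known in advance)
n <- 5
sigma2 <- c(0.5, 1, 2, 1, 4)

# Inverse of a diagonal covariance matrix: each weight is 1 / variance
W <- diag(1 / sigma2, nrow = n)

# With equal variances the weights reduce to the identity (homoskedastic case)
W_homo <- diag(1, nrow = n)
```

A matrix such as W could then be passed as weight.mat = W, together with weighted.center = TRUE if the centering should also use the weighted metric.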
Details
The columns of the data matrix X need not be standardized,
since standardization is performed by the function spls.stab
as a preliminary step.
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (cf. Meinshausen and Bühlmann, 2010).
(i) For each candidate pair (ncomp, lambda.l1) of hyper-parameters, a sparse
PLS model is trained on nresamp resamplings of the data. Then, for each pair
(ncomp, lambda.l1), the probability that a covariate (i.e. a column of X) is
selected is estimated as its selection frequency over the resamplings.
(ii) Finally, the set of "stable selected" variables corresponds to the set of covariates that were selected by most of the trained models across the grid of candidate hyper-parameter values.
This function achieves the first step (i) of the stability selection
procedure. The second step (ii) is achieved by the function
stability.selection.
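Step (i) can be sketched in base R as follows. This is a minimal illustration, not the implementation used by spls.stab: the selector is a simple correlation threshold standing in for sparse PLS, only one hyper-parameter (lambda) is varied, and subsamples of size n/2 are assumed. The function names select_covariates and selection_probs are hypothetical.

```r
# Stand-in selector (NOT sparse PLS): keeps covariates whose absolute
# correlation with Y exceeds lambda times the largest absolute correlation.
select_covariates <- function(X, Y, lambda) {
  score <- abs(drop(cor(X, Y)))
  score > lambda * max(score)
}

# Estimate selection probabilities over resamplings for each candidate lambda
selection_probs <- function(X, Y, lambda.range, nresamp = 100, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(X)
  probs <- sapply(lambda.range, function(lambda) {
    sel <- replicate(nresamp, {
      idx <- sample(n, size = floor(n / 2))  # subsample half of the observations
      select_covariates(X[idx, , drop = FALSE], Y[idx], lambda)
    })
    rowMeans(sel)  # fraction of resamplings in which each covariate is selected
  })
  colnames(probs) <- paste0("lambda=", lambda.range)
  probs  # (p x length(lambda.range)) matrix of selection frequencies
}

# Simulated data: only the first two covariates carry signal
set.seed(1)
X <- matrix(rnorm(60 * 8), nrow = 60)
Y <- 2 * X[, 1] - 2 * X[, 2] + rnorm(60, sd = 0.5)
probs <- selection_probs(X, Y, lambda.range = c(0.3, 0.6), nresamp = 50, seed = 42)
round(probs, 2)
```

Each column of probs plays the role of the estimated selection probabilities for one candidate hyper-parameter value; step (ii) would then aggregate these frequencies across the grid.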
This procedure uses mclapply from the parallel package, which is
available on GNU/Linux and macOS. Users of Microsoft Windows can refer to
the README file in the source to be able to use an mclapply-type function.
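For illustration, one possible shape of such an mclapply-type wrapper is sketched below. It relies on the fact that mclapply on Windows only supports mc.cores = 1. The helper portable_mclapply is hypothetical and not part of plsgenomics, and the approach recommended in the README may differ.

```r
library(parallel)

# Hypothetical helper (not part of plsgenomics): fall back to a single core
# on Windows, where mclapply does not support mc.cores > 1.
portable_mclapply <- function(X, FUN, ..., ncores = 1) {
  if (.Platform$OS.type == "windows") ncores <- 1
  mclapply(X, FUN, ..., mc.cores = ncores)
}

res <- portable_mclapply(1:4, function(i) i^2, ncores = 2)
unlist(res)
```

Since spls.stab calls mclapply internally, the simplest option on Windows is likely to keep ncores = 1.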
Value
An object with the following attributes
q.Lambda |
A table with values of q.Lambda (cf. Durif et al. (2018) for the notation), being the average number of covariates selected over the entire grid of candidate hyper-parameter values, for increasing size of the hyper-parameter grid. |
probs.lambda |
A table with the estimated probability of selection for each covariate, depending on the candidate values of the hyper-parameters. |
p |
An integer value indicating the number of covariates in the model. |
Author(s)
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
References
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Meinshausen, N., Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 417–473.
See Also
spls, stability.selection, stability.selection.heatmap
Examples
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2,
                       beta.min=0.25, beta.max=0.75, mean.H=0.2,
                       sigma.H=10, sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y
### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
### estimating selection probabilities over the hyper-parameter grid
stab1 <- spls.stab(X=X, Y=Y, lambda.l1.range=lambda.l1.range,
                   ncomp.range=ncomp.range,
                   adapt=TRUE,
                   ncores=1, nresamp=100)
str(stab1)
### heatmap of estimated probabilities
stability.selection.heatmap(stab1)
### selected covariates
stability.selection(stab1, piThreshold=0.6, rhoError=10)
## End(Not run)