BiSelection {sharp} | R Documentation |
Stability selection of predictors and/or outcomes
Description
Performs stability selection for dimensionality reduction. The underlying variable selection algorithm (e.g. sparse PLS) is run with different combinations of parameters controlling the sparsity (e.g. number of selected variables per component) and thresholds in selection proportions. These hyper-parameters are jointly calibrated by maximisation of the stability score.
Usage
BiSelection(
xdata,
ydata = NULL,
group_x = NULL,
group_y = NULL,
LambdaX = NULL,
LambdaY = NULL,
AlphaX = NULL,
AlphaY = NULL,
ncomp = 1,
scale = TRUE,
pi_list = seq(0.01, 0.99, by = 0.01),
K = 100,
tau = 0.5,
seed = 1,
n_cat = NULL,
family = "gaussian",
implementation = SparsePLS,
resampling = "subsampling",
cpss = FALSE,
PFER_method = "MB",
PFER_thr = Inf,
FDP_thr = Inf,
n_cores = 1,
output_data = FALSE,
verbose = TRUE,
beep = NULL,
...
)
Arguments
xdata |
matrix of predictors with observations as rows and variables as columns. |
ydata |
optional vector or matrix of outcome(s). If |
group_x |
vector encoding the grouping structure among predictors. This
argument indicates the number of variables in each group. Only used for
models with group penalisation (e.g. |
group_y |
optional vector encoding the grouping structure among
outcomes. This argument indicates the number of variables in each group.
Only used if |
LambdaX |
matrix of parameters controlling the number of selected variables (for sparse PCA/PLS) or groups (for group and sparse group PLS) in X. |
LambdaY |
matrix of parameters controlling the number of selected
variables (for sparse PLS) or groups (for group or sparse group PLS) in Y.
Only used if |
AlphaX |
matrix of parameters controlling the level of sparsity within
groups in X. Only used if |
AlphaY |
matrix of parameters controlling the level of sparsity within
groups in X. Only used if |
ncomp |
number of components. |
scale |
logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one). |
pi_list |
vector of thresholds in selection proportions. If
|
K |
number of resampling iterations. |
tau |
subsample size. Only used if |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
family |
type of PLS model. This parameter must be set to
|
implementation |
function to use for feature selection. Possible
functions are: |
resampling |
resampling approach. Possible values are:
|
cpss |
logical indicating if complementary pair stability selection
should be done. For this, the algorithm is applied on two non-overlapping
subsets of half of the observations. A feature is considered as selected if
it is selected for both subsamples. With this method, the data is split
|
PFER_method |
method used to compute the upper-bound of the expected
number of False Positives (or Per Family Error Rate, PFER). If
|
PFER_thr |
threshold in PFER for constrained calibration by error
control. If |
FDP_thr |
threshold in the expected proportion of falsely selected
features (or False Discovery Proportion) for constrained calibration by
error control. If |
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
beep |
sound indicating the end of the run. Possible values are:
|
... |
additional parameters passed to the functions provided in
|
Details
In stability selection, a feature selection algorithm is fitted on
K
subsamples (or bootstrap samples) of the data with different
parameters controlling the sparsity (LambdaX
, LambdaY
,
AlphaX
, and/or AlphaY
). For a given (set of) sparsity
parameter(s), the proportion out of the K
models in which each
feature is selected is calculated. Features with selection proportions
above a threshold pi are considered stably selected. The stability
selection model is controlled by the sparsity parameter(s) (denoted by
\lambda
) for the underlying algorithm, and the threshold in selection
proportion:
V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}
For sparse and sparse group dimensionality reduction, "feature" refers to
variable (variable selection model). For group PLS, "feature" refers to
group (group selection model). For (sparse) group PLS, groups need to be
defined a priori and specified in arguments group_x
and/or
group_y
.
These parameters can be calibrated by maximisation of a stability score
(see ConsensusScore
if n_cat=NULL
or
StabilityScore
otherwise) calculated under the null
hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to
check that the grids of parameters Lambda
and pi_list
do not
restrict the calibration to a region that would not include the global
maximum (see CalibrationPlot
). In particular, the grid
Lambda
may need to be extended when the maximum stability is
observed on the left or right edges of the calibration heatmap. In some
instances, multiple peaks of stability score can be observed. Simulation
studies suggest that the peak corresponding to the largest number of
selected features tend to give better selection performances. This is not
necessarily the highest peak (which is automatically retained by the
functions in this package). The user can decide to manually choose another
peak.
To control the expected number of False Positives (Per Family Error Rate)
in the results, a threshold PFER_thr
can be specified. The
optimisation problem is then constrained to sets of parameters that
generate models with an upper-bound in PFER below PFER_thr
(see
Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
Possible resampling procedures include defining (i) K
subsamples of
a proportion tau
of the observations, (ii) K
bootstrap samples
with the full sample size (obtained with replacement), and (iii) K/2
splits of the data in half for complementary pair stability selection (see
arguments resampling
and cpss
). In complementary pair
stability selection, a feature is considered selected at a given resampling
iteration if it is selected in the two complementary subsamples.
For categorical outcomes (argument family
is "binomial"
or
"multinomial"
), the proportions of observations from each category
in all subsamples or bootstrap samples are the same as in the full sample.
To ensure reproducibility of the results, the starting number of the random
number generator is set to seed
.
For parallelisation, stability selection with different sets of parameters
can be run on n_cores
cores. Using n_cores > 1
creates a
multisession
.
Value
An object of class bi_selection
. A list with:
summary |
a
matrix of the best stability scores and corresponding parameters
controlling the level of sparsity in the underlying algorithm for different
numbers of components. Possible columns include: |
summary_full |
a matrix of the best stability scores for different combinations of parameters controlling the sparsity and components. |
selectedX |
a binary matrix encoding stably selected predictors. |
selpropX |
a matrix of calibrated selection proportions for predictors. |
selectedY |
a binary matrix encoding stably selected outcomes. Only returned for PLS models. |
selpropY |
a matrix of calibrated selection proportions for outcomes. Only returned for PLS models. |
selected |
a binary matrix encoding stable relationships between predictor and outcome variables. Only returned for PLS models. |
selectedX_full |
a binary matrix encoding stably selected predictors. |
selpropX_full |
a matrix of selection proportions for predictors. |
selectedY_full |
a binary matrix encoding stably selected outcomes. Only returned for PLS models. |
selpropY_full |
a matrix of selection proportions for outcomes. Only returned for PLS models. |
coefX |
an
array of estimated loadings coefficients for the different components
(rows), for the predictors (columns), as obtained across the |
coefY |
an array of
estimated loadings coefficients for the different components (rows), for
the outcomes (columns), as obtained across the |
method |
a
list with |
params |
a list with values used
for arguments |
The rows of
summary
and columns of selectedX
, selectedY
,
selpropX
, selpropY
, selected
, coefX
and
coefY
are ordered in the same way and correspond to components and
parameter values stored in summary
. The rows of summary_full
and columns of selectedX_full
, selectedY_full
,
selpropX_full
and selpropY_full
are ordered in the same way
and correspond to components and parameter values stored in
summary_full
.
References
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016). “Group and sparse group partial least square approaches applied in genomics context.” Bioinformatics, 32(1), 35-42. ISSN 1367-4803, doi:10.1093/bioinformatics/btv535.
KA LC, Rossouw D, Robert-Granié C, Besse P (2008). “A sparse PLS for variable selection when integrating omics data.” Stat Appl Genet Mol Biol, 7(1), Article 35. ISSN 1544-6115, doi:10.2202/1544-6115.1390.
Shen H, Huang JZ (2008). “Sparse principal component analysis via regularized low rank matrix approximation.” Journal of Multivariate Analysis, 99(6), 1015-1034. ISSN 0047-259X, doi:10.1016/j.jmva.2007.06.007.
Zou H, Hastie T, Tibshirani R (2006). “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics, 15(2), 265-286. doi:10.1198/106186006X113430.
See Also
SparsePCA
, SparsePLS
,
GroupPLS
, SparseGroupPLS
,
VariableSelection
, Resample
,
StabilityScore
Other stability functions:
Clustering()
,
GraphicalModel()
,
StructuralModel()
,
VariableSelection()
Examples
if (requireNamespace("sgPLS", quietly = TRUE)) {
oldpar <- par(no.readonly = TRUE)
par(mar = c(12, 5, 1, 1))
## Sparse Principal Component Analysis
# Data simulation
set.seed(1)
simul <- SimulateComponents(pk = c(5, 3, 4))
# sPCA: sparsity on X (unsupervised)
stab <- BiSelection(
xdata = simul$data,
ncomp = 2,
LambdaX = seq_len(ncol(simul$data) - 1),
implementation = SparsePCA
)
print(stab)
# Calibration plot
CalibrationPlot(stab)
# Visualisation of the results
summary(stab)
plot(stab)
SelectedVariables(stab)
## Sparse (Group) Partial Least Squares
# Data simulation (continuous outcomes)
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian")
x <- simul$xdata
y <- simul$ydata
# sPLS: sparsity on X
stab <- BiSelection(
xdata = x, ydata = y,
family = "gaussian", ncomp = 3,
LambdaX = seq_len(ncol(x) - 1),
implementation = SparsePLS
)
CalibrationPlot(stab)
summary(stab)
plot(stab)
# sPLS: sparsity on both X and Y
stab <- BiSelection(
xdata = x, ydata = y,
family = "gaussian", ncomp = 3,
LambdaX = seq_len(ncol(x) - 1),
LambdaY = seq_len(ncol(y) - 1),
implementation = SparsePLS,
n_cat = 2
)
CalibrationPlot(stab)
summary(stab)
plot(stab)
# sgPLS: sparsity on X
stab <- BiSelection(
xdata = x, ydata = y, K = 10,
group_x = c(2, 8, 5),
family = "gaussian", ncomp = 3,
LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1),
implementation = SparseGroupPLS
)
CalibrationPlot(stab)
summary(stab)
par(oldpar)
}