VariableSelection {sharp} | R Documentation
Stability selection in regression
Description
Performs stability selection for regression models. The underlying variable selection algorithm (e.g. LASSO regression) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.
Usage
VariableSelection(
xdata,
ydata = NULL,
Lambda = NULL,
pi_list = seq(0.01, 0.99, by = 0.01),
K = 100,
tau = 0.5,
seed = 1,
n_cat = NULL,
family = "gaussian",
implementation = PenalisedRegression,
resampling = "subsampling",
cpss = FALSE,
PFER_method = "MB",
PFER_thr = Inf,
FDP_thr = Inf,
Lambda_cardinal = 100,
group_x = NULL,
group_penalisation = FALSE,
optimisation = c("grid_search", "nloptr"),
n_cores = 1,
output_data = FALSE,
verbose = TRUE,
beep = NULL,
...
)
Arguments
xdata: matrix of predictors with observations as rows and variables as columns.
ydata: optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.
Lambda: matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in implementation. If Lambda=NULL and implementation=PenalisedRegression, LambdaGridRegression is used to define a relevant grid.
pi_list: vector of thresholds in selection proportions. If n_cat=NULL or n_cat=2, these values must be >0 and <1. If n_cat=3, these values must be >0.5 and <1.
K: number of resampling iterations.
tau: subsample size. Only used if resampling="subsampling" and cpss=FALSE.
seed: value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).
n_cat: computation options for the stability score. Default is NULL to use the score based on a z test. Other possible values are 2 or 3 to use the score based on the negative log-likelihood.
family: type of regression model. This argument is defined as in glmnet. Possible values include "gaussian" (linear regression), "binomial" (logistic regression), "multinomial" (multinomial regression), and "cox" (survival analysis).
implementation: function to use for variable selection. Possible functions are: PenalisedRegression, SparsePLS, GroupPLS and SparseGroupPLS. Alternatively, a user-defined function can be provided.
resampling: resampling approach. Possible values are: "subsampling" for sampling without replacement of a proportion tau of the observations, or "bootstrap" for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset (see the sketch after this argument list).
cpss: logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered as selected if it is selected for both subsamples. With this method, the data is split K/2 times (K models are fitted). Only used if PFER_method="MB".
PFER_method: method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If PFER_method="MB", the method proposed by Meinshausen and Bühlmann (2010) is used. If PFER_method="SS", the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.
PFER_thr: threshold in PFER for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
FDP_thr: threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).
Lambda_cardinal: number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. Only used if Lambda=NULL.
group_x: vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. Only used for models with group penalisation (e.g. implementation=GroupPLS or implementation=SparseGroupPLS).
group_penalisation: logical indicating if a group penalisation should be considered in the stability score. The use of group_penalisation=TRUE strictly applies to group (not sparse-group) penalisation.
optimisation: character string indicating the type of optimisation method. With optimisation="grid_search" (the default), all values in Lambda are visited. Alternatively, optimisation algorithms implemented in nloptr can be used with optimisation="nloptr".
n_cores: number of cores to use for parallel computing (see argument workers in multisession). Using n_cores > 1 is only supported with optimisation="grid_search".
output_data: logical indicating if the input datasets xdata and ydata should be included in the output.
verbose: logical indicating if a loading bar and messages should be printed.
beep: sound indicating the end of the run. Possible values are: NULL (no sound) or an integer between 1 and 11 (see argument sound in beep from the beepr package).
...: additional parameters passed to the functions provided in implementation or resampling.
Details
In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm and by the threshold in selection proportion:

V_{\lambda, \pi} = \{ j : p_{\lambda}(j) \ge \pi \}
If argument group_penalisation=FALSE, "feature" refers to variable (variable selection model). If argument group_penalisation=TRUE, "feature" refers to group (group selection model). In this case, groups need to be defined a priori and specified in argument group_x.
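As a concrete illustration, the stable set V_{\lambda, \pi} can be recovered by thresholding the selection proportions of the fitted object. A minimal sketch, assuming the second column of Argmax() holds the calibrated threshold in selection proportion:

set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")
selprop <- SelectionProportions(stab) # p_lambda(j) at the calibrated lambda
hat_pi <- Argmax(stab)[1, 2] # calibrated threshold pi (assumed column layout)
stable_set <- names(selprop)[selprop >= hat_pi]
# Consistent with the built-in extraction:
identical(sort(stable_set), sort(names(which(SelectedVariables(stab) == 1))))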
These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL, or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.
It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that excludes the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edge of the calibration heatmap. In some instances, multiple peaks of the stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak (which is the one automatically retained by the functions in this package). The user can decide to manually choose another peak.
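For example, if the maximum lies on an edge of the heatmap, a wider user-defined grid of penalty parameters can be supplied via Lambda. A sketch reusing simul from above; the grid values are purely illustrative and would need to be adapted to the data:

# Supplying a manually extended penalty grid (illustrative values)
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, family = "gaussian",
  Lambda = cbind(seq(0.001, 1, length.out = 100))
)
CalibrationPlot(stab)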
To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
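As an illustration, the sketch below constrains calibration to models whose PFER upper-bound, computed with the Shah and Samworth (2013) bound, stays below 5 (an arbitrary threshold; reusing simul from above):

stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, family = "gaussian",
  PFER_thr = 5, PFER_method = "SS"
)
summary(stab)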
Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss), as illustrated below. In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
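These three procedures correspond to the following calls (reusing simul from above; tau = 0.8 is arbitrary):

# (i) Subsamples of 80% of the observations
stab_sub <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, tau = 0.8)
# (ii) Bootstrap samples of the full sample size
stab_boot <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, resampling = "bootstrap")
# (iii) Complementary pair stability selection
stab_cpss <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, cpss = TRUE)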
For categorical or time-to-event outcomes (argument family is "binomial", "multinomial" or "cox"), the proportions of observations from each category in all subsamples or bootstrap samples are the same as in the full sample.
To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine, as sketched below.
Value
An object of class variable_selection. A list with:
S: a matrix of the best stability scores for different parameters controlling the level of sparsity in the underlying algorithm.
Lambda: a matrix of parameters controlling the level of sparsity in the underlying algorithm.
Q: a matrix of the average number of features selected by the underlying algorithm with different parameters controlling the level of sparsity.
Q_s: a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity.
P: a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm.
PFER: a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity.
FDP: a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity.
S_2d: a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
PFER_2d: a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
FDP_2d: a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
selprop: a matrix of selection proportions. Columns correspond to predictors from xdata.
Beta: an array of model coefficients. Columns correspond to predictors from xdata. Indices along the third dimension correspond to different resampling iterations.
method: a list with type="variable_selection" and values used for arguments implementation, family, resampling, cpss and PFER_method.
params: a list with values used for arguments K, pi_list, tau, n_cat, PFER_thr, FDP_thr and seed. The datasets xdata and ydata are also included if output_data=TRUE.
For all matrices and arrays returned, the rows are ordered in the same way and correspond to parameter values stored in Lambda.
References
Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.
Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.
Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.
Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. ISSN 00359246, http://www.jstor.org/stable/2346178.
See Also
PenalisedRegression, SelectionAlgo, LambdaGridRegression, Resample, StabilityScore, Refit, ExplanatoryPerformance, Incremental
Other stability functions: BiSelection(), Clustering(), GraphicalModel(), StructuralModel()
Examples
oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))
# Linear regression
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "gaussian"
)
# Calibration plot
CalibrationPlot(stab)
# Extracting the results
summary(stab)
Stable(stab)
SelectionProportions(stab)
plot(stab)
# Using randomised LASSO
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "gaussian", penalisation = "randomised"
)
plot(stab)
# Using adaptive LASSO
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "gaussian", penalisation = "adaptive"
)
plot(stab)
# Using additional arguments from glmnet (e.g. penalty.factor)
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata, family = "gaussian",
penalty.factor = c(rep(1, 45), rep(0, 5))
)
head(coef(stab))
# Using CART
if (requireNamespace("rpart", quietly = TRUE)) {
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
implementation = CART,
family = "gaussian",
)
plot(stab)
}
# Regression with multivariate outcomes
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian")
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "mgaussian"
)
summary(stab)
# Logistic regression
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 10, family = "binomial", ev_xy = 0.8)
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "binomial"
)
summary(stab)
# Sparse PCA (1 component, see BiSelection for more components)
if (requireNamespace("elasticnet", quietly = TRUE)) {
set.seed(1)
simul <- SimulateComponents(pk = c(5, 3, 4))
stab <- VariableSelection(
xdata = simul$data,
Lambda = seq_len(ncol(simul$data) - 1),
implementation = SparsePCA
)
CalibrationPlot(stab, xlab = "")
summary(stab)
}
# Sparse PLS (1 outcome, 1 component, see BiSelection for more options)
if (requireNamespace("sgPLS", quietly = TRUE)) {
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
Lambda = seq_len(ncol(simul$xdata) - 1),
implementation = SparsePLS, family = "gaussian"
)
CalibrationPlot(stab, xlab = "")
SelectedVariables(stab)
}
# Group PLS (1 outcome, 1 component, see BiSelection for more options)
if (requireNamespace("sgPLS", quietly = TRUE)) {
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
Lambda = seq_len(5),
group_x = c(5, 5, 10, 20, 10),
group_penalisation = TRUE,
implementation = GroupPLS, family = "gaussian"
)
CalibrationPlot(stab, xlab = "")
SelectedVariables(stab)
}
# Example with more hyper-parameters: elastic net
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
TuneElasticNet <- function(xdata, ydata, family, alpha) {
stab <- VariableSelection(
xdata = xdata, ydata = ydata,
family = family, alpha = alpha, verbose = FALSE
)
return(max(stab$S, na.rm = TRUE))
}
myopt <- optimise(TuneElasticNet,
lower = 0.1, upper = 1, maximum = TRUE,
xdata = simul$xdata, ydata = simul$ydata,
family = "gaussian"
)
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
family = "gaussian", alpha = myopt$maximum
)
summary(stab)
enet <- SelectedVariables(stab)
# Comparison with LASSO
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")
summary(stab)
lasso <- SelectedVariables(stab)
table(lasso, enet)
# Example using an external function: group-LASSO with gglasso
if (requireNamespace("gglasso", quietly = TRUE)) {
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")
ManualGridGroupLasso <- function(xdata, ydata, family, group_x, ...) {
# Defining the grouping
group <- do.call(c, lapply(seq_along(group_x), FUN = function(i) {
rep(i, group_x[i])
}))
if (family == "binomial") {
ytmp <- ydata
ytmp[ytmp == min(ytmp)] <- -1
ytmp[ytmp == max(ytmp)] <- 1
return(gglasso::gglasso(xdata, ytmp, loss = "logit", group = group, ...))
} else {
return(gglasso::gglasso(xdata, ydata, loss = "ls", group = group, ...))
}
}
Lambda <- LambdaGridRegression(
xdata = simul$xdata, ydata = simul$ydata,
family = "binomial", Lambda_cardinal = 20,
implementation = ManualGridGroupLasso,
group_x = rep(5, 4)
)
GroupLasso <- function(xdata, ydata, Lambda, family, group_x, ...) {
# Defining the grouping
group <- do.call(c, lapply(seq_along(group_x), FUN = function(i) {
rep(i, group_x[i])
}))
# Running the regression
if (family == "binomial") {
ytmp <- ydata
ytmp[ytmp == min(ytmp)] <- -1
ytmp[ytmp == max(ytmp)] <- 1
mymodel <- gglasso::gglasso(xdata, ytmp, lambda = Lambda, loss = "logit", group = group, ...)
}
if (family == "gaussian") {
mymodel <- gglasso::gglasso(xdata, ydata, lambda = Lambda, loss = "ls", group = group, ...)
}
# Extracting and formatting the beta coefficients
beta_full <- t(as.matrix(mymodel$beta))
beta_full <- beta_full[, colnames(xdata)]
selected <- ifelse(beta_full != 0, yes = 1, no = 0)
return(list(selected = selected, beta_full = beta_full))
}
stab <- VariableSelection(
xdata = simul$xdata, ydata = simul$ydata,
implementation = GroupLasso, family = "binomial", Lambda = Lambda,
group_x = rep(5, 4),
group_penalisation = TRUE
)
summary(stab)
}
par(oldpar)