VariableSelection {sharp}    R Documentation

Stability selection in regression

Description

Performs stability selection for regression models. The underlying variable selection algorithm (e.g. LASSO regression) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.

Usage

VariableSelection(
  xdata,
  ydata = NULL,
  Lambda = NULL,
  pi_list = seq(0.01, 0.99, by = 0.01),
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = NULL,
  family = "gaussian",
  implementation = PenalisedRegression,
  resampling = "subsampling",
  cpss = FALSE,
  PFER_method = "MB",
  PFER_thr = Inf,
  FDP_thr = Inf,
  Lambda_cardinal = 100,
  group_x = NULL,
  group_penalisation = FALSE,
  optimisation = c("grid_search", "nloptr"),
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)

Arguments

xdata

matrix of predictors with observations as rows and variables as columns.

ydata

optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.

Lambda

matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in implementation. If Lambda=NULL and implementation=PenalisedRegression, LambdaGridRegression is used to define a relevant grid.
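
For example, a grid can be built explicitly and supplied to the function (a minimal sketch using the same helper as the internal default; xdata and ydata stand for the user's data):

Lambda <- LambdaGridRegression(
  xdata = xdata, ydata = ydata,
  family = "gaussian", Lambda_cardinal = 50
)
stab <- VariableSelection(xdata = xdata, ydata = ydata, Lambda = Lambda)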

pi_list

vector of thresholds in selection proportions. If n_cat=NULL or n_cat=2, these values must be >0 and <1. If n_cat=3, these values must be >0.5 and <1.

K

number of resampling iterations.

tau

subsample size. Only used if resampling="subsampling" and cpss=FALSE.

seed

value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).

n_cat

computation options for the stability score. Default is NULL to use the score based on a z test. Other possible values are 2 or 3 to use the score based on the negative log-likelihood.

family

type of regression model. This argument is defined as in glmnet. Possible values include "gaussian" (linear regression), "binomial" (logistic regression), "multinomial" (multinomial regression), and "cox" (survival analysis).

implementation

function to use for variable selection. Possible functions are: PenalisedRegression, SparsePLS, GroupPLS and SparseGroupPLS. Alternatively, a user-defined function can be provided.

resampling

resampling approach. Possible values are: "subsampling" for sampling without replacement of a proportion tau of the observations, or "bootstrap" for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.
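
A minimal sketch of a valid user-defined resampling function (the argument names data and tau are imposed by the interface; the body is illustrative):

MySubsample <- function(data, tau) {
  # Return the IDs of observations to include in the resampled dataset:
  # here, a proportion tau of the rows, drawn without replacement
  sample(seq_len(nrow(data)), size = floor(tau * nrow(data)), replace = FALSE)
}
stab <- VariableSelection(xdata = xdata, ydata = ydata, resampling = MySubsample)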

cpss

logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered selected at a given iteration if it is selected in both subsamples. With this method, the data is split K/2 times (so that K models are fitted in total). Only used if PFER_method="MB".

PFER_method

method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If PFER_method="MB", the method proposed by Meinshausen and Bühlmann (2010) is used. If PFER_method="SS", the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).

Lambda_cardinal

number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. Only used if Lambda=NULL.

group_x

vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. Only used for models with group penalisation (e.g. implementation=GroupPLS or implementation=SparseGroupPLS).

group_penalisation

logical indicating if a group penalisation should be considered in the stability score. The use of group_penalisation=TRUE strictly applies to group (not sparse-group) penalisation.

optimisation

character string indicating the type of optimisation method. With optimisation="grid_search" (the default), all values in Lambda are visited. Alternatively, optimisation algorithms implemented in nloptr can be used with optimisation="nloptr". By default, we use "algorithm"="NLOPT_GN_DIRECT_L", "xtol_abs"=0.1, "ftol_abs"=0.1 and "maxeval"=Lambda_cardinal. These values can be changed by providing the argument opts (see nloptr). For stability selection using penalised regression, optimisation="grid_search" may be faster as it allows for warm starts.
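
For example, the defaults could be overridden as follows (a sketch; the values of xtol_abs, ftol_abs and maxeval are illustrative):

stab <- VariableSelection(
  xdata = xdata, ydata = ydata,
  optimisation = "nloptr",
  opts = list(
    "algorithm" = "NLOPT_GN_DIRECT_L",
    "xtol_abs" = 0.1, "ftol_abs" = 0.1, "maxeval" = 50
  )
)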

n_cores

number of cores to use for parallel computing (see argument workers in multisession). Using n_cores>1 is only supported with optimisation="grid_search".

output_data

logical indicating if the input datasets xdata and ydata should be included in the output.

verbose

logical indicating if a loading bar and messages should be printed.

beep

sound indicating the end of the run. Possible values are: NULL (no sound) or an integer between 1 and 11 (see argument sound in beep).

...

additional parameters passed to the functions provided in implementation or resampling.

Details

In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm, and the threshold in selection proportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

If argument group_penalisation=FALSE, "feature" refers to a variable (variable selection model). If argument group_penalisation=TRUE, "feature" refers to a group of variables (group selection model). In this case, groups need to be defined a priori and specified in argument group_x.

These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edge of the calibration heatmap, as in the sketch below. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak, which is the one automatically retained by the functions in this package. The user can decide to manually choose another peak.
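
A minimal sketch of this workflow (the penalty values are illustrative; Lambda is supplied as a one-column matrix):

# Extending the grid towards smaller penalties when the maximum lies on an edge
Lambda <- cbind(seq(0.5, 0.001, length.out = 100))
stab <- VariableSelection(
  xdata = xdata, ydata = ydata,
  family = "gaussian", Lambda = Lambda
)
CalibrationPlot(stab)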

To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
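
As a sketch, the calibration can be constrained to models with at most 5 expected false positives:

stab <- VariableSelection(xdata = xdata, ydata = ydata, PFER_thr = 5)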

Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
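
As a sketch, complementary pair stability selection can be requested as follows (with K=100, the data is split 50 times and 100 models are fitted):

stab <- VariableSelection(
  xdata = xdata, ydata = ydata,
  cpss = TRUE, K = 100
)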

For categorical or time-to-event outcomes (argument family is "binomial", "multinomial" or "cox"), the proportions of observations from each category in all subsamples or bootstrap samples are the same as in the full sample.

To ensure reproducibility of the results, the random number generator is initialised with seed.

For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine.
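
A sketch of the manual alternative (it assumes that the two runs differ only by seed and that Combine() takes the fitted objects as stability1 and stability2):

stab1 <- VariableSelection(xdata = xdata, ydata = ydata, K = 50, seed = 1)
stab2 <- VariableSelection(xdata = xdata, ydata = ydata, K = 50, seed = 2)
stab <- Combine(stability1 = stab1, stability2 = stab2) # pooled over 100 iterations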

Value

An object of class variable_selection. A list with:

S

a matrix of the best stability scores for different parameters controlling the level of sparsity in the underlying algorithm.

Lambda

a matrix of parameters controlling the level of sparsity in the underlying algorithm.

Q

a matrix of the average number of selected features by the underlying algorithm with different parameters controlling the level of sparsity.

Q_s

a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity.

P

a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm.

PFER

a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity.

FDP

a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity.

S_2d

a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.

PFER_2d

a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.

FDP_2d

a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.

selprop

a matrix of selection proportions. Columns correspond to predictors from xdata.

Beta

an array of model coefficients. Columns correspond to predictors from xdata. Indices along the third dimension correspond to different resampling iterations. With multivariate outcomes, indices along the fourth dimension correspond to outcome-specific coefficients.

method

a list with type="variable_selection" and values used for arguments implementation, family, resampling, cpss and PFER_method.

params

a list with values used for arguments K, pi_list, tau, n_cat, pk, n (number of observations), PFER_thr, FDP_thr and seed. The datasets xdata and ydata are also included if output_data=TRUE.

For all matrices and arrays returned, the rows are ordered in the same way and correspond to parameter values stored in Lambda.
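
For example, the calibrated model can be located by row (a sketch; it assumes that the first element returned by ArgmaxId() is the index of the calibrated penalty):

id <- ArgmaxId(stab)[1] # row of Lambda retained by the calibration
stab$Lambda[id, ] # calibrated penalty parameter(s)
stab$P[id, ] # calibrated threshold(s) in selection proportions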

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.

Tibshirani R (1996). “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. ISSN 00359246, http://www.jstor.org/stable/2346178.

See Also

PenalisedRegression, SelectionAlgo, LambdaGridRegression, Resample, StabilityScore, Refit, ExplanatoryPerformance, Incremental

Other stability functions: BiSelection(), Clustering(), GraphicalModel(), StructuralModel()

Examples


oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

# Linear regression
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian"
)

# Calibration plot
CalibrationPlot(stab)

# Extracting the results
summary(stab)
Stable(stab)
SelectionProportions(stab)
plot(stab)

# Using randomised LASSO
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian", penalisation = "randomised"
)
plot(stab)

# Using adaptive LASSO
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian", penalisation = "adaptive"
)
plot(stab)

# Using additional arguments from glmnet (e.g. penalty.factor)
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata, family = "gaussian",
  penalty.factor = c(rep(1, 45), rep(0, 5))
)
head(coef(stab))

# Using CART
if (requireNamespace("rpart", quietly = TRUE)) {
  stab <- VariableSelection(
    xdata = simul$xdata, ydata = simul$ydata,
    implementation = CART,
    family = "gaussian",
  )
  plot(stab)
}

# Regression with multivariate outcomes
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 20, q = 3, family = "gaussian")
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "mgaussian"
)
summary(stab)

# Logistic regression
set.seed(1)
simul <- SimulateRegression(n = 200, pk = 10, family = "binomial", ev_xy = 0.8)
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "binomial"
)
summary(stab)

# Sparse PCA (1 component, see BiSelection for more components)
if (requireNamespace("elasticnet", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateComponents(pk = c(5, 3, 4))
  stab <- VariableSelection(
    xdata = simul$data,
    Lambda = seq_len(ncol(simul$data) - 1),
    implementation = SparsePCA
  )
  CalibrationPlot(stab, xlab = "")
  summary(stab)
}

# Sparse PLS (1 outcome, 1 component, see BiSelection for more options)
if (requireNamespace("sgPLS", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
  stab <- VariableSelection(
    xdata = simul$xdata, ydata = simul$ydata,
    Lambda = seq_len(ncol(simul$xdata) - 1),
    implementation = SparsePLS, family = "gaussian"
  )
  CalibrationPlot(stab, xlab = "")
  SelectedVariables(stab)
}

# Group PLS (1 outcome, 1 component, see BiSelection for more options)
if (requireNamespace("sgPLS", quietly = TRUE)) {
  stab <- VariableSelection(
    xdata = simul$xdata, ydata = simul$ydata,
    Lambda = seq_len(5),
    group_x = c(5, 5, 10, 20, 10),
    group_penalisation = TRUE,
    implementation = GroupPLS, family = "gaussian"
  )
  CalibrationPlot(stab, xlab = "")
  SelectedVariables(stab)
}

# Example with more hyper-parameters: elastic net
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 50, family = "gaussian")
TuneElasticNet <- function(xdata, ydata, family, alpha) {
  stab <- VariableSelection(
    xdata = xdata, ydata = ydata,
    family = family, alpha = alpha, verbose = FALSE
  )
  return(max(stab$S, na.rm = TRUE))
}
myopt <- optimise(TuneElasticNet,
  lower = 0.1, upper = 1, maximum = TRUE,
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian"
)
stab <- VariableSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian", alpha = myopt$maximum
)
summary(stab)
enet <- SelectedVariables(stab)

# Comparison with LASSO
stab <- VariableSelection(xdata = simul$xdata, ydata = simul$ydata, family = "gaussian")
summary(stab)
lasso <- SelectedVariables(stab)
table(lasso, enet)

# Example using an external function: group-LASSO with gglasso
if (requireNamespace("gglasso", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateRegression(n = 200, pk = 20, family = "binomial")
  ManualGridGroupLasso <- function(xdata, ydata, family, group_x, ...) {
    # Defining the grouping
    group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) {
      rep(i, group_x[i])
    }))

    if (family == "binomial") {
      ytmp <- ydata
      ytmp[ytmp == min(ytmp)] <- -1
      ytmp[ytmp == max(ytmp)] <- 1
      return(gglasso::gglasso(xdata, ytmp, loss = "logit", group = group, ...))
    } else {
      return(gglasso::gglasso(xdata, ydata, loss = "ls", group = group, ...))
    }
  }
  Lambda <- LambdaGridRegression(
    xdata = simul$xdata, ydata = simul$ydata,
    family = "binomial", Lambda_cardinal = 20,
    implementation = ManualGridGroupLasso,
    group_x = rep(5, 4)
  )
  GroupLasso <- function(xdata, ydata, Lambda, family, group_x, ...) {
    # Defining the grouping
    group <- do.call(c, lapply(seq_len(length(group_x)), FUN = function(i) {
      rep(i, group_x[i])
    }))

    # Running the regression
    if (family == "binomial") {
      ytmp <- ydata
      ytmp[ytmp == min(ytmp)] <- -1
      ytmp[ytmp == max(ytmp)] <- 1
      mymodel <- gglasso::gglasso(xdata, ytmp, lambda = Lambda, loss = "logit", group = group, ...)
    }
    if (family == "gaussian") {
      mymodel <- gglasso::gglasso(xdata, ydata, lambda = Lambda, loss = "ls", group = group, ...)
    }
    # Extracting and formatting the beta coefficients
    beta_full <- t(as.matrix(mymodel$beta))
    beta_full <- beta_full[, colnames(xdata)]

    selected <- ifelse(beta_full != 0, yes = 1, no = 0)

    return(list(selected = selected, beta_full = beta_full))
  }
  stab <- VariableSelection(
    xdata = simul$xdata, ydata = simul$ydata,
    implementation = GroupLasso, family = "binomial", Lambda = Lambda,
    group_x = rep(5, 4),
    group_penalisation = TRUE
  )
  summary(stab)
}

par(oldpar)


