R: Stability selection graphical model

GraphicalModel {sharp}

R Documentation

Stability selection graphical model

Description

Performs stability selection for graphical models. The underlying graphical model (e.g. graphical LASSO) is run with different combinations of parameters controlling the sparsity (e.g. penalty parameter) and thresholds in selection proportions. These two hyper-parameters are jointly calibrated by maximisation of the stability score.

Usage

GraphicalModel(
  xdata,
  pk = NULL,
  Lambda = NULL,
  lambda_other_blocks = 0.1,
  pi_list = seq(0.01, 0.99, by = 0.01),
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = NULL,
  implementation = PenalisedGraphical,
  start = "warm",
  scale = TRUE,
  resampling = "subsampling",
  cpss = FALSE,
  PFER_method = "MB",
  PFER_thr = Inf,
  FDP_thr = Inf,
  Lambda_cardinal = 50,
  lambda_max = NULL,
  lambda_path_factor = 0.001,
  max_density = 0.5,
  optimisation = c("grid_search", "nloptr"),
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)

Arguments

`xdata`	data matrix with observations as rows and variables as columns. For multi-block stability selection, the variables in data have to be ordered by group.
`pk`	optional vector encoding the grouping structure. Only used for multi-block stability selection where `pk` indicates the number of variables in each group. If `pk=NULL`, single-block stability selection is performed.
`Lambda`	matrix of parameters controlling the level of sparsity in the underlying feature selection algorithm specified in `implementation`. If `Lambda=NULL` and `implementation=PenalisedGraphical`, `LambdaGridGraphical` is used to define a relevant grid. `Lambda` can be provided as a vector or a matrix with `length(pk)` columns.
`lambda_other_blocks`	optional vector of parameters controlling the level of sparsity in neighbour blocks for the multi-block procedure. To use jointly a specific set of parameters for each block, `lambda_other_blocks` must be set to `NULL` (not recommended). Only used for multi-block stability selection, i.e. if `length(pk)>1`.
`pi_list`	vector of thresholds in selection proportions. If `n_cat=NULL` or `n_cat=2`, these values must be `>0` and `<1`. If `n_cat=3`, these values must be `>0.5` and `<1`.
`K`	number of resampling iterations.
`tau`	subsample size. Only used if `resampling="subsampling"` and `cpss=FALSE`.
`seed`	value of the seed to initialise the random number generator and ensure reproducibility of the results (see `set.seed`).
`n_cat`	computation options for the stability score. Default is `NULL` to use the score based on a z test. Other possible values are 2 or 3 to use the score based on the negative log-likelihood.
`implementation`	function to use for graphical modelling. If `implementation=PenalisedGraphical`, the algorithm implemented in `glassoFast` is used for regularised estimation of a conditional independence graph. Alternatively, a user-defined function can be provided.
`start`	character string indicating if the algorithm should be initialised at the estimated (inverse) covariance with previous penalty parameters (`start="warm"`) or not (`start="cold"`). Using `start="warm"` can speed-up the computations, but could lead to convergence issues (in particular with small `Lambda_cardinal`). Only used for `implementation=PenalisedGraphical` (see argument `"start"` in `glassoFast`).
`scale`	logical indicating if the correlation (`scale=TRUE`) or covariance (`scale=FALSE`) matrix should be used as input of `glassoFast` if `implementation=PenalisedGraphical`. Otherwise, this argument must be used in the function provided in `implementation`.
`resampling`	resampling approach. Possible values are: `"subsampling"` for sampling without replacement of a proportion `tau` of the observations, or `"bootstrap"` for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named `data` and `tau` and return the IDs of observations to be included in the resampled dataset.
`cpss`	logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered as selected if it is selected for both subsamples. With this method, the data is split `K/2` times (`K` models are fitted). Only used if `PFER_method="MB"`.
`PFER_method`	method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If `PFER_method="MB"`, the method proposed by Meinshausen and Bühlmann (2010) is used. If `PFER_method="SS"`, the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.
`PFER_thr`	threshold in PFER for constrained calibration by error control. If `PFER_thr=Inf` and `FDP_thr=Inf`, unconstrained calibration is used (the default).
`FDP_thr`	threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If `PFER_thr=Inf` and `FDP_thr=Inf`, unconstrained calibration is used (the default).
`Lambda_cardinal`	number of values in the grid of parameters controlling the level of sparsity in the underlying algorithm. Only used if `Lambda=NULL`.
`lambda_max`	optional maximum value for the grid in penalty parameters. If `lambda_max=NULL`, the maximum value is set to the maximum covariance in absolute value. Only used if `implementation=PenalisedGraphical` and `Lambda=NULL`.
`lambda_path_factor`	multiplicative factor used to define the minimum value in the grid.
`max_density`	threshold on the density. The grid is defined such that the density of the estimated graph does not exceed max_density.
`optimisation`	character string indicating the type of optimisation method. With `optimisation="grid_search"` (the default), all values in `Lambda` are visited. Alternatively, optimisation algorithms implemented in `nloptr` can be used with `optimisation="nloptr"`. By default, we use `"algorithm"="NLOPT_GN_DIRECT_L"`, `"xtol_abs"=0.1`, `"ftol_abs"=0.1` and `"maxeval"=Lambda_cardinal`. These values can be changed by providing the argument `opts` (see `nloptr`). For stability selection using penalised regression, `optimisation="grid_search"` may be faster as it allows for warm start.
`n_cores`	number of cores to use for parallel computing (see argument `workers` in `multisession`). Using `n_cores>1` is only supported with `optimisation="grid_search"`.
`output_data`	logical indicating if the input datasets `xdata` and `ydata` should be included in the output.
`verbose`	logical indicating if a loading bar and messages should be printed.
`beep`	sound indicating the end of the run. Possible values are: `NULL` (no sound) or an integer between 1 and 11 (see argument `sound` in `beep`).
`...`	additional parameters passed to the functions provided in `implementation` or `resampling`.

Details

In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (Lambda). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions above a threshold pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) for the underlying algorithm, and the threshold in selection proportion:

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks of stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tend to give better selection performances. This is not necessarily the highest peak (which is automatically retained by the functions in this package). The user can decide to manually choose another peak.

To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).

Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.

To ensure reproducibility of the results, the starting number of the random number generator is set to seed.

For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession. Alternatively, the function can be run manually with different seeds and all other parameters equal. The results can then be combined using Combine.

The generated network can be converted into igraph object using Graph. The R package visNetwork can be used for interactive network visualisation (see examples in Graph).

Value

An object of class graphical_model. A list with:

`S`	a matrix of the best stability scores for different (sets of) parameters controlling the level of sparsity in the underlying algorithm.
`Lambda`	a matrix of parameters controlling the level of sparsity in the underlying algorithm.
`Q`	a matrix of the average number of selected features by the underlying algorithm with different parameters controlling the level of sparsity.
`Q_s`	a matrix of the calibrated number of stably selected features with different parameters controlling the level of sparsity.
`P`	a matrix of calibrated thresholds in selection proportions for different parameters controlling the level of sparsity in the underlying algorithm.
`PFER`	a matrix of upper-bounds in PFER of calibrated stability selection models with different parameters controlling the level of sparsity.
`FDP`	a matrix of upper-bounds in FDP of calibrated stability selection models with different parameters controlling the level of sparsity.
`S_2d`	a matrix of stability scores obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions.
`PFER_2d`	a matrix of upper-bounds in FDP obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. Only returned if `length(pk)=1`.
`FDP_2d`	a matrix of upper-bounds in PFER obtained with different combinations of parameters. Columns correspond to different thresholds in selection proportions. Only returned if `length(pk)=1`.
`selprop`	an array of selection proportions. Rows and columns correspond to nodes in the graph. Indices along the third dimension correspond to different parameters controlling the level of sparsity in the underlying algorithm.
`sign`	a matrix of signs of Pearson's correlations estimated from `xdata`.
`method`	a list with `type="graphical_model"` and values used for arguments `implementation`, `start`, `resampling`, `cpss` and `PFER_method`.
`params`	a list with values used for arguments `K`, `pi_list`, `tau`, `n_cat`, `pk`, `n` (number of observations in `xdata`), `PFER_thr`, `FDP_thr`, `seed`, `lambda_other_blocks`, and `Sequential_template`.

The rows of S, Lambda, Q, Q_s, P, PFER, FDP, S_2d, PFER_2d and FDP_2d, and indices along the third dimension of selprop are ordered in the same way and correspond to parameter values stored in Lambda. For multi-block inference, the columns of S, Lambda, Q, Q_s, P, PFER and FDP, and indices along the third dimension of S_2d correspond to the different blocks.

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.

Friedman J, Hastie T, Tibshirani R (2008). “Sparse inverse covariance estimation with the graphical lasso.” Biostatistics, 9(3), 432–441.

Examples


oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

## Single-block stability selection

# Data simulation
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)

# Stability selection
stab <- GraphicalModel(xdata = simul$data)
print(stab)

# Calibration heatmap
CalibrationPlot(stab)

# Visualisation of the results
summary(stab)
plot(stab)

# Extraction of adjacency matrix or igraph object
Adjacency(stab)
Graph(stab)


## Multi-block stability selection

# Data simulation
set.seed(1)
simul <- SimulateGraphical(pk = c(10, 10))

# Stability selection
stab <- GraphicalModel(xdata = simul$data, pk = c(10, 10), Lambda_cardinal = 10)
print(stab)

# Calibration heatmap
# par(mfrow = c(1, 3))
CalibrationPlot(stab) # Producing three plots

# Visualisation of the results
summary(stab)
plot(stab)

# Multi-parameter stability selection (not recommended)
Lambda <- matrix(c(0.8, 0.6, 0.3, 0.5, 0.4, 0.3, 0.7, 0.5, 0.1), ncol = 3)
stab <- GraphicalModel(
  xdata = simul$data, pk = c(10, 10),
  Lambda = Lambda, lambda_other_blocks = NULL
)
stab$Lambda


## Example with user-defined function: shrinkage estimation and selection

# Data simulation
set.seed(1)
simul <- SimulateGraphical(n = 100, pk = 20, nu_within = 0.1)

if (requireNamespace("corpcor", quietly = TRUE)) {
  # Writing user-defined algorithm in a portable function
  ShrinkageSelection <- function(xdata, Lambda, ...) {
    mypcor <- corpcor::pcor.shrink(xdata, verbose = FALSE)
    adjacency <- array(NA, dim = c(nrow(mypcor), ncol(mypcor), nrow(Lambda)))
    for (k in seq_len(nrow(Lambda))) {
      A <- ifelse(abs(mypcor) >= Lambda[k, 1], yes = 1, no = 0)
      diag(A) <- 0
      adjacency[, , k] <- A
    }
    return(list(adjacency = adjacency))
  }

  # Running the algorithm without stability
  myglasso <- GraphicalAlgo(
    xdata = simul$data,
    Lambda = matrix(c(0.05, 0.1), ncol = 1), implementation = ShrinkageSelection
  )

  # Stability selection using shrinkage estimation and selection
  stab <- GraphicalModel(
    xdata = simul$data, Lambda = matrix(c(0.01, 0.05, 0.1), ncol = 1),
    implementation = ShrinkageSelection
  )
  CalibrationPlot(stab)
  stable_adjacency <- Adjacency(stab)
}

par(oldpar)

[Package sharp version 1.4.6 Index]