BiSelection {sharp}        R Documentation

Stability selection of predictors and/or outcomes

Description

Performs stability selection for dimensionality reduction. The underlying variable selection algorithm (e.g. sparse PLS) is run with different combinations of parameters controlling the sparsity (e.g. number of selected variables per component) and thresholds in selection proportions. These hyper-parameters are jointly calibrated by maximisation of the stability score.

Usage

BiSelection(
  xdata,
  ydata = NULL,
  group_x = NULL,
  group_y = NULL,
  LambdaX = NULL,
  LambdaY = NULL,
  AlphaX = NULL,
  AlphaY = NULL,
  ncomp = 1,
  scale = TRUE,
  pi_list = seq(0.01, 0.99, by = 0.01),
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = NULL,
  family = "gaussian",
  implementation = SparsePLS,
  resampling = "subsampling",
  cpss = FALSE,
  PFER_method = "MB",
  PFER_thr = Inf,
  FDP_thr = Inf,
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)

Arguments

xdata

matrix of predictors with observations as rows and variables as columns.

ydata

optional vector or matrix of outcome(s). If family is set to "binomial" or "multinomial", ydata can be a vector with character/numeric values or a factor.

group_x

vector encoding the grouping structure among predictors. This argument indicates the number of variables in each group. Only used for models with group penalisation (e.g. implementation=GroupPLS or implementation=SparseGroupPLS).

group_y

optional vector encoding the grouping structure among outcomes. This argument indicates the number of variables in each group. Only used if implementation=GroupPLS or implementation=SparseGroupPLS.
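As a hypothetical illustration (assuming that group members occupy consecutive columns), fifteen predictors split into three groups could be encoded as follows:

# Hypothetical example: columns 1-2 of xdata form group 1, columns 3-10
# form group 2, and columns 11-15 form group 3
group_x <- c(2, 8, 5)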

LambdaX

matrix of parameters controlling the number of selected variables (for sparse PCA/PLS) or groups (for group and sparse group PLS) in X.

LambdaY

matrix of parameters controlling the number of selected variables (for sparse PLS) or groups (for group or sparse group PLS) in Y. Only used if family="gaussian".

AlphaX

matrix of parameters controlling the level of sparsity within groups in X. Only used if implementation=SparseGroupPLS.

AlphaY

matrix of parameters controlling the level of sparsity within groups in Y. Only used if implementation=SparseGroupPLS and family="gaussian".

ncomp

number of components.

scale

logical indicating if the data should be scaled (i.e. transformed so that all variables have a standard deviation of one).

pi_list

vector of thresholds in selection proportions. If n_cat=NULL or n_cat=2, these values must be >0 and <1. If n_cat=3, these values must be >0.5 and <1.

K

number of resampling iterations.

tau

subsample size. Only used if resampling="subsampling" and cpss=FALSE.

seed

value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).

n_cat

computation options for the stability score. Default is NULL to use the score based on a z test. Other possible values are 2 or 3 to use the score based on the negative log-likelihood.

family

type of PLS model. This parameter must be set to family="gaussian" for continuous outcomes, or to family="binomial" for categorical outcomes. Only used if ydata is provided.

implementation

function to use for feature selection. Possible functions are: SparsePCA, SparsePLS, GroupPLS, SparseGroupPLS.

resampling

resampling approach. Possible values are: "subsampling" for sampling without replacement of a proportion tau of the observations, or "bootstrap" for sampling with replacement generating a resampled dataset with as many observations as in the full sample. Alternatively, this argument can be a function to use for resampling. This function must use arguments named data and tau and return the IDs of observations to be included in the resampled dataset.
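As a sketch, a user-defined resampling function matching this interface could look like the example below (ManualSubsample is a hypothetical name; row indices are assumed to serve as observation IDs):

# Hypothetical resampling function: draws a proportion tau of the
# observations without replacement and returns their row indices
ManualSubsample <- function(data, tau, ...) {
  n <- nrow(data)
  sample(seq_len(n), size = round(n * tau))
}
# It could then be supplied via resampling = ManualSubsample in BiSelection()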

cpss

logical indicating if complementary pair stability selection should be done. For this, the algorithm is applied on two non-overlapping subsets of half of the observations. A feature is considered selected if it is selected in both subsamples. With this method, the data is split K/2 times (K models are fitted). Only used if PFER_method="MB".

PFER_method

method used to compute the upper-bound of the expected number of False Positives (or Per Family Error Rate, PFER). If PFER_method="MB", the method proposed by Meinshausen and Bühlmann (2010) is used. If PFER_method="SS", the method proposed by Shah and Samworth (2013) under the assumption of unimodality is used.

PFER_thr

threshold in PFER for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).

FDP_thr

threshold in the expected proportion of falsely selected features (or False Discovery Proportion) for constrained calibration by error control. If PFER_thr=Inf and FDP_thr=Inf, unconstrained calibration is used (the default).

n_cores

number of cores to use for parallel computing (see argument workers in multisession). Using n_cores>1 is only supported with optimisation="grid_search".

output_data

logical indicating if the input datasets xdata and ydata should be included in the output.

verbose

logical indicating if a loading bar and messages should be printed.

beep

sound indicating the end of the run. Possible values are: NULL (no sound) or an integer between 1 and 11 (see argument sound in beep).

...

additional parameters passed to the functions provided in implementation or resampling.

Details

In stability selection, a feature selection algorithm is fitted on K subsamples (or bootstrap samples) of the data with different parameters controlling the sparsity (LambdaX, LambdaY, AlphaX, and/or AlphaY). For a given (set of) sparsity parameter(s), the proportion out of the K models in which each feature is selected is calculated. Features with selection proportions at or above a threshold \pi are considered stably selected. The stability selection model is controlled by the sparsity parameter(s) (denoted by \lambda) of the underlying algorithm, and by the threshold in selection proportion (denoted by \pi):

V_{\lambda, \pi} = \{ j: p_{\lambda}(j) \ge \pi \}

For sparse and sparse group dimensionality reduction, "feature" refers to variable (variable selection model). For group PLS, "feature" refers to group (group selection model). For (sparse) group PLS, groups need to be defined a priori and specified in arguments group_x and/or group_y.
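As a toy illustration of this definition (using base R only, with a hypothetical binary matrix of selection statuses rather than package internals), selection proportions can be computed and thresholded as follows:

# Toy illustration: selected_k is a hypothetical K x p binary matrix with
# one row per resampling iteration and one column per feature
K <- 100
p <- 20
selected_k <- matrix(rbinom(K * p, size = 1, prob = 0.3), nrow = K, ncol = p)
selprop <- colMeans(selected_k)        # selection proportions p_lambda(j)
pi_thr <- 0.9                          # threshold in selection proportion
stable_set <- which(selprop >= pi_thr) # stable set V_{lambda, pi}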

These parameters can be calibrated by maximisation of a stability score (see ConsensusScore if n_cat=NULL or StabilityScore otherwise) calculated under the null hypothesis of equiprobability of selection.

It is strongly recommended to examine the calibration plot carefully to check that the grids of parameters Lambda and pi_list do not restrict the calibration to a region that would not include the global maximum (see CalibrationPlot). In particular, the grid Lambda may need to be extended when the maximum stability is observed on the left or right edges of the calibration heatmap. In some instances, multiple peaks in the stability score can be observed. Simulation studies suggest that the peak corresponding to the largest number of selected features tends to give better selection performance. This is not necessarily the highest peak (which is the one automatically retained by the functions in this package). The user can decide to manually choose another peak.

To control the expected number of False Positives (Per Family Error Rate) in the results, a threshold PFER_thr can be specified. The optimisation problem is then constrained to sets of parameters that generate models with an upper-bound in PFER below PFER_thr (see Meinshausen and Bühlmann (2010) and Shah and Samworth (2013)).
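As a sketch, error control can be requested by supplying a finite threshold (the data are simulated purely for illustration and the threshold value is arbitrary):

# Sketch: constrained calibration keeping the upper-bound in PFER below 5
set.seed(1)
simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian")
stab <- BiSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian",
  LambdaX = seq_len(ncol(simul$xdata) - 1),
  implementation = SparsePLS,
  PFER_thr = 5
)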

Possible resampling procedures include defining (i) K subsamples of a proportion tau of the observations, (ii) K bootstrap samples with the full sample size (obtained with replacement), and (iii) K/2 splits of the data in half for complementary pair stability selection (see arguments resampling and cpss). In complementary pair stability selection, a feature is considered selected at a given resampling iteration if it is selected in the two complementary subsamples.
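A sketch of complementary pair stability selection, re-using the simulated data from the sketch above (with the default K = 100, the data are split 50 times):

# Sketch: complementary pair stability selection (requires PFER_method = "MB",
# the default)
stab <- BiSelection(
  xdata = simul$xdata, ydata = simul$ydata,
  family = "gaussian",
  LambdaX = seq_len(ncol(simul$xdata) - 1),
  implementation = SparsePLS,
  cpss = TRUE
)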

For categorical outcomes (argument family is "binomial" or "multinomial"), the proportions of observations from each category in all subsamples or bootstrap samples are the same as in the full sample.

To ensure reproducibility of the results, the starting number of the random number generator is set to seed.

For parallelisation, stability selection with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession.

Value

An object of class bi_selection. A list with:

summary

a matrix of the best stability scores and corresponding parameters controlling the level of sparsity in the underlying algorithm for different numbers of components. Possible columns include: comp (component index), nx (number of predictors to include, parameter of the underlying algorithm), alphax (sparsity within the predictor groups, parameter of the underlying algorithm), pix (threshold in selection proportion for predictors), ny (number of outcomes to include, parameter of the underlying algorithm), alphay (sparsity within the outcome groups, parameter of the underlying algorithm), piy (threshold in selection proportion for outcomes), S (stability score). Columns that are not relevant to the model are not reported (e.g. alphax and alphay are not returned for sparse PLS models).

summary_full

a matrix of the best stability scores for different combinations of parameters controlling the sparsity and components.

selectedX

a binary matrix encoding stably selected predictors.

selpropX

a matrix of calibrated selection proportions for predictors.

selectedY

a binary matrix encoding stably selected outcomes. Only returned for PLS models.

selpropY

a matrix of calibrated selection proportions for outcomes. Only returned for PLS models.

selected

a binary matrix encoding stable relationships between predictor and outcome variables. Only returned for PLS models.

selectedX_full

a binary matrix encoding stably selected predictors.

selpropX_full

a matrix of selection proportions for predictors.

selectedY_full

a binary matrix encoding stably selected outcomes. Only returned for PLS models.

selpropY_full

a matrix of selection proportions for outcomes. Only returned for PLS models.

coefX

an array of estimated loading coefficients for the different components (rows), for the predictors (columns), as obtained across the K visited models (along the third dimension).

coefY

an array of estimated loading coefficients for the different components (rows), for the outcomes (columns), as obtained across the K visited models (along the third dimension). Only returned for PLS models.

method

a list with type="bi_selection" and values used for arguments implementation, family, scale, resampling, cpss and PFER_method.

params

a list with values used for arguments K, group_x, group_y, LambdaX, LambdaY, AlphaX, AlphaY, pi_list, tau, n_cat, pk, n (number of observations), PFER_thr, FDP_thr and seed. The datasets xdata and ydata are also included if output_data=TRUE.

The rows of summary and columns of selectedX, selectedY, selpropX, selpropY, selected, coefX and coefY are ordered in the same way and correspond to components and parameter values stored in summary. The rows of summary_full and columns of selectedX_full, selectedY_full, selpropX_full and selpropY_full are ordered in the same way and correspond to components and parameter values stored in summary_full.
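For instance, the calibrated results stored in these components can be inspected as sketched below, where stab denotes an object returned by BiSelection (as created in the Examples section):

# Sketch: inspecting a fitted bi_selection object
stab$summary     # calibrated sparsity parameters, thresholds and stability scores
stab$selectedX   # binary selection status of the predictors, per component
stab$selpropX    # calibrated selection proportions of the predictors
SelectedVariables(stab) # extraction function, as used in the Examples below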

References

Bodinier B, Filippi S, Nøst TH, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration for stability selection in penalised regression and graphical models.” Journal of the Royal Statistical Society Series C: Applied Statistics, qlad058. ISSN 0035-9254, doi:10.1093/jrsssc/qlad058, https://academic.oup.com/jrsssc/advance-article-pdf/doi/10.1093/jrsssc/qlad058/50878777/qlad058.pdf.

Shah RD, Samworth RJ (2013). “Variable selection with error control: another look at stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80. doi:10.1111/j.1467-9868.2011.01034.x.

Meinshausen N, Bühlmann P (2010). “Stability selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473. doi:10.1111/j.1467-9868.2010.00740.x.

Liquet B, de Micheaux PL, Hejblum BP, Thiébaut R (2016). “Group and sparse group partial least square approaches applied in genomics context.” Bioinformatics, 32(1), 35-42. ISSN 1367-4803, doi:10.1093/bioinformatics/btv535.

Lê Cao KA, Rossouw D, Robert-Granié C, Besse P (2008). “A sparse PLS for variable selection when integrating omics data.” Stat Appl Genet Mol Biol, 7(1), Article 35. ISSN 1544-6115, doi:10.2202/1544-6115.1390.

Shen H, Huang JZ (2008). “Sparse principal component analysis via regularized low rank matrix approximation.” Journal of Multivariate Analysis, 99(6), 1015-1034. ISSN 0047-259X, doi:10.1016/j.jmva.2007.06.007.

Zou H, Hastie T, Tibshirani R (2006). “Sparse Principal Component Analysis.” Journal of Computational and Graphical Statistics, 15(2), 265-286. doi:10.1198/106186006X113430.

See Also

SparsePCA, SparsePLS, GroupPLS, SparseGroupPLS, VariableSelection, Resample, StabilityScore

Other stability functions: Clustering(), GraphicalModel(), StructuralModel(), VariableSelection()

Examples


if (requireNamespace("sgPLS", quietly = TRUE)) {
  oldpar <- par(no.readonly = TRUE)
  par(mar = c(12, 5, 1, 1))

  ## Sparse Principal Component Analysis

  # Data simulation
  set.seed(1)
  simul <- SimulateComponents(pk = c(5, 3, 4))

  # sPCA: sparsity on X (unsupervised)
  stab <- BiSelection(
    xdata = simul$data,
    ncomp = 2,
    LambdaX = seq_len(ncol(simul$data) - 1),
    implementation = SparsePCA
  )
  print(stab)

  # Calibration plot
  CalibrationPlot(stab)

  # Visualisation of the results
  summary(stab)
  plot(stab)
  SelectedVariables(stab)


  ## Sparse (Group) Partial Least Squares

  # Data simulation (continuous outcomes)
  set.seed(1)
  simul <- SimulateRegression(n = 100, pk = 15, q = 3, family = "gaussian")
  x <- simul$xdata
  y <- simul$ydata

  # sPLS: sparsity on X
  stab <- BiSelection(
    xdata = x, ydata = y,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(x) - 1),
    implementation = SparsePLS
  )
  CalibrationPlot(stab)
  summary(stab)
  plot(stab)

  # sPLS: sparsity on both X and Y
  stab <- BiSelection(
    xdata = x, ydata = y,
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(ncol(x) - 1),
    LambdaY = seq_len(ncol(y) - 1),
    implementation = SparsePLS,
    n_cat = 2
  )
  CalibrationPlot(stab)
  summary(stab)
  plot(stab)

  # sgPLS: sparsity on X
  stab <- BiSelection(
    xdata = x, ydata = y, K = 10,
    group_x = c(2, 8, 5),
    family = "gaussian", ncomp = 3,
    LambdaX = seq_len(2), AlphaX = seq(0.1, 0.9, by = 0.1),
    implementation = SparseGroupPLS
  )
  CalibrationPlot(stab)
  summary(stab)

  par(oldpar)
}

