CMB.Stabsel {gfboost}    R Documentation

Loss-adapted Stability Selection

Description

Workhorse function for the Stability Selection variant in which either a grid of thresholds or a grid of cardinalities is given. The Boosting models are evaluated on a validation set for every element of the respective grid, and the model that performs best is selected as the stable model.

Usage

CMB.Stabsel(
  Dtrain,
  nsing,
  Bsing = 1,
  B = 100,
  alpha = 1,
  singfam = Gaussian(),
  evalfam = Gaussian(),
  sing = FALSE,
  M = 10,
  m_iter = 100,
  kap = 0.1,
  LS = FALSE,
  best = 1,
  wagg,
  gridtype,
  grid,
  Dvalid,
  ncmb,
  robagg = FALSE,
  lower = 0,
  singcoef = FALSE,
  Mfinal,
  ...
)

Arguments

Dtrain

Data matrix. Has to be an n × (p+1)-dimensional data frame in the format (X,Y). The X-part must not contain an intercept column containing only ones, since this column is added automatically.

nsing

Number of observations (rows) used for the SingBoost submodels.

Bsing

Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to be confused with the parameter B for the Stability Selection.

B

Number of subsamples based on which the CMB models are validated. Default is 100. Not to be confused with Bsing for CMB.

alpha

Optional real number in ]0,1]. Defines the fraction of best SingBoost models used in the aggregation step. Default is 1 (use all models).

singfam

A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is Gaussian() (squared loss).

evalfam

A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is Gaussian() (squared loss).

sing

If sing=FALSE (the default) and the singfam family is a standard Boosting family that is contained in the package mboost, the CMB aggregation procedure is executed for the corresponding standard Boosting models.

M

An integer between 2 and m_iter. Indicates that in every M-th iteration, a singular iteration will be performed. Default is 10.

m_iter

Number of SingBoost iterations. Default is 100.

kap

Learning rate (step size). Must be a real number in ]0,1]. Default is 0.1. It is recommended to use a value smaller than 0.5.

LS

If the singfam family is one that is already provided by mboost, the respective Boosting algorithm is performed in the singular iterations if LS is set to TRUE. Default is FALSE.

best

Needed in the case of localized ranking. The parameter K of the localized ranking loss is computed as best · n, rounded up to the next integer. Default is 1. Warning: If a parameter K is inserted into the LocRank family, it will be ignored when executing SingBoost.

wagg

Type of row weight aggregation. 'weights1' indicates that the selection frequencies of the (best) SingBoost models are averaged. 'weights2' respects the validation losses for each model and downweights the ones with higher validation losses.

gridtype

Choose between 'pigrid' (grid is a grid of thresholds) and 'qgrid' (grid is a grid of numbers of final variables).

grid

The grid for the thresholds (in ]0,1]) or the numbers of final variables (positive integers).

Dvalid

Validation data for selecting the optimal element of the grid and with it the best corresponding model.

ncmb

Number of samples used for CMB. Integer that must be smaller than the number of samples in Dtrain and larger than nsing.

robagg

Optional. If robagg=TRUE, the best SingBoost models are ignored when executing the aggregation to avoid inlier effects. Only reasonable in combination with lower. Default is FALSE.

lower

Optional argument. Only reasonable when setting robagg=TRUE. lower is a real number in [0,1[ (a rather small number is recommended) and indicates that the aggregation ignores the SingBoost models with the best performances to avoid possible inlier effects.

singcoef

Default is FALSE. Then the coefficients for the candidate stable models are computed by standard linear regression (provided that the number of columns is smaller than the number of samples in the training set for each grid element). If set to TRUE, the coefficients are computed by SingBoost.

Mfinal

Optional. Necessary if singcoef=TRUE to determine the frequency of singular iterations in the SingBoost models.

...

Optional further arguments
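
The constraints stated for the arguments above can be checked before a call. The following is a minimal sketch in base R. The helper check_stabsel_inputs is hypothetical and not part of gfboost; it only encodes the conditions documented on this page, and the iris-based data split is an illustrative assumption.

```r
# Hypothetical helper (not part of gfboost): collects violations of the
# argument constraints documented above and returns them as messages.
check_stabsel_inputs <- function(Dtrain, nsing, ncmb, alpha = 1, M = 10,
                                 m_iter = 100, kap = 0.1, lower = 0,
                                 gridtype, grid) {
  stopifnot(is.data.frame(Dtrain))
  n <- nrow(Dtrain)
  # Format (X, Y): the response is the last column
  X <- Dtrain[, -ncol(Dtrain), drop = FALSE]
  # X must not contain a column consisting only of ones
  is_ones <- vapply(X, function(col) is.numeric(col) && all(col == 1),
                    logical(1))
  grid_ok <- switch(gridtype,
    pigrid = all(grid > 0 & grid <= 1),            # thresholds in ]0,1]
    qgrid  = all(grid > 0 & grid == round(grid)),  # positive integers
    NA                                             # unknown gridtype
  )
  problems <- c(
    if (any(is_ones)) "X contains an intercept column of ones",
    if (!(nsing < ncmb && ncmb < n)) "need nsing < ncmb < nrow(Dtrain)",
    if (!(alpha > 0 && alpha <= 1)) "alpha must lie in ]0,1]",
    if (!(kap > 0 && kap <= 1)) "kap must lie in ]0,1]",
    if (!(M >= 2 && M <= m_iter && M == round(M)))
      "M must be an integer between 2 and m_iter",
    if (!(lower >= 0 && lower < 1)) "lower must lie in [0,1[",
    if (is.na(grid_ok)) "gridtype must be 'pigrid' or 'qgrid'"
    else if (!grid_ok) "grid does not match gridtype"
  )
  as.character(problems)  # character(0) if everything is fine
}

# Illustrative data in the format (X, Y), without an intercept column
Diris <- data.frame(iris[, 2:4], Y = iris$Sepal.Length)
set.seed(1)
ind <- sample(nrow(Diris), 120)
Dtrain <- Diris[ind, ]
Dvalid <- Diris[-ind, ]

ok <- check_stabsel_inputs(Dtrain, nsing = 80, ncmb = 100,
                           gridtype = "pigrid", grid = c(0.5, 0.75, 1))
bad <- check_stabsel_inputs(Dtrain, nsing = 110, ncmb = 100,
                            gridtype = "qgrid", grid = c(1, 2.5))
```

Here ok is empty, while bad reports the violated ordering of nsing and ncmb and the non-integer grid element.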

Details

The Stability Selection in the packages stabs and mboost requires fixing two of three parameters: the per-family error rate, the threshold, and the number of variables that have to be selected in each model. Our Stability Selection is based on a different idea. We also train Boosting models on subsamples, but we use a validation step to determine the size of the optimal model.

More precisely, if 'pigrid' is used as gridtype, the candidate stable model for each threshold is computed by selecting all variables whose aggregated selection frequency exceeds the threshold. These candidate stable models are then validated on Dvalid according to the target loss function (specified through evalfam), and the one with the best validation performance is selected.

If 'qgrid' is used as gridtype, a vector of positive integers has to be entered instead of a vector of thresholds. For each grid element q, the candidate stable model consists of the q variables with the highest aggregated selection frequencies. The validation step is the same.
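
The validation step described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package internals: the function select_stable_model and its arguments are hypothetical, the aggregated selection frequencies are taken as given (made-up numbers here), squared loss stands in for evalfam, the coefficients are computed by least squares (the singcoef=FALSE case), and a variable is kept when its frequency is at least the threshold.

```r
# Hypothetical sketch (not the gfboost implementation): pick the grid element
# whose candidate stable model has the smallest validation loss.
select_stable_model <- function(aggnu, Dtrain, Dvalid, gridtype, grid) {
  p <- ncol(Dtrain) - 1                      # format (X, Y): response is last
  Xtr <- as.matrix(Dtrain[, 1:p, drop = FALSE]); ytr <- Dtrain[[p + 1]]
  Xva <- as.matrix(Dvalid[, 1:p, drop = FALSE]); yva <- Dvalid[[p + 1]]
  ranking <- order(aggnu, decreasing = TRUE)
  best <- list(loss = Inf)
  for (g in grid) {
    cols <- if (gridtype == "pigrid") {
      which(aggnu >= g)                      # assumption: ">=" as the cutoff rule
    } else {
      sort(ranking[seq_len(min(g, p))])      # the g most frequently selected
    }
    if (length(cols) == 0) next              # empty candidate model: skip
    # Least-squares coefficients with an explicitly added intercept column
    fit <- lm.fit(cbind(1, Xtr[, cols, drop = FALSE]), ytr)
    if (anyNA(fit$coefficients)) next        # rank-deficient candidate: skip
    pred <- cbind(1, Xva[, cols, drop = FALSE]) %*% fit$coefficients
    loss <- mean((yva - pred)^2)             # squared loss on the validation set
    if (loss < best$loss) {
      best <- list(loss = loss, grid.opt = g, colind.opt = cols,
                   coeff.opt = fit$coefficients)
    }
  }
  best
}

# Simulated data: only the first two of six variables are relevant
set.seed(1)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p)
D <- data.frame(X, Y = 2 * X[, 1] - 3 * X[, 2] + rnorm(n))
Dtrain <- D[1:150, ]
Dvalid <- D[151:200, ]
aggnu <- c(0.97, 0.88, 0.33, 0.21, 0.12, 0.04)  # made-up selection frequencies

res_pi <- select_stable_model(aggnu, Dtrain, Dvalid, "pigrid",
                              seq(0.1, 1, by = 0.1))
res_q  <- select_stable_model(aggnu, Dtrain, Dvalid, "qgrid", 1:6)
```

Both calls share the same loop; only the rule that turns a grid element into a set of columns differs, which mirrors the statement that the validation step is the same for both grid types.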

Value

colind.opt

The column numbers of the variables that form the best stable model as a vector.

coeff.opt

The coefficients corresponding to the optimal stable model as a vector.

aggnu

Aggregated empirical column measure (i.e., selection frequencies) as a vector.

aggzeta

Aggregated empirical row measure (i.e., row weights) as a vector.
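
To illustrate how an aggregated measure such as aggnu can depend on the arguments alpha and wagg, a minimal sketch in base R follows. The function aggregate_frequencies is hypothetical. The input matrix of per-model selection frequencies, the rounding ceiling(alpha * B), and the inverse-loss weights used for 'weights2' are assumptions made for illustration; the package may compute these quantities differently.

```r
# Hypothetical sketch (not the gfboost implementation): aggregate per-model
# selection frequencies over the best models.
# freq:   B x p matrix, one row of selection frequencies per model
# losses: validation loss of each of the B models (positive numbers)
aggregate_frequencies <- function(freq, losses, alpha = 1, wagg = "weights1") {
  B <- nrow(freq)
  n_keep <- ceiling(alpha * B)              # assumption: fraction is rounded up
  keep <- order(losses)[seq_len(n_keep)]    # models with the lowest losses
  if (wagg == "weights1") {
    w <- rep(1 / n_keep, n_keep)            # plain average
  } else {
    w <- 1 / losses[keep]                   # assumption: inverse-loss weights
    w <- w / sum(w)                         # normalize to sum to one
  }
  # Weighted column sums: row i of the kept models is scaled by w[i]
  colSums(w * freq[keep, , drop = FALSE])
}

freq <- rbind(c(0.5, 0.5, 0.0),
              c(1.0, 0.0, 0.0),
              c(0.0, 0.5, 0.5))
losses <- c(1, 2, 4)

agg1      <- aggregate_frequencies(freq, losses, alpha = 1,   wagg = "weights1")
agg2      <- aggregate_frequencies(freq, losses, alpha = 1,   wagg = "weights2")
agg1_half <- aggregate_frequencies(freq, losses, alpha = 0.5, wagg = "weights1")
```

With 'weights1' all kept models count equally, whereas in this sketch 'weights2' gives the model with loss 1 four times the weight of the model with loss 4.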

References

T. Werner. Gradient-Free Gradient Boosting. PhD thesis, Carl von Ossietzky University Oldenburg, 2020.

T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017.

B. Hofner and T. Hothorn. stabs: Stability Selection with Error Control, 2017.

B. Hofner, L. Boccuto, and M. Göker. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144, 2015.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.


[Package gfboost version 0.1.1 Index]