CMB.Stabsel {gfboost}    R Documentation
Loss-adapted Stability Selection
Description
Workhorse function for the Stability Selection variant in which either a grid of thresholds or a grid of cardinalities is given, so that the Boosting models are evaluated on a validation set according to all elements of the respective grid. The model which performs best is finally selected as the stable model.
Usage
CMB.Stabsel(
  Dtrain,
  nsing,
  Bsing = 1,
  B = 100,
  alpha = 1,
  singfam = Gaussian(),
  evalfam = Gaussian(),
  sing = FALSE,
  M = 10,
  m_iter = 100,
  kap = 0.1,
  LS = FALSE,
  best = 1,
  wagg,
  gridtype,
  grid,
  Dvalid,
  ncmb,
  robagg = FALSE,
  lower = 0,
  singcoef = FALSE,
  Mfinal,
  ...
)
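A minimal call might look as follows. This is only a sketch: the simulated data, the subsample sizes, and the "weights1" keyword for wagg are illustrative assumptions, not values taken from the package's own examples.

## Sketch: loss-adapted Stability Selection with the default squared-loss
## families on simulated data (response in the last column of the data frame).
library(gfboost)
library(mboost)                        # provides the Gaussian() family

set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), nrow = n)
Y <- X[, 1] - 2 * X[, 3] + rnorm(n)
D <- data.frame(X, Y)                  # format (X, Y), response last

res <- CMB.Stabsel(
  Dtrain   = D[1:80, ],
  nsing    = 50,                       # rows per SingBoost submodel
  ncmb     = 70,                       # rows used for the CMB step
  wagg     = "weights1",               # assumed row-weight aggregation keyword
  gridtype = "pigrid",
  grid     = seq(0.5, 0.9, by = 0.1),  # candidate thresholds
  Dvalid   = D[81:100, ],
  Mfinal   = 10                        # only needed if singcoef = TRUE
)
res$colind.opt                         # columns of the selected stable model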
Arguments
Dtrain: Data matrix. Has to be an (n x (p+1))-dimensional data frame in the format (X, Y), i.e., with the response in the last column.

nsing: Number of observations (rows) used for the SingBoost submodels.

Bsing: Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to confuse with the parameter B for the CMB models.

B: Number of subsamples based on which the CMB models are validated. Default is 100. Not to confuse with Bsing for the SingBoost models.

alpha: Optional real number in (0,1]. Fraction of the best SingBoost models that enters the CMB aggregation. Default is 1 (all models are used).

singfam: A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is Gaussian() (squared loss).

evalfam: A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is Gaussian() (squared loss).

sing: If sing = FALSE and singfam is a standard Boosting family provided by mboost, the aggregation is based on the corresponding standard Boosting models instead of SingBoost models. Default is FALSE.

M: An integer between 2 and m_iter. Every M-th iteration is a singular iteration. Default is 10.

m_iter: Number of SingBoost iterations. Default is 100.

kap: Learning rate (step size). Must be a real number in (0,1]. Default is 0.1.

LS: If a family that is already provided by mboost is used as singfam, setting LS = TRUE performs the corresponding standard Boosting updates in the singular iterations. Default is FALSE.

best: Needed in the case of localized ranking. The parameter K of the localized ranking loss is computed as best times the number of observations (rounded up). Default is 1.

wagg: Type of row weight aggregation.

gridtype: Choose between 'pigrid' and 'qgrid'.

grid: The grid of thresholds (in (0,1]) if gridtype = 'pigrid', or the grid of model cardinalities (positive integers) if gridtype = 'qgrid'.

Dvalid: Validation data for selecting the optimal element of the grid and, with it, the best corresponding model.

ncmb: Number of samples used for the CMB aggregation step. Must not exceed the number of rows of Dtrain.

robagg: Optional. If robagg = TRUE, the SingBoost models with the very best performances are excluded from the aggregation to avoid possible inlier effects. Only reasonable in combination with lower. Default is FALSE.

lower: Optional argument. Only reasonable when setting robagg = TRUE. A real number in [0,1) giving the fraction of best-performing SingBoost models that is ignored in the aggregation. Default is 0.

singcoef: Default is FALSE; the coefficients of the candidate stable models are then computed by standard linear regression. If TRUE, they are computed by SingBoost.

Mfinal: Optional. Necessary if singcoef = TRUE to determine the frequency of singular iterations when the final coefficients are computed by SingBoost.

...: Optional further arguments.
Details
The Stability Selection in the packages stabs and mboost requires fixing two of the three parameters: the per-family error rate, the threshold, and the number of variables selected in each model. Our Stability Selection is based on a different idea. We also train Boosting models on subsamples, but we use a validation step to determine the size of the optimal model. More precisely, if 'pigrid' is used as gridtype, the corresponding stable model for each threshold is computed by selecting all variables whose aggregated selection frequency exceeds that threshold. These candidate stable models are then validated according to the target loss function (inserted through evalfam) and the optimal one is finally selected. If 'qgrid' is used as gridtype, a vector of positive integers has to be entered instead of a vector of thresholds. For each integer q in the grid, the candidate stable model consists of the q variables with the highest aggregated selection frequencies. The validation step is the same.
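As a rough illustration of the two grid types (this is not code from the package, and the frequencies are invented), candidate stable models can be formed from a vector of aggregated selection frequencies as follows:

## Illustration only: candidate stable models from aggregated selection
## frequencies (made-up values).
aggnu <- c(x1 = 0.90, x2 = 0.15, x3 = 0.60, x4 = 0.05, x5 = 0.45)

## 'pigrid': one candidate model per threshold, keeping all variables whose
## aggregated selection frequency exceeds the threshold
pigrid  <- c(0.4, 0.5, 0.8)
cand_pi <- lapply(pigrid, function(thr) which(aggnu > thr))

## 'qgrid': one candidate model per cardinality q, keeping the q variables
## with the highest aggregated selection frequencies
qgrid  <- 1:3
cand_q <- lapply(qgrid, function(q) order(aggnu, decreasing = TRUE)[seq_len(q)])

## Each candidate model is then validated on Dvalid with the loss of evalfam,
## and the best-performing one is returned as the stable model.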
Value
colind.opt: The column numbers of the variables that form the best stable model, as a vector.

coeff.opt: The coefficients corresponding to the optimal stable model, as a vector.

aggnu: Aggregated empirical column measure (i.e., selection frequencies), as a vector.

aggzeta: Aggregated empirical row measure (i.e., row weights), as a vector.
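For orientation, assuming res holds the return value of a call such as the sketch after the Usage section, the components can be inspected like this:

res$colind.opt                            # indices of the stable model's columns
res$coeff.opt                             # its coefficient vector
head(sort(res$aggnu, decreasing = TRUE))  # most frequently selected variables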
References
T. Werner. Gradient-Free Gradient Boosting. PhD thesis, Carl von Ossietzky University Oldenburg, 2020.
T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017.
B. Hofner and T. Hothorn. stabs: Stability Selection with Error Control, 2017.
B. Hofner, L. Boccuto, and M. Göker. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144, 2015.
N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.