CMB3S {gfboost}

R Documentation

Column Measure Boosting with SingBoost and Stability Selection (CMB-3S)

Description

Executes CMB and the loss-based Stability Selection.

Usage

CMB3S(
  Dtrain,
  nsing,
  Bsing = 1,
  B = 100,
  alpha = 1,
  singfam = Gaussian(),
  evalfam = Gaussian(),
  sing = FALSE,
  M = 10,
  m_iter = 100,
  kap = 0.1,
  LS = FALSE,
  best = 1,
  wagg,
  gridtype,
  grid,
  Dvalid,
  ncmb,
  robagg = FALSE,
  lower = 0,
  singcoef = FALSE,
  Mfinal = 10,
  ...
)

Arguments

`Dtrain`	Data matrix. Has to be an `n \times (p+1)-`dimensional data frame in the format `(X,Y)`. The `X-`part must not contain an intercept column containing only ones since this column will be added automatically.
`nsing`	Number of observations (rows) used for the SingBoost submodels.
`Bsing`	Number of subsamples based on which the SingBoost models are validated. Default is 1. Not to be confused with the parameter `B` for the Stability Selection.
`B`	Number of subsamples based on which the CMB models are validated. Default is 100. Not to be confused with `Bsing` for CMB.
`alpha`	Optional real number in `]0,1]`. Defines the fraction of best SingBoost models used in the aggregation step. Default is 1 (use all models).
`singfam`	A SingBoost family. The SingBoost models are trained based on the corresponding loss function. Default is `Gaussian()` (squared loss).
`evalfam`	A SingBoost family. The SingBoost models are validated according to the corresponding loss function. Default is `Gaussian()` (squared loss).
`sing`	If `sing=FALSE` and the `singfam` family is a standard Boosting family that is contained in the package `mboost`, the CMB aggregation procedure is executed for the corresponding standard Boosting models.
`M`	An integer between 2 and `m_iter`. Indicates that in every `M-`th iteration, a singular iteration will be performed. Default is 10.
`m_iter`	Number of SingBoost iterations. Default is 100.
`kap`	Learning rate (step size). Must be a real number in `]0,1]`. Default is 0.1. It is recommended to use a value smaller than 0.5.
`LS`	If the `singfam` family is one that is already provided by `mboost`, the respective Boosting algorithm will be performed in the singular iterations if `LS` is set to `TRUE`. Default is `FALSE`.
`best`	Needed in the case of localized ranking. The parameter `K` of the localized ranking loss will be computed by `best \cdot n` (rounded to the next larger integer). Warning: If a parameter `K` is inserted into the `LocRank` family, it will be ignored when executing SingBoost.
`wagg`	Type of row weight aggregation. `'weights1'` indicates that the selection frequencies of the (best) SingBoost models are averaged. `'weights2'` respects the validation losses for each model and downweights the ones with higher validation losses.
`gridtype`	Type of the grid. Choose between `'pigrid'` (grid of thresholds) and `'qgrid'` (grid of numbers of final variables).
`grid`	The grid for the thresholds (in `]0,1]`) or the numbers of final variables (positive integers).
`Dvalid`	Validation data for selecting the optimal element of the grid and with it the best corresponding model.
`ncmb`	Number of samples used for `CMB`. Integer that must be smaller than the number of samples in `Dtrain`.
`robagg`	Optional. If `robagg=TRUE`, the best SingBoost models are ignored when executing the aggregation to avoid inlier effects. Only reasonable in combination with `lower`.
`lower`	Optional argument. Only reasonable when setting `robagg=TRUE`. `lower` is a real number in `[0,1[` (a rather small number is recommended) and indicates that the aggregation ignores the SingBoost models with the best performances to avoid possible inlier effects.
`singcoef`	Default is `FALSE`. Then the coefficients for the candidate stable models are computed by standard linear regression (provided that the number of columns is smaller than the number of samples in the training set for each grid element). If set to `TRUE`, the coefficients are computed by SingBoost.
`Mfinal`	Optional. Necessary if `singcoef=TRUE` to determine the frequency of singular iterations in the SingBoost models.
`...`	Optional further arguments
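
The constraints on `Dtrain`, `alpha`, `kap`, `M` and `best` described above can be checked before calling `CMB3S`. The following is a minimal base-R sketch, not part of gfboost: the helper `check_cmb3s_args` is a hypothetical name, and `n = 120` in the last line is chosen only for illustration.

```r
# Build a data frame in the (X, Y) layout: predictors first, response last,
# and no intercept column of ones.
X <- model.matrix(Sepal.Length ~ ., iris)[, -1]   # drop the intercept column
D <- data.frame(X, Y = iris$Sepal.Length)

# Hypothetical helper (NOT part of gfboost): checks the documented ranges.
check_cmb3s_args <- function(D, alpha = 1, M = 10, m_iter = 100, kap = 0.1) {
  Xpart <- D[, -ncol(D), drop = FALSE]
  is_ones <- vapply(Xpart, function(col) all(col == 1), logical(1))
  stopifnot(
    !any(is_ones),            # X must not contain an intercept column
    alpha > 0, alpha <= 1,    # alpha in ]0,1]
    kap > 0, kap <= 1,        # kap in ]0,1]
    M >= 2, M <= m_iter       # M between 2 and m_iter
  )
  invisible(TRUE)
}

check_cmb3s_args(D, alpha = 0.5, M = 10, m_iter = 100, kap = 0.1)

# K for the localized ranking loss: best * n, rounded up to the next integer.
# best = 0.25 and n = 120 are illustrative values.
K <- ceiling(0.25 * 120)
```

The helper stops with an error as soon as one condition is violated, which makes it easier to locate a wrong argument than an error raised deep inside the fitting routine.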

Details

See CMB and CMB.Stabsel.
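
The aggregation controlled by `alpha` and `wagg` can be illustrated with a small base-R sketch. It is not the gfboost implementation: the helper `aggregate_freqs` is a hypothetical name, rounding `alpha` times the number of models up is an assumption, and for `'weights2'` the inverse-loss weights are an assumption too, since this page only states that models with higher validation losses are downweighted.

```r
# Hypothetical helper (NOT part of gfboost).
# freqs:  matrix with one row of selection frequencies per SingBoost model
# losses: validation loss of each model (smaller is better)
aggregate_freqs <- function(freqs, losses, alpha = 1, wagg = "weights1") {
  stopifnot(nrow(freqs) == length(losses), alpha > 0, alpha <= 1)
  n_best <- ceiling(alpha * length(losses))   # ASSUMPTION: round up
  keep <- order(losses)[seq_len(n_best)]      # indices of the best models
  if (wagg == "weights1") {
    w <- rep(1 / n_best, n_best)              # plain average
  } else {
    w <- 1 / losses[keep]                     # ASSUMPTION: inverse-loss weights
    w <- w / sum(w)
  }
  # Row i of the kept models is scaled by w[i], then columns are summed.
  colSums(freqs[keep, , drop = FALSE] * w)
}

# Illustrative inputs: 4 models, 3 columns.
freqs <- rbind(c(1.0, 0.0, 0.5),
               c(0.0, 1.0, 0.5),
               c(1.0, 1.0, 0.0),
               c(0.5, 0.5, 0.5))
losses <- c(0.2, 0.4, 0.1, 0.3)

agg1 <- aggregate_freqs(freqs, losses, alpha = 0.5, wagg = "weights1")
agg2 <- aggregate_freqs(freqs, losses, alpha = 0.5, wagg = "weights2")
```

With `alpha = 0.5` only the two models with the lowest validation losses (rows 3 and 1) enter the aggregate; under the assumed `'weights2'` scheme, row 3 receives twice the weight of row 1.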

Value

`Final coefficients`	The coefficients corresponding to the optimal stable model as a vector.
`Stable column measure`	Aggregated empirical column measure (i.e., selection frequencies) as a vector.
`Selected columns`	The column numbers of the variables that form the best stable model as a vector.
`Used row measure`	Aggregated empirical row measure (i.e., row weights) as a vector.
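
Because the element names contain spaces, they cannot be written directly after `$`. The sketch below assumes that the return value is a named list with exactly the names listed above; the list `res` is a placeholder with made-up values, not real `CMB3S` output.

```r
# Placeholder list mimicking the documented element names; the values are
# invented for illustration only.
res <- list(
  "Final coefficients"    = c(Intercept = 2.1, Petal.Length = 0.5),
  "Stable column measure" = c(0.2, 0.9, 0.4),
  "Selected columns"      = 2L,
  "Used row measure"      = rep(1 / 4, 4)
)

res[["Final coefficients"]]   # exact name via [[ ]]
res$Fin                       # partial matching with $, as used in the Examples
res$`Selected columns`        # backticks allow the full name after $
```

Partial matching with `$` only works while the abbreviation is unambiguous; `[[ ]]` with the full name is the safer choice in scripts.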

References

T. Werner. Gradient-Free Gradient Boosting. PhD Thesis, Carl von Ossietzky University Oldenburg, 2020.

T. Hothorn, P. Bühlmann, T. Kneib, M. Schmid, and B. Hofner. mboost: Model-Based Boosting, 2017.

B. Hofner and T. Hothorn. stabs: Stability Selection with Error Control, 2017.

B. Hofner, L. Boccuto, and M. Göker. Controlling false discoveries in high-dimensional situations: Boosting with stability selection. BMC Bioinformatics, 16(1):144, 2015.

N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.

Examples

library(gfboost)
library(mboost)
firis<-as.formula(Sepal.Length~.)
Xiris<-model.matrix(firis,iris)
Diris<-data.frame(Xiris[,-1],iris$Sepal.Length)
colnames(Diris)[6]<-"Y"
set.seed(19931023)
ind<-sample(1:150,120,replace=FALSE)
Dtrain<-Diris[ind,]
Dvalid<-Diris[-ind,]
set.seed(19931023)
cmb3s<-CMB3S(Dtrain,nsing=120,Dvalid=Dvalid,ncmb=120,Bsing=1,B=1,alpha=1,singfam=Gaussian(),
evalfam=Gaussian(),sing=FALSE,M=10,m_iter=100,kap=0.1,LS=FALSE,wagg='weights1',
gridtype='pigrid',grid=seq(0.8,0.9,1),robagg=FALSE,lower=0,singcoef=TRUE,Mfinal=10)
cmb3s$Fin
cmb3s$Stab
cmb3s$Sel
glmres4<-glmboost(Sepal.Length~.,iris[ind,])
coef(glmres4)
set.seed(19931023)
cmb3s1<-CMB3S(Dtrain,nsing=80,Dvalid=Dvalid,ncmb=100,Bsing=10,B=100,alpha=0.5,singfam=Gaussian(),
evalfam=Gaussian(),sing=FALSE,M=10,m_iter=100,kap=0.1,LS=FALSE,wagg='weights1',gridtype='pigrid',
grid=seq(0.8,0.9,1),robagg=FALSE,lower=0,singcoef=TRUE,Mfinal=10)
cmb3s1$Fin
cmb3s1$Stab
## This may take around a minute
set.seed(19931023)
cmb3s2<-CMB3S(Dtrain,nsing=80,Dvalid=Dvalid,ncmb=100,Bsing=10,B=100,alpha=0.5,singfam=Rank(),
evalfam=Rank(),sing=TRUE,M=10,m_iter=100,kap=0.1,LS=TRUE,wagg='weights2',gridtype='pigrid',
grid=seq(0.8,0.9,1),robagg=FALSE,lower=0,singcoef=TRUE,Mfinal=10)
cmb3s2$Fin
cmb3s2$Stab
set.seed(19931023)
cmb3s3<-CMB3S(Dtrain,nsing=80,Dvalid=Dvalid,ncmb=100,Bsing=10,B=100,alpha=0.5,singfam=Huber(),
evalfam=Huber(),sing=FALSE,M=10,m_iter=100,kap=0.1,LS=FALSE,wagg='weights2',gridtype='pigrid',
grid=seq(0.8,0.9,1),robagg=FALSE,lower=0,singcoef=FALSE,Mfinal=10)
cmb3s3$Fin
cmb3s3$Stab
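
A note on the `grid` argument used in the calls above: in base R, the third positional argument of `seq` is the step width `by`, so `seq(0.8,0.9,1)` yields a single threshold. The sketch below shows this and, as an illustrative alternative that does not appear in the original examples, a grid with several thresholds; the step `0.05` is an arbitrary choice.

```r
# Grid as used in the examples: from = 0.8, to = 0.9, by = 1.
# The step exceeds the range, so only the starting value remains.
grid_single <- seq(0.8, 0.9, 1)
grid_single

# Illustrative alternative: several thresholds in ]0,1].
grid_multi <- seq(0.8, 0.9, by = 0.05)
grid_multi
```

A grid with more than one element gives the validation step on `Dvalid` several candidate stable models to choose from, at the cost of additional computation.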

[Package gfboost version 0.1.1 Index]