stabsel {stabs}    R Documentation
Stability Selection
Description
Selection of influential variables or model components with error control.
Usage
## generic stability selection function
stabsel(x, ...)
## a method to fit models with stability selection
## S3 method for class 'matrix'
stabsel(x, y, fitfun = lars.lasso,
args.fitfun = list(), cutoff, q, PFER,
folds = subsample(rep(1, nrow(x)), B = B),
B = ifelse(sampling.type == "MB", 100, 50),
assumption = c("unimodal", "r-concave", "none"),
sampling.type = c("SS", "MB"),
papply = mclapply, mc.preschedule = FALSE,
verbose = TRUE, FWER, eval = TRUE, ...)
## essentially a wrapper for data.frames (see details)
## S3 method for class 'data.frame'
stabsel(x, y, intercept = FALSE, ...)
Arguments
x: a matrix containing the predictors or, for the data.frame method, a data.frame of predictors.

y: a vector or matrix containing the outcome.

intercept: logical. If TRUE, an explicit intercept is added to the model matrix (data.frame method only).

fitfun: a function that takes the arguments x, y and q and returns the selected variables; see fitfun for predefined fitting functions.

args.fitfun: a named list containing additional arguments that are passed to the fitting function; see also argument fitfun.

cutoff: cutoff between 0.5 and 1. Preferably a value between 0.6 and 0.9 should be used.

q: number of (unique) selected variables (or groups of variables, depending on the model) that are selected on each subsample.

PFER: upper bound for the per-family error rate. This specifies the number of falsely selected base-learners that is tolerated. See details.

folds: a weight matrix with number of rows equal to the number of observations; see subsample. Usually the default should be kept, as subsampling with a fraction of 1/2 is required for the error bounds to hold.

assumption: defines the type of assumptions on the distributions of the selection probabilities and simultaneous selection probabilities. Only applicable for sampling.type = "SS"; for sampling.type = "MB", "none" is used.

sampling.type: use the sampling scheme of Shah & Samworth (2013), i.e., with complementary pairs (sampling.type = "SS"), or the original sampling scheme of Meinshausen & Buehlmann (2010) (sampling.type = "MB").

B: number of subsampling replicates. Per default, we use 50 complementary pairs for the error bounds of Shah & Samworth (2013) and 100 subsamples for the error bound derived in Meinshausen & Buehlmann (2010). As complementary pairs are used in the former case, this also amounts to 100 subsamples.

papply: (parallel) apply function, defaults to mclapply; use papply = lapply for sequential computations.

mc.preschedule: preschedule tasks if papply = mclapply? For details see mclapply.

verbose: logical (default: TRUE); if TRUE, additional information and warnings are printed.

FWER: deprecated. Only for compatibility with older versions; use PFER instead.

eval: logical. Determines whether stability selection is evaluated (eval = TRUE; default) or if only the parameter combination itself is returned.

...: additional arguments passed to the (parallel) apply function (e.g., mclapply) and to the fitting function fitfun.
Details
This function implements the stability selection procedure by Meinshausen and Buehlmann (2010) and the improved error bounds by Shah and Samworth (2013). For details see also Hofner et al. (2015). The error bounds are implemented in the function stabsel_parameters. Two of the three arguments cutoff, q and PFER must be specified. The per-family error rate (PFER), i.e., the expected number of false positives E(V), where V is the number of false positives, is bounded by the argument PFER.

As controlling the PFER is more conservative than controlling the family-wise error rate (FWER), the procedure also controls the FWER, i.e., the probability of selecting at least one non-influential variable (or model component) is smaller than PFER.
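For example, the implied third parameter can be inspected before any model is fitted. The following minimal sketch (with purely illustrative values for p, q and PFER) asks stabsel_parameters for the cutoff implied by a given q and PFER:

## minimal sketch: cutoff implied by q and PFER for p = 20 candidate
## variables (illustrative values only)
library("stabs")
stabsel_parameters(p = 20, q = 3, PFER = 1,
                   sampling.type = "SS", assumption = "unimodal")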
Predefined fitfun functions exist, but more can easily be implemented. Note that stepwise regression methods are usually not advised as they tend to be relatively unstable. See the example below.
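A user-defined fitting function can also be supplied. The sketch below is illustrative only and assumes the interface documented in fitfun, i.e., arguments x, y and q and a return value that is a named list with a logical vector selected (the optional path element is omitted here):

## sketch of a user-defined fitfun (illustrative; see ?fitfun for the
## exact interface): select the q predictors with the largest absolute
## marginal correlation with the outcome
marginal.cor <- function(x, y, q, ...) {
    score <- abs(cor(x, y))
    selected <- logical(ncol(x))
    selected[order(score, decreasing = TRUE)[seq_len(q)]] <- TRUE
    names(selected) <- colnames(x)
    list(selected = selected)
}
## hypothetical toy data, just to show the call
set.seed(1)
X <- matrix(rnorm(100 * 8), ncol = 8,
            dimnames = list(NULL, paste0("V", 1:8)))
stabsel(x = X, y = rnorm(100), fitfun = marginal.cor, q = 2, PFER = 1)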
The function stabsel for data.frames is essentially just a wrapper to the matrix method with the same arguments. The only difference is that, in a pre-processing step, the data set is converted to a model matrix using the function model.matrix. The additional argument intercept determines if an explicit intercept should be added to the model matrix. This is often not necessary but depends on the fitfun.
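As an illustration of the data.frame method (hypothetical data; lars.lasso from the lars package is used as fitting function), a factor predictor is expanded into dummy variables by model.matrix before the matrix method is called:

## sketch: stability selection directly on a data.frame with a factor
## predictor (hypothetical data; requires the lars package)
set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100),
                  grp = factor(sample(c("a", "b", "c"), 100, replace = TRUE)))
if (require("lars"))
    stabsel(x = dat, y = rnorm(100), fitfun = lars.lasso,
            cutoff = 0.75, PFER = 1)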
Value
An object of class stabsel with a special print method. The object has the following elements:

phat: selection probabilities.

selected: elements with maximal selection probability greater than the cutoff.

max: maximum of selection probabilities.

cutoff: cutoff used.

q: average number of selected variables used.

PFER: (realized) upper bound for the per-family error rate.

specifiedPFER: specified upper bound for the per-family error rate.

p: the number of effects subject to selection.

B: the number of subsamples.

sampling.type: the sampling type used for stability selection.

assumption: the assumptions made on the selection probabilities.

call: the call.
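These elements can be accessed directly from the returned object. A small illustrative sketch (random data; requires the lars package; object and variable names are hypothetical):

## sketch: accessing elements of a fitted stabsel object
if (require("lars")) {
    set.seed(42)
    X <- matrix(rnorm(50 * 10), ncol = 10)
    stab <- stabsel(x = X, y = rnorm(50), fitfun = lars.lasso,
                    cutoff = 0.75, PFER = 1)
    stab$selected   # elements with selection probability above the cutoff
    stab$max        # maximal selection probabilities
    stab$PFER       # (realized) upper bound for the per-family error rate
}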
References
B. Hofner, L. Boccuto and M. Goeker (2015), Controlling false
discoveries in high-dimensional situations: Boosting with stability
selection. BMC Bioinformatics, 16:144.
doi: 10.1186/s12859-015-0575-3.
N. Meinshausen and P. Buehlmann (2010), Stability selection. Journal of the Royal Statistical Society, Series B, 72, 417–473.
R.D. Shah and R.J. Samworth (2013), Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society, Series B, 75, 55–80.
See Also
stabsel_parameters for the computation of error bounds, stabsel.stabsel for the fast re-computation of parameters of a fitted stabsel object, fitfun for available fitting functions, and plot.stabsel for available plot functions.
Examples
if (require("TH.data")) {
## make data set available
data("bodyfat", package = "TH.data")
} else {
  ## simulate some data if TH.data is not available.
  ## Note that results are nonsense with this data.
  ## Columns are named and the matrix converted to a data.frame so that
  ## the later examples (e.g., DEXfat ~ .) also run with simulated data.
  bodyfat <- matrix(rnorm(720), nrow = 72, ncol = 10)
  colnames(bodyfat) <- c("age", "DEXfat", paste0("X", 1:8))
  bodyfat <- as.data.frame(bodyfat)
}
## set seed
set.seed(1234)
####################################################################
### using stability selection with Lasso methods:
if (require("lars")) {
(stab.lasso <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.lasso, cutoff = 0.75,
PFER = 1))
(stab.stepwise <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = lars.stepwise, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.lasso, main = "Lasso")
plot(stab.stepwise, main = "Stepwise Selection")
## --> stepwise selection seems to be quite unstable even in this low
## dimensional example!
}
## set seed (again to make results comparable)
set.seed(1234)
if (require("glmnet")) {
(stab.glmnet <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso, cutoff = 0.75,
PFER = 1))
par(mfrow = c(2, 1))
plot(stab.glmnet, main = "Lasso (glmnet)")
if (exists("stab.lasso"))
plot(stab.lasso, main = "Lasso (lars)")
}
## Select variables with maximum coefficients based on lasso estimate
set.seed(1234) # reset seed
if (require("glmnet")) {
## use cross-validated lambda
lambda.min <- cv.glmnet(x = as.matrix(bodyfat[, -2]), y = bodyfat[,2])$lambda.min
(stab.maxCoef <- stabsel(x = bodyfat[, -2], y = bodyfat[,2],
fitfun = glmnet.lasso_maxCoef,
# specify additional parameters to fitfun
args.fitfun = list(lambda = lambda.min),
cutoff = 0.75, PFER = 1))
## WARNING: Using a fixed penalty (lambda) is usually not permitted and
## not sensible. See ?fitfun for details.
## now compare standard lasso with "maximal parameter estimates" from lasso
par(mfrow = c(2, 1))
plot(stab.maxCoef, main = "Lasso (glmnet; Maximum Coefficients)")
plot(stab.glmnet, main = "Lasso (glmnet)")
## --> very different results.
}
####################################################################
### using stability selection directly on computed boosting models
### from mboost
if (require("mboost")) {
### low-dimensional example
mod <- glmboost(DEXfat ~ ., data = bodyfat)
## compute cutoff ahead of running stabsel to see if it is a sensible
## parameter choice.
## p = ncol(bodyfat) - 1 (outcome) + 1 (intercept)
stabsel_parameters(q = 3, PFER = 1, p = ncol(bodyfat) - 1 + 1,
sampling.type = "MB")
## the same:
stabsel(mod, q = 3, PFER = 1, sampling.type = "MB", eval = FALSE)
### The following code is not tested by default on CRAN as it takes some time to run:
## now run stability selection
(sbody <- stabsel(mod, q = 3, PFER = 1, sampling.type = "MB"))
opar <- par(mai = par("mai") * c(1, 1, 1, 2.7))
plot(sbody)
par(opar)
plot(sbody, type = "maxsel", ymargin = 6)
}