cvrisk {mboost} | R Documentation |
Cross-Validation
Description
Cross-validated estimation of the empirical risk for hyper-parameter selection.
Usage
## S3 method for class 'mboost'
cvrisk(object, folds = cv(model.weights(object)),
grid = 0:mstop(object),
papply = mclapply,
fun = NULL, mc.preschedule = FALSE, ...)
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)
## Plot cross-valiation results
## S3 method for class 'cvrisk'
plot(x,
xlab = "Number of boosting iterations", ylab = attr(x, "risk"),
ylim = range(x), main = attr(x, "type"), ...)
Arguments
object |
an object of class |
folds |
a weight matrix with number of rows equal to the number
of observations. The number of columns corresponds to
the number of cross-validation runs. Can be computed
using function |
grid |
a vector of stopping parameters the empirical risk is to be evaluated for. |
papply |
(parallel) apply function, defaults to |
fun |
if |
mc.preschedule |
preschedule tasks if are parallelized using |
weights |
a numeric vector of weights for the model to be cross-validated. |
type |
character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented. |
B |
number of folds, per default 25 for |
prob |
percentage of observations to be included in the learning samples for subsampling. |
strata |
a factor of the same length as |
x |
an object of class |
xlab , ylab |
axis labels. |
ylim |
limits of y-axis. |
main |
main title of graphic. |
... |
Details
The number of boosting iterations is a hyper-parameter of the
boosting algorithms implemented in this package. Honest,
i.e., cross-validated, estimates of the empirical risk
for different stopping parameters mstop
are computed by
this function which can be utilized to choose an appropriate
number of boosting iterations to be applied.
Different forms of cross-validation can be applied, for example
10-fold cross-validation or bootstrapping. The weights (zero weights
correspond to test cases) are defined via the folds
matrix.
cvrisk
runs in parallel on OSes where forking is possible
(i.e., not on Windows) and multiple cores/processors are available.
The scheduling
can be changed by the corresponding arguments of
mclapply
(via the dot arguments).
The function cv
can be used to build an appropriate
weight matrix to be used with cvrisk
. If strata
is defined
sampling is performed in each stratum separately thus preserving
the distribution of the strata
variable in each fold.
There exist various functions to display and work with
cross-validation results. One can print
and plot
(see above)
results and extract the optimal iteration via mstop
.
Value
An object of class cvrisk
(when fun
wasn't specified), basically a matrix
containing estimates of the empirical risk for a varying number
of bootstrap iterations. plot
and print
methods
are available as well as a mstop
method.
References
Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2006), The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675–699.
Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012). The
importance of knowing when to stop - a sequential stopping rule for
component-wise gradient boosting. Methods of Information in
Medicine, 51, 178–186.
DOI: doi:10.3414/ME11-02-0030
See Also
AIC.mboost
for
AIC
based selection of the stopping iteration. Use mstop
to extract the optimal stopping iteration from cvrisk
object.
Examples
data("bodyfat", package = "TH.data")
### fit linear model to data
model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)
### AIC-based selection of number of boosting iterations
maic <- AIC(model)
maic
### inspect coefficient path and AIC-based stopping criterion
par(mai = par("mai") * c(1, 1, 1, 1.8))
plot(model)
abline(v = mstop(maic), col = "lightgray")
### 10-fold cross-validation
cv10f <- cv(model.weights(model), type = "kfold")
cvm <- cvrisk(model, folds = cv10f, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### 25 bootstrap iterations (manually)
set.seed(290875)
n <- nrow(bodyfat)
bs25 <- rmultinom(25, n, rep(1, n)/n)
cvm <- cvrisk(model, folds = bs25, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### same by default
set.seed(290875)
cvrisk(model, papply = lapply)
### 25 bootstrap iterations (using cv)
set.seed(290875)
bs25_2 <- cv(model.weights(model), type="bootstrap")
all(bs25 == bs25_2)
## Not run:
############################################################
## Do not run this example automatically as it takes
## some time (~ 5 seconds depending on the system)
### trees
blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
cvtree <- cvrisk(blackbox, papply = lapply)
plot(cvtree)
## End(Not run this automatically)
## End(Not run)
### cvrisk in parallel modes:
## Not run:
## at least not automatically
## parallel::mclapply() which is used here for parallelization only runs
## on unix systems (here we use 2 cores)
cvrisk(model, mc.cores = 2)
## infrastructure needs to be set up in advance
cl <- makeCluster(25) # e.g. to run cvrisk on 25 nodes via PVM
myApply <- function(X, FUN, ...) {
myFun <- function(...) {
library("mboost") # load mboost on nodes
FUN(...)
}
## further set up steps as required
parLapply(cl = cl, X, myFun, ...)
}
cvrisk(model, papply = myApply)
stopCluster(cl)
## End(Not run)