cvrisk.mboostLSS {gamboostLSS} | R Documentation |
Cross-Validation
Description
Multidimensional cross-validated estimation of the empirical risk for hyper-parameter selection.
Usage
## S3 method for class 'mboostLSS'
cvrisk(object, folds = cv(model.weights(object)),
grid = make.grid(mstop(object)), papply = mclapply,
trace = TRUE, mc.preschedule = FALSE, fun = NULL, ...)
make.grid(max, length.out = 10, min = NULL, log = TRUE,
dense_mu_grid = TRUE)
## S3 method for class 'nc_mboostLSS'
cvrisk(object, folds = cv(model.weights(object)),
grid = 1:sum(mstop(object)), papply = mclapply,
trace = TRUE, mc.preschedule = FALSE, fun = NULL, ...)
## S3 method for class 'cvriskLSS'
plot(x, type = c("heatmap", "lines"),
xlab = NULL, ylab = NULL, ylim = range(x),
main = attr(x, "type"), ...)
## S3 method for class 'nc_cvriskLSS'
plot(x, xlab = "Number of boosting iterations", ylab = NULL,
ylim = range(x), main = attr(x, "type"), ...)
Arguments
object |
an object of class mboostLSS (cyclic fitting) or nc_mboostLSS (non-cyclic fitting). |
folds |
a weight matrix with number of rows equal to the number of
observations. The number of columns corresponds to the number of
cross-validation runs. Can be computed using function
cv. |
grid |
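If the model was fitted with method = "cyclic", a grid of stopping values with one row per combination of parameter-specific mstop values, as created by make.grid. Otherwise (i.e., for method = "noncyclic"), a vector of stopping values. |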
papply |
(parallel) apply function, defaults to mclapply. |
trace |
should status information be printed during cross-validation?
Default: TRUE. |
mc.preschedule |
preschedule tasks if they are parallelized using mclapply? Default: FALSE. |
fun |
if |
... |
additional arguments passed to |
max |
a named vector of length equal to the number of parameters of the GAMLSS
family (and names equal to the names of |
length.out |
the number of grid points (default: 10). This can be either a vector
of the same length as max or a single value. |
min |
minimal value of the grid. Per default the grid starts at 1 but
other values (smaller than max) can be specified. |
log |
should the grid be on a logarithmic scale? Default: TRUE. |
dense_mu_grid |
should the grid in the mu direction be dense? Default: TRUE. |
x |
an object of class cvriskLSS or nc_cvriskLSS. |
type |
should a "heatmap" (default) or "lines" be plotted? |
xlab , ylab |
user-specified labels for the x-axis and y-axis of the plot (which
are usually not needed). The defaults depend on the plot type. |
ylim |
limits of the y-axis. Only applicable for the line plot. |
main |
a title for the plots. |
Details
The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. Honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters mstop are computed by this function; they can be used to choose an appropriate number of boosting iterations. For details see cvrisk.mboost.
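To make the shape of the folds argument concrete, the following base R sketch builds a bootstrap-style weight matrix by hand. It is an illustration only: the sizes n and B are hypothetical, and in practice the matrix would normally come from cv in package mboost rather than from this code.

```r
## Sketch only: a bootstrap-style weight matrix built with base R.
set.seed(1)
n <- 20          # number of observations (hypothetical)
B <- 5           # number of cross-validation runs (hypothetical)
w <- rep(1, n)   # model weights, here all equal to one

## Each column counts how often an observation is drawn into the
## training sample of that run; a count of 0 leaves the observation
## out of the training sample for that run.
folds <- rmultinom(B, size = n, prob = w / sum(w))

dim(folds)       # one row per observation, one column per run
colSums(folds)   # each run resamples n observations in total
```

Each column of such a matrix defines one resampling run, which matches the description of folds in the Arguments section.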
make.grid eases the creation of equidistant, integer-valued grids, which can be used with cvrisk. Per default, the grid is equidistant on a logarithmic scale.
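The idea of an integer-valued grid that is equidistant on a logarithmic scale can be sketched in base R as follows. This is not the code used inside make.grid; the helper log_seq and the object names are hypothetical, and the result only mimics the non-dense case.

```r
## Sketch only: not the implementation of make.grid().
max_mstop <- c(mu = 100, sigma = 100)   # hypothetical maximum values
length_out <- 5                         # grid points per parameter

## Equidistant on the log scale, then rounded to integers;
## unique() drops duplicates that rounding may create.
log_seq <- function(max, length.out, min = 1) {
    unique(round(exp(seq(log(min), log(max), length.out = length.out))))
}

## All combinations of the parameter-specific sequences,
## one row per combination.
grid_sketch <- expand.grid(lapply(max_mstop, log_seq,
                                  length.out = length_out))
grid_sketch
```

On a logarithmic scale the grid points are concentrated at small stopping values, so early iterations are examined more closely than late ones.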
The line plot depicts the average risk for each grid point and additionally shows information on the variability of the risk from fold to fold. The heatmap shows only the average risk but in a nicer fashion.
For models fitted with method = "noncyclic", only the line plot exists.
Hofner et al. (2016) provide a detailed description of cross-validation for gamboostLSS models and show a worked example. Thomas et al. (2018) compare cross-validation for the cyclic and non-cyclic boosting approaches and provide worked examples.
Value
An object of class cvriskLSS or nc_cvriskLSS for cyclic and non-cyclic fitting, respectively (when fun wasn't specified): basically a matrix containing estimates of the empirical risk for a varying number of boosting iterations. plot and print methods are available as well as an mstop method.
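How a matrix of risk estimates can be turned into a choice of stopping values is sketched below with toy numbers. The layout (one row per fold, one column per grid point) and the function toy_risk are assumptions made for illustration; the values are not output of cvrisk, and the mstop method should be used on real results.

```r
## Sketch only: toy numbers, not output of cvrisk().
grid <- expand.grid(mu = c(1, 10, 30, 100), sigma = c(1, 10, 30, 100))

## Hypothetical risk surface with its minimum at mu = 30, sigma = 10,
## shifted by a fold-specific offset to mimic fold-to-fold variability.
toy_risk <- function(offset) {
    (log(grid$mu) - log(30))^2 + (log(grid$sigma) - log(10))^2 + offset
}
risk <- t(sapply(c(0.1, 0.2, 0.3), toy_risk))   # 3 folds x 16 grid points

avg_risk <- colMeans(risk)             # average risk over folds
best <- grid[which.min(avg_risk), ]    # grid point with smallest average
best
```

Averaging over folds before taking the minimum corresponds to what the line plot and the heatmap display for each grid point.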
References
B. Hofner, A. Mayr, M. Schmid (2016). gamboostLSS: An R Package for Model Building and Variable Selection in the GAMLSS Framework. Journal of Statistical Software, 74(1), 1-31.
Available as vignette("gamboostLSS_Tutorial").
Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2018),
Gradient boosting for distributional regression - faster tuning and improved
variable selection via noncyclical updates.
Statistics and Computing. 28: 673-687.
doi:10.1007/s11222-017-9754-6
(Preliminary version: https://arxiv.org/abs/1611.10171).
See Also
cvrisk.mboost and cv (both in package mboost).
Examples
## Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)
## linear model with y ~ . for both components: 100 boosting iterations
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
control = boost_control(mstop = 100),
center = TRUE)
## set up a grid
grid <- make.grid(mstop(model), length.out = 5, dense_mu_grid = FALSE)
plot(grid)
### Do not test the following code per default on CRAN as it takes some time to run:
### a tiny toy example (5-fold bootstrap with maximum stopping value 100)
## (to run it on multiple cores of a Linux or Mac OS computer,
## set papply = mclapply (the default) and set mc.cores to the
## appropriate number of cores)
cvr <- cvrisk(model, folds = cv(model.weights(model), B = 5),
papply = lapply, grid = grid)
cvr
## plot the results
par(mfrow = c(1, 2))
plot(cvr)
plot(cvr, type = "lines")
## extract optimal mstop (here: grid too small)
mstop(cvr)
### END (don't test automatically)
### Do not test the following code per default on CRAN as it takes some time to run:
### a more realistic example
grid <- make.grid(c(mu = 400, sigma = 400), dense_mu_grid = FALSE)
plot(grid)
cvr <- cvrisk(model, grid = grid)
mstop(cvr)
## set model to optimal values:
mstop(model) <- mstop(cvr)
### END (don't test automatically)
### Other grids:
plot(make.grid(mstop(model), length.out = 3, dense_mu_grid = FALSE))
plot(make.grid(c(mu = 400, sigma = 400), log = FALSE, dense_mu_grid = FALSE))
plot(make.grid(c(mu = 400, sigma = 400), length.out = 4,
min = 100, log = FALSE, dense_mu_grid = FALSE))
### Now use dense mu grids
# standard grid
plot(make.grid(c(mu = 100, sigma = 100), dense_mu_grid = FALSE),
pch = 20, col = "red")
# dense grid for all mstop_mu values greater than mstop_sigma
grid <- make.grid(c(mu = 100, sigma = 100))
points(grid, pch = 20, cex = 0.2)
abline(0,1)
# now with three parameters
grid <- make.grid(c(mu = 100, sigma = 100, df = 30),
length.out = c(5, 5, 2), dense_mu_grid = FALSE)
densegrid <- make.grid(c(mu = 100, sigma = 100, df = 30),
length.out = c(5, 5, 2))
par(mfrow = c(1,2))
# first for df = 1
plot(grid[grid$df == 1, 1:2], main = "df = 1", pch = 20, col = "red")
abline(0,1)
abline(v = 1)
# now expand grid for all mu values greater than the corresponding sigma
# value (i.e. below the bisecting line) and above df (i.e. 1)
points(densegrid[densegrid$df == 1, 1:2], pch = 20, cex = 0.2)
# now for df = 30
plot(grid[grid$df == 30, 1:2], main = "df = 30", pch = 20, col = "red")
abline(0,1)
abline(v = 30)
# now expand grid for all mu values greater than the corresponding sigma
# value (i.e. below the bisecting line) and above df (i.e. 30)
points(densegrid[densegrid$df == 30, 1:2], pch = 20, cex = 0.2)