R: Cross-Validation

cvrisk {mboost}

R Documentation

Cross-Validation

Description

Cross-validated estimation of the empirical risk for hyper-parameter selection.

Usage

## S3 method for class 'mboost'
cvrisk(object, folds = cv(model.weights(object)),
       grid = 0:mstop(object),
       papply = mclapply,
       fun = NULL, mc.preschedule = FALSE, ...)
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

## Plot cross-validation results
## S3 method for class 'cvrisk'
plot(x, 
     xlab = "Number of boosting iterations", ylab = attr(x, "risk"),
     ylim = range(x), main = attr(x, "type"), ...)

Arguments

`object`	an object of class `mboost`.
`folds`	a weight matrix with number of rows equal to the number of observations. The number of columns corresponds to the number of cross-validation runs. Can be computed using function `cv` and defaults to 25 bootstrap samples.
`grid`	a vector of stopping parameters the empirical risk is to be evaluated for.
`papply`	(parallel) apply function, defaults to `mclapply`. Alternatively, `parLapply` can be used. In the latter case, usually more setup is needed (see example for some details). To run `cvrisk` sequentially (i.e. not in parallel), one can use `lapply`.
`fun`	if `fun` is NULL, the out-of-sample risk is returned. `fun`, as a function of `object`, may extract any other characteristic of the cross-validated models. These are returned as is.
`mc.preschedule`	should tasks be prescheduled if they are parallelized using `mclapply` (default: `FALSE`)? For details see `mclapply`.
`weights`	a numeric vector of weights for the model to be cross-validated.
`type`	character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented.
`B`	number of folds; by default 25 for `bootstrap` and `subsampling` and 10 for `kfold`.
`prob`	proportion (between 0 and 1) of observations to be included in the learning samples for subsampling.
`strata`	a factor of the same length as `weights` for stratification.
`x`	an object of class `cvrisk`.
`xlab`, `ylab`	axis labels.
`ylim`	limits of y-axis.
`main`	main title of graphic.
`...`	additional arguments passed to `mclapply` or `plot`.

Details

The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. This function computes honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters mstop. These estimates can be used to choose an appropriate number of boosting iterations.

Different forms of cross-validation can be applied, for example 10-fold cross-validation or bootstrapping. The weights (zero weights correspond to test cases) are defined via the folds matrix.
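To illustrate the layout of such a weight matrix, the following minimal sketch builds one for k-fold cross-validation by hand in base R. The toy sizes (`n`, `k`) and the names `w`, `fold_id` and `folds` are hypothetical; in practice the function `cv` is the intended way to obtain the matrix.

```r
## Hypothetical toy setup: 10 observations with unit weights, 5 folds.
set.seed(1)
n <- 10
k <- 5
w <- rep(1, n)

## Randomly assign each observation to exactly one fold.
fold_id <- sample(rep(seq_len(k), length.out = n))

## Column j keeps the original weight for the learning cases and sets
## the weight to zero for the observations held out in run j.
folds <- sapply(seq_len(k), function(j) w * (fold_id != j))

dim(folds)            # one row per observation, one column per run
colSums(folds == 0)   # number of test cases in each run
```

Each row of this matrix contains exactly one zero, so every observation serves as a test case in exactly one run.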

cvrisk runs in parallel on OSes where forking is possible (i.e., not on Windows) and multiple cores/processors are available. The scheduling can be changed by the corresponding arguments of mclapply (via the dot arguments).
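As a rough sketch of how one might pick an apply function depending on the platform, the snippet below falls back to `lapply` on Windows and uses `mclapply` elsewhere. The variable name `papply` and the toy workload are illustrative assumptions, not part of the package.

```r
library("parallel")  # ships with R

## Fall back to sequential lapply where forking is unavailable.
papply <- if (.Platform$OS.type == "windows") lapply else mclapply

## Stand-in for the per-fold work that would be distributed.
res <- papply(1:4, function(i) i^2)
unlist(res)
```

A function chosen this way could then be supplied through the `papply` argument shown in the Usage section.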

The function cv can be used to build an appropriate weight matrix to be used with cvrisk. If strata is defined, sampling is performed in each stratum separately, thus preserving the distribution of the strata variable in each fold.
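The idea of sampling within each stratum can be sketched in base R as follows. This is only an illustration under assumed toy data (a two-level factor `strata` and `k` folds) and is not necessarily how cv implements stratification internally.

```r
set.seed(2)
## Hypothetical strata: 12 observations in group "a", 8 in group "b".
strata <- factor(rep(c("a", "b"), times = c(12, 8)))
k <- 4

## Assign fold labels within each stratum separately.
fold_id <- integer(length(strata))
for (s in levels(strata)) {
  idx <- which(strata == s)
  fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

## Cross-tabulate stratum membership against fold assignment.
tab <- table(strata, fold_id)
tab
```

Because the labels are drawn per stratum, every fold holds out the same share of each group, so the 12:8 ratio of the strata variable is preserved in all learning and test samples.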

There exist various functions to display and work with cross-validation results. One can print and plot (see above) results and extract the optimal iteration via mstop.
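The selection step can be sketched on a plain matrix. The risk values below are made up for illustration, and the sketch assumes that the optimal iteration is the grid value minimizing the out-of-sample risk averaged over folds; the names `risk`, `grid` and `best` are hypothetical.

```r
## Hypothetical risk matrix: 3 folds (rows) x 5 grid values (columns).
grid <- 0:4
risk <- rbind(c(9.0, 6.0, 4.0, 4.5, 5.0),
              c(8.0, 5.5, 4.2, 4.1, 4.8),
              c(9.5, 6.5, 3.9, 4.4, 5.2))
colnames(risk) <- grid

## Average the out-of-sample risk over folds and take the minimizer.
mean_risk <- colMeans(risk)
best <- grid[which.min(mean_risk)]
best
```

Note that the grid starts at 0, so the position of the minimum in the matrix and the selected number of iterations differ by one.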

Value

An object of class cvrisk (when fun wasn't specified), basically a matrix containing estimates of the empirical risk for a varying number of boosting iterations. plot and print methods are available, as well as an mstop method.

References

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2005). The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675–699.

Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012). The importance of knowing when to stop - a sequential stopping rule for component-wise gradient boosting. Methods of Information in Medicine, 51, 178–186.
DOI: 10.3414/ME11-02-0030

Examples


  data("bodyfat", package = "TH.data")

  ### fit linear model to data
  model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

  ### AIC-based selection of number of boosting iterations
  maic <- AIC(model)
  maic

  ### inspect coefficient path and AIC-based stopping criterion
  par(mai = par("mai") * c(1, 1, 1, 1.8))
  plot(model)
  abline(v = mstop(maic), col = "lightgray")

  ### 10-fold cross-validation
  cv10f <- cv(model.weights(model), type = "kfold")
  cvm <- cvrisk(model, folds = cv10f, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### 25 bootstrap iterations (manually)
  set.seed(290875)
  n <- nrow(bodyfat)
  bs25 <- rmultinom(25, n, rep(1, n)/n)
  cvm <- cvrisk(model, folds = bs25, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### same by default
  set.seed(290875)
  cvrisk(model, papply = lapply)

  ### 25 bootstrap iterations (using cv)
  set.seed(290875)
  bs25_2 <- cv(model.weights(model), type="bootstrap")
  all(bs25 == bs25_2)

## Not run: 
############################################################
## Do not run this example automatically as it takes
## some time (~ 5 seconds depending on the system)

  ### trees
  blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
  cvtree <- cvrisk(blackbox, papply = lapply)
  plot(cvtree)
  
## End(Not run this automatically)  

## End(Not run)


### cvrisk in parallel modes:

## Not run: 
## at least not automatically

## parallel::mclapply(), which is used here for parallelization, only runs
## in parallel on Unix-like systems (here we use 2 cores)

    cvrisk(model, mc.cores = 2)

## infrastructure needs to be set up in advance

    library("parallel") # provides makeCluster, parLapply and stopCluster
    cl <- makeCluster(25) # e.g. to run cvrisk on 25 worker processes
    myApply <- function(X, FUN, ...) {
      myFun <- function(...) {
          library("mboost") # load mboost on nodes
          FUN(...)
      }
      ## further set up steps as required
      parLapply(cl = cl, X, myFun, ...)
    }
    cvrisk(model, papply = myApply)
    stopCluster(cl)

## End(Not run)

[Package mboost version 2.9-10 Index]