nestcv.train {nestedcv}        R Documentation

Nested cross-validation for caret

Description

This function applies nested cross-validation (CV) to the training of models using the caret package. It also offers embedded filtering of predictors for feature selection, nested within the outer loop of CV. Predictions on the outer test folds are brought back together and error estimation/accuracy is determined. The default is 10x10 nested CV.

Usage

nestcv.train(
  y,
  x,
  method = "rf",
  filterFUN = NULL,
  filter_options = NULL,
  weights = NULL,
  balance = NULL,
  balance_options = NULL,
  modifyX = NULL,
  modifyX_useY = FALSE,
  modifyX_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  n_inner_folds = 10,
  outer_folds = NULL,
  inner_folds = NULL,
  pass_outer_folds = FALSE,
  cv.cores = 1,
  multicore_fork = (Sys.info()["sysname"] != "Windows"),
  metric = ifelse(is.factor(y), "logLoss", "RMSE"),
  trControl = NULL,
  tuneGrid = NULL,
  savePredictions = "final",
  outer_train_predict = FALSE,
  finalCV = TRUE,
  na.option = "pass",
  verbose = TRUE,
  ...
)

Arguments

y

Response vector. For classification this should be a factor.

x

Matrix or data frame of predictors.

method

String specifying which model to use. See caret::train() for details.

filterFUN

Filter function, e.g. ttest_filter() or relieff_filter(). Any function can be provided and is passed y and x. Must return a character vector with names of filtered predictors.
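
For illustration, a user-supplied filter might be sketched as follows (var_filter and its cutoff argument are hypothetical names, not part of nestedcv):

## hypothetical variance filter: keep predictors whose variance
## exceeds a cutoff and return their column names
var_filter <- function(y, x, cutoff = 0.5, ...) {
  v <- apply(x, 2, var, na.rm = TRUE)
  names(v)[v > cutoff]
}
## usage sketch:
## nestcv.train(y, x, method = "rf", filterFUN = var_filter,
##              filter_options = list(cutoff = 0.5))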

filter_options

List of additional arguments passed to the filter function specified by filterFUN.

weights

Weights applied to each sample for models which can use weights. Note weights and balance cannot be used at the same time. Weights are not applied in filters.

balance

Specifies the method for dealing with imbalanced class data. Current options are "randomsample" or "smote". See randomsample() and smote().

balance_options

List of additional arguments passed to the balancing function.
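
As a sketch, imbalance handling might be requested as follows:

## sketch: oversample the minority class using SMOTE
## fit <- nestcv.train(y, x, method = "rf", balance = "smote")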

modifyX

Character string specifying the name of a function to modify x. This can be an imputation function for replacing missing values, or a more complex function which alters or even adds columns to x. The required return value of this function depends on the modifyX_useY setting.

modifyX_useY

Logical value whether the x modifying function makes use of response training data from y. If FALSE then the modifyX function simply needs to return a modified x object. If TRUE then the modifyX function must return a model type object on which predict() can be called, so that train and test partitions of x can be modified independently.

modifyX_options

List of additional arguments passed to the x-modifying function.
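
A minimal sketch of an x-modifying function (impute_med is a hypothetical name), assuming that with modifyX_useY = FALSE the function is called on x alone and returns the modified x:

## hypothetical median imputation of missing values in x
impute_med <- function(x, ...) {
  as.data.frame(lapply(x, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  }))
}
## usage sketch: modifyX takes the function name as a string
## nestcv.train(y, x, method = "rf", modifyX = "impute_med")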

outer_method

String of either "cv" or "LOOCV" specifying whether to perform k-fold CV or leave-one-out CV (LOOCV) for the outer folds.

n_outer_folds

Number of outer CV folds.

n_inner_folds

Sets the number of inner CV folds. Note that if trControl or inner_folds is specified then these supersede n_inner_folds.

outer_folds

Optional list containing indices of test folds for outer CV. If supplied, n_outer_folds is ignored.

inner_folds

Optional list of test fold indices for inner CV. This must be structured as a list of the outer folds each containing a list of inner folds. Can only be supplied if balancing is not applied. If supplied, n_inner_folds is ignored.

pass_outer_folds

Logical indicating whether the same outer folds are used for fitting of the final model when final CV is applied. Note this can only be applied when n_outer_folds and the number of inner CV folds specified in n_inner_folds or trControl are the same, and when no balancing is applied.

cv.cores

Number of cores for parallel processing of the outer loops.

multicore_fork

Logical whether to use forked multicore parallel processing. Forked multicore processing uses parallel::mclapply. It is only available on Unix/macOS, as Windows does not allow forking, so it is set to FALSE by default on Windows and TRUE on Unix/macOS. Non-forked parallel processing is executed using parallel::parLapply, or pbapply::pblapply if verbose is TRUE.

metric

A string that specifies what summary metric will be used to select the optimal model. By default, "logLoss" is used for classification and "RMSE" is used for regression. Note this differs from the default setting in caret which uses "Accuracy" for classification. See details.

trControl

A list of values generated by the caret function caret::trainControl(). This defines how inner CV training through caret is performed. Default for the inner loop is 10-fold CV. Setting this argument overrules n_inner_folds. See http://topepo.github.io/caret/using-your-own-model-in-train.html.
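
For example, a 5-fold inner CV controller tuned on logLoss (mirroring the package's classification defaults) might be set up as a sketch like this:

## 5-fold inner CV tuned on logLoss
tc <- caret::trainControl(method = "cv", number = 5,
                          classProbs = TRUE,
                          summaryFunction = caret::mnLogLoss)
## nestcv.train(y, x, method = "rf", trControl = tc)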

tuneGrid

Data frame of tuning values, see caret::train().

savePredictions

Indicates whether hold-out predictions for each inner CV fold should be saved for ROC curves, accuracy etc.; see caret::trainControl(). Default is "final" to capture predictions for inner CV ROC curves.

outer_train_predict

Logical whether to save predictions on the outer training folds, in order to calculate performance on the outer training folds.

finalCV

Logical whether to perform one last round of CV on the whole dataset to determine the final model parameters. If set to FALSE, the median of the best hyperparameters from the outer CV folds is used for continuous/ordinal hyperparameters, or the highest-voted value for categorical hyperparameters, to fit the final model. Performance metrics are independent of this last step. If set to NA, final model fitting is skipped altogether, which gives a useful speed boost if performance metrics are all that is needed.
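
As a sketch:

## skip final model fitting for speed if only performance metrics are needed
## fit <- nestcv.train(y, x, method = "rf", finalCV = NA)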

na.option

Character value specifying how NAs are dealt with. "omit" is equivalent to na.action = na.omit. "omitcol" removes cases if there are NAs in y, but columns (predictors) containing NAs are removed from x to preserve cases. Any other value means that NAs are ignored (a message is given).

verbose

Logical whether to print messages and show progress.

...

Arguments passed to caret::train().

Details

When finalCV = TRUE, the final fit on the whole data is performed first. This helps flag errors generated by caret such as missing packages. Parallelisation of the final fit when finalCV = TRUE is performed in caret using registerDoParallel. caret itself uses foreach.

Parallelisation is performed on the outer CV folds using parallel::mclapply by default on Unix/macOS and parallel::parLapply on Windows. mclapply uses forking, which is faster, but some models use multi-threading, which may cause issues in some circumstances with forked multicore processing. Setting multicore_fork to FALSE is slower but can alleviate some caret errors.

If the outer folds are run using parallelisation, then parallelisation in caret must be off, otherwise an error will be generated. Alternatively, if you wish to use parallelisation in caret, then parallelisation in nestcv.train can be fully disabled by leaving cv.cores = 1.
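
As a sketch, to parallelise the outer CV folds while keeping caret itself serial:

## run outer folds on 4 cores; non-forked processing can avoid
## clashes with multi-threaded models
## fit <- nestcv.train(y, x, method = "rf", cv.cores = 4,
##                     multicore_fork = FALSE)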

xgboost models fitted via caret using method = "xgbTree" or "xgbLinear" invoke OpenMP multithreading on Linux/Windows by default, which causes nestcv.train to fail when cv.cores > 1 (nested parallelisation). macOS is unaffected. To prevent this, nestcv.train() sets OpenMP threads to 1 if cv.cores > 1.

For classification, metric defaults to 'logLoss' with the trControl arguments classProbs = TRUE, summaryFunction = mnLogLoss, rather than 'Accuracy', which is the default classification metric in caret. See caret::trainControl(). LogLoss is arguably more consistent than Accuracy for tuning parameters in datasets with small sample size.

Models can be fitted with a single set of fixed parameters, in which case trControl defaults to trainControl(method = "none") which disables inner CV as it is unnecessary. See https://topepo.github.io/caret/model-training-and-tuning.html#fitting-models-without-parameter-tuning
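
A sketch of such a fixed-parameter fit (the mtry value is arbitrary):

## single fixed parameter set: inner CV is disabled as unnecessary
## fit <- nestcv.train(y, x, method = "rf",
##                     trControl = caret::trainControl(method = "none"),
##                     tuneGrid = data.frame(mtry = 2))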

Value

An object with S3 class "nestcv.train"

call

The matched call.

output

Predictions on the left-out outer folds.

outer_result

List object of results from each outer fold containing predictions on the left-out outer folds, the caret result and the number of filtered predictors at each fold.

outer_folds

List of indices of outer test folds.

dimx

Dimensions of x.

xsub

Subset of x containing all predictors used in both outer CV folds and the final model.

y

Original response vector.

yfinal

Final response vector (post-balancing).

final_fit

Final fitted caret model using the best tuned parameters.

final_vars

Column names of filtered predictors entering the final model.

summary_vars

Summary statistics of filtered predictors.

roc

ROC AUC for binary classification, where available.

trControl

caret::trainControl object used for inner CV.

bestTunes

Best tuned parameters from each outer fold.

finalTune

Final parameters used for the final model.

summary

Overall performance summary: accuracy and balanced accuracy for classification; ROC AUC for binary classification; RMSE for regression.

Author(s)

Myles Lewis

Examples


## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

## load iris dataset and simulate a binary outcome
data(iris)
x <- iris[, 1:4]
colnames(x) <- c("marker1", "marker2", "marker3", "marker4")
x <- as.data.frame(apply(x, 2, scale))
y2 <- sigmoid(0.5 * x$marker1 + 2 * x$marker2) > runif(nrow(x))
y2 <- factor(y2, labels = c("class1", "class2"))

## Example using random forest with caret
cvrf <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 3,
                     cv.cores = 2)
summary(cvrf)

## Example of glmnet tuned using caret
## set up small tuning grid for quick execution
## length.out of 20-100 is usually recommended for lambda
## and more alpha values ranging from 0-1
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 5)),
                  alpha = 1)

ncv <- nestcv.train(y = y2, x = x,
                    method = "glmnet",
                    n_outer_folds = 3,
                    tuneGrid = tg, cv.cores = 2)
summary(ncv)

## plot tuning for outer fold #1
plot(ncv$outer_result[[1]]$fit, xTrans = log)

## plot final ROC curve
plot(ncv$roc)

## plot ROC for left-out inner folds
inroc <- innercv_roc(ncv)
plot(inroc)

## example to show use of custom fold indices for 5 x 5-fold nested CV
library(caret)
y <- iris$Species
out_folds <- createFolds(y, k = 5)
in_folds <- lapply(out_folds, function(i) {
  ytrain <- y[-i]
  createFolds(ytrain, k = 5)
})

res <- nestcv.train(y, x, method = "rf", cv.cores = 2,
                    pass_outer_folds = TRUE,
                    inner_folds = in_folds,
                    outer_folds = out_folds)
summary(res)
res$outer_folds
res$final_fit$control$indexOut  # same as outer_folds


[Package nestedcv version 0.7.8]