nestcv.train {nestedcv}        R Documentation

Nested cross-validation for caret

Description

This function applies nested cross-validation (CV) to the training of models using the caret package. It also offers embedded filtering of predictors for feature selection, nested within the outer loop of CV. Predictions on the outer test folds are brought back together and error estimation/accuracy is determined. The default is 10x10 nested CV.

Usage

nestcv.train(
  y,
  x,
  method = "rf",
  filterFUN = NULL,
  filter_options = NULL,
  weights = NULL,
  balance = NULL,
  balance_options = NULL,
  modifyX = NULL,
  modifyX_useY = FALSE,
  modifyX_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  n_inner_folds = 10,
  outer_folds = NULL,
  inner_folds = NULL,
  pass_outer_folds = FALSE,
  cv.cores = 1,
  multicore_fork = (Sys.info()["sysname"] != "Windows"),
  metric = ifelse(is.factor(y), "logLoss", "RMSE"),
  trControl = NULL,
  tuneGrid = NULL,
  savePredictions = "final",
  outer_train_predict = FALSE,
  finalCV = TRUE,
  na.option = "pass",
  verbose = TRUE,
  ...
)

Arguments

y

Response vector. For classification this should be a factor.

x

Matrix or data frame of predictors.

method

String specifying which model to use. See caret::train() for details.

filterFUN

Filter function, e.g. ttest_filter() or relieff_filter(). Any function can be provided and is passed y and x. Must return a character vector with names of filtered predictors.
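
For illustration, a user-supplied filter might be sketched as follows (var_filter and its cutoff argument are hypothetical names, not part of nestedcv):

## hypothetical variance filter: keep predictors whose variance
## exceeds a cutoff and return their column names
var_filter <- function(y, x, cutoff = 0.5, ...) {
  v <- apply(x, 2, var, na.rm = TRUE)
  names(v)[v > cutoff]
}
## usage sketch:
## nestcv.train(y, x, method = "rf", filterFUN = var_filter,
##              filter_options = list(cutoff = 0.5))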

filter_options

List of additional arguments passed to the filter function specified by filterFUN.

weights

Weights applied to each sample for models which can use weights. Note weights and balance cannot be used at the same time. Weights are not applied in filters.

balance

Specifies the method for dealing with imbalanced class data. Current options are "randomsample" or "smote". See randomsample() and smote().

balance_options

List of additional arguments passed to the balancing function.
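
As a sketch, imbalance handling might be requested as follows:

## sketch: oversample the minority class using SMOTE
## fit <- nestcv.train(y, x, method = "rf", balance = "smote")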

modifyX

Character string specifying the name of a function to modify x. This can be an imputation function for replacing missing values, or a more complex function which alters or even adds columns to x. The required return value of this function depends on the modifyX_useY setting.

modifyX_useY

Logical value whether the x modifying function makes use of response training data from y. If FALSE then the modifyX function simply needs to return a modified x object. If TRUE then the modifyX function must return a model type object on which predict() can be called, so that train and test partitions of x can be modified independently.

modifyX_options

List of additional arguments passed to the x-modifying function.
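
A minimal sketch of an x-modifying function (impute_med is a hypothetical name), assuming that with modifyX_useY = FALSE the function is called on x alone and returns the modified x:

## hypothetical median imputation of missing values in x
impute_med <- function(x, ...) {
  as.data.frame(lapply(x, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  }))
}
## usage sketch: modifyX takes the function name as a string
## nestcv.train(y, x, method = "rf", modifyX = "impute_med")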

outer_method

String of either "cv" or "LOOCV" specifying whether to perform k-fold CV or leave-one-out CV (LOOCV) for the outer folds.

n_outer_folds

Number of outer CV folds.

n_inner_folds

Sets the number of inner CV folds. Note that if trControl or inner_folds is specified then these supersede n_inner_folds.

outer_folds

Optional list containing indices of test folds for outer CV. If supplied, n_outer_folds is ignored.

inner_folds

Optional list of test fold indices for inner CV. This must be structured as a list of the outer folds each containing a list of inner folds. Can only be supplied if balancing is not applied. If supplied, n_inner_folds is ignored.

pass_outer_folds

Logical indicating whether the same outer folds are used for fitting of the final model when final CV is applied. Note this can only be applied when n_outer_folds and the number of inner CV folds specified in n_inner_folds or trControl are the same, and when no balancing is applied.

cv.cores

Number of cores for parallel processing of the outer loops.

multicore_fork

Logical whether to use forked multicore parallel processing. Forked multicore processing uses parallel::mclapply. It is only available on Unix/macOS, as Windows does not allow forking, so it is set to FALSE by default on Windows and TRUE on Unix/macOS. Non-forked parallel processing is executed using parallel::parLapply, or pbapply::pblapply if verbose is TRUE.

metric

A string that specifies what summary metric will be used to select the optimal model. By default, "logLoss" is used for classification and "RMSE" is used for regression. Note this differs from the default setting in caret which uses "Accuracy" for classification. See details.

trControl

A list of values generated by the caret function caret::trainControl(). This defines how inner CV training through caret is performed. Default for the inner loop is 10-fold CV. Setting this argument overrules n_inner_folds. See http://topepo.github.io/caret/using-your-own-model-in-train.html.
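
For example, a 5-fold inner CV controller tuned on logLoss (mirroring the package's classification defaults) might be set up as a sketch like this:

## 5-fold inner CV tuned on logLoss
tc <- caret::trainControl(method = "cv", number = 5,
                          classProbs = TRUE,
                          summaryFunction = caret::mnLogLoss)
## nestcv.train(y, x, method = "rf", trControl = tc)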

tuneGrid

Data frame of tuning values, see caret::train().

savePredictions

Indicates whether hold-out predictions for each inner CV fold should be saved for ROC curves, accuracy etc.; see caret::trainControl(). Default is "final" to capture predictions for inner CV ROC curves.

outer_train_predict

Logical whether to save predictions on the outer training folds, in order to calculate performance on the outer training folds.

finalCV

Logical whether to perform one last round of CV on the whole dataset to determine the final model parameters. If set to FALSE, the median of the best hyperparameters from the outer CV folds is used for continuous/ordinal hyperparameters, or the highest-voted value for categorical hyperparameters, to fit the final model. Performance metrics are independent of this last step. If set to NA, final model fitting is skipped altogether, which gives a useful speed boost if performance metrics are all that is needed.
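
As a sketch:

## skip final model fitting for speed if only performance metrics are needed
## fit <- nestcv.train(y, x, method = "rf", finalCV = NA)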

na.option

Character value specifying how NAs are dealt with. "omit" is equivalent to na.action = na.omit. "omitcol" removes cases if there are NAs in y, but columns (predictors) containing NAs are removed from x to preserve cases. Any other value means that NAs are ignored (a message is given).

verbose

Logical whether to print messages and show progress.

...

Arguments passed to caret::train().

Details

When finalCV = TRUE, the final fit on the whole data is performed first. This helps flag errors generated by caret such as missing packages. Parallelisation of the final fit when finalCV = TRUE is performed in caret using registerDoParallel. caret itself uses foreach.

Parallelisation is performed on the outer CV folds using parallel::mclapply by default on Unix/macOS and parallel::parLapply on Windows. mclapply uses forking, which is faster, but some models use multi-threading, which may cause issues in some circumstances with forked multicore processing. Setting multicore_fork to FALSE is slower but can alleviate some caret errors.

If the outer folds are run using parallelisation, then parallelisation in caret must be off, otherwise an error will be generated. Alternatively, if you wish to use parallelisation in caret, then parallelisation in nestcv.train can be fully disabled by leaving cv.cores = 1.
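
As a sketch, to parallelise the outer CV folds while keeping caret itself serial:

## run outer folds on 4 cores; non-forked processing can avoid
## clashes with multi-threaded models
## fit <- nestcv.train(y, x, method = "rf", cv.cores = 4,
##                     multicore_fork = FALSE)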

xgboost models fitted via caret using method = "xgbTree" or "xgbLinear" invoke OpenMP multithreading on Linux/Windows by default, which causes nestcv.train to fail when cv.cores > 1 (nested parallelisation). macOS is unaffected. To prevent this, nestcv.train() sets OpenMP threads to 1 if cv.cores > 1.

For classification, metric defaults to 'logLoss' with the trControl arguments classProbs = TRUE, summaryFunction = mnLogLoss, rather than 'Accuracy', which is the default classification metric in caret. See caret::trainControl(). LogLoss is arguably more consistent than Accuracy for tuning parameters in datasets with small sample size.

Models can be fitted with a single set of fixed parameters, in which case trControl defaults to trainControl(method = "none") which disables inner CV as it is unnecessary. See https://topepo.github.io/caret/model-training-and-tuning.html#fitting-models-without-parameter-tuning
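
A sketch of such a fixed-parameter fit (the mtry value is arbitrary):

## single fixed parameter set: inner CV is disabled as unnecessary
## fit <- nestcv.train(y, x, method = "rf",
##                     trControl = caret::trainControl(method = "none"),
##                     tuneGrid = data.frame(mtry = 2))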

Value

An object with S3 class "nestcv.train"

call

The matched call.

output

Predictions on the left-out outer folds.

outer_result

List object of results from each outer fold containing predictions on the left-out outer folds, the caret result and the number of filtered predictors at each fold.

outer_folds

List of indices of outer test folds.

dimx

Dimensions of x.

xsub

Subset of x containing all predictors used in both outer CV folds and the final model.

y

Original response vector.

yfinal

Final response vector (post-balancing).

final_fit

Final fitted caret model using the best tuned parameters.

final_vars

Column names of filtered predictors entering the final model.

summary_vars

Summary statistics of filtered predictors.

roc

ROC AUC for binary classification, where available.

trControl

caret::trainControl object used for inner CV.

bestTunes

Best tuned parameters from each outer fold.

finalTune

Final parameters used for the final model.

summary

Overall performance summary: accuracy and balanced accuracy for classification; ROC AUC for binary classification; RMSE for regression.

Author(s)

Myles Lewis

Examples


## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}

## load iris dataset and simulate a binary outcome
data(iris)
x <- iris[, 1:4]
colnames(x) <- c("marker1", "marker2", "marker3", "marker4")
x <- as.data.frame(apply(x, 2, scale))
y2 <- sigmoid(0.5 * x$marker1 + 2 * x$marker2) > runif(nrow(x))
y2 <- factor(y2, labels = c("class1", "class2"))

## Example using random forest with caret
cvrf <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 3,
                     cv.cores = 2)
summary(cvrf)

## Example of glmnet tuned using caret
## set up small tuning grid for quick execution
## length.out of 20-100 is usually recommended for lambda
## and more alpha values ranging from 0-1
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 5)),
                  alpha = 1)

ncv <- nestcv.train(y = y2, x = x,
                    method = "glmnet",
                    n_outer_folds = 3,
                    tuneGrid = tg, cv.cores = 2)
summary(ncv)

## plot tuning for outer fold #1
plot(ncv$outer_result[[1]]$fit, xTrans = log)

## plot final ROC curve
plot(ncv$roc)

## plot ROC for left-out inner folds
inroc <- innercv_roc(ncv)
plot(inroc)

## example to show use of custom fold indices for 5 x 5-fold nested CV
library(caret)
y <- iris$Species
out_folds <- createFolds(y, k = 5)
in_folds <- lapply(out_folds, function(i) {
  ytrain <- y[-i]
  createFolds(ytrain, k = 5)
})

res <- nestcv.train(y, x, method = "rf", cv.cores = 2,
                    pass_outer_folds = TRUE,
                    inner_folds = in_folds,
                    outer_folds = out_folds)
summary(res)
res$outer_folds
res$final_fit$control$indexOut  # same as outer_folds


[Package nestedcv version 0.7.8]