nestcv.train {nestedcv}    R Documentation
Nested cross-validation for caret
Description
This function applies nested cross-validation (CV) to training of models
using the caret
package. The function also allows the option of embedded
filtering of predictors for feature selection nested within the outer loop of
CV. Predictions on the outer test folds are brought back together and error
estimation/accuracy is determined. The default is 10x10 nested CV.
Usage
nestcv.train(
y,
x,
method = "rf",
filterFUN = NULL,
filter_options = NULL,
weights = NULL,
balance = NULL,
balance_options = NULL,
modifyX = NULL,
modifyX_useY = FALSE,
modifyX_options = NULL,
outer_method = c("cv", "LOOCV"),
n_outer_folds = 10,
n_inner_folds = 10,
outer_folds = NULL,
inner_folds = NULL,
pass_outer_folds = FALSE,
cv.cores = 1,
multicore_fork = (Sys.info()["sysname"] != "Windows"),
metric = ifelse(is.factor(y), "logLoss", "RMSE"),
trControl = NULL,
tuneGrid = NULL,
savePredictions = "final",
outer_train_predict = FALSE,
finalCV = TRUE,
na.option = "pass",
verbose = TRUE,
...
)
Arguments
y |
Response vector. For classification this should be a factor. |
x |
Matrix or data frame of predictors |
method |
String specifying which model to use. See caret::train() for the list of available models. |
filterFUN |
Filter function, e.g. ttest_filter() or relieff_filter(). Any function can be provided; it is passed y and x and must return a character vector of the names of the filtered predictors. |
filter_options |
List of additional arguments passed to the filter
function specified by filterFUN. |
weights |
Weights applied to each sample for models which can use
weights. Note weights and balance cannot be used at the same time. |
balance |
Specifies method for dealing with imbalanced class data.
Current options are "randomsample" or "smote". Not available for regression. |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
x, e.g. an imputation function for replacing missing values, or a more
complex function which alters or adds columns to x. |
modifyX_useY |
Logical value whether the x modifying function makes use of response training data from y. |
modifyX_options |
List of additional arguments passed to the x modifying function. |
outer_method |
String of either "cv" or "LOOCV" specifying whether to perform k-fold CV or leave-one-out CV for the outer folds. |
n_outer_folds |
Number of outer CV folds |
n_inner_folds |
Sets number of inner CV folds. Note if trControl or inner_folds is specified, this is overridden. |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, n_outer_folds is ignored. |
inner_folds |
Optional list of test fold indices for inner CV. This must
be structured as a list of the outer folds each containing a list of inner
folds. Can only be supplied if balancing is not applied. If supplied,
n_inner_folds is ignored. |
pass_outer_folds |
Logical indicating whether the same outer folds are
used for fitting of the final model when final CV is applied. Note this can
only be applied when finalCV = TRUE and balancing is not used, since
balancing changes the number of samples in each fold. |
cv.cores |
Number of cores for parallel processing of the outer loops. |
multicore_fork |
Logical whether to use forked multicore parallel
processing. Forked multicore processing uses parallel::mclapply and is only
available on unix/mac, as windows does not support forking. |
metric |
A string that specifies what summary metric will be used to select the optimal model. By default, "logLoss" is used for classification and "RMSE" is used for regression. Note this differs from the default setting in caret which uses "Accuracy" for classification. See details. |
trControl |
A list of values generated by the caret function trainControl(), defining how inner CV tuning is performed. |
tuneGrid |
Data frame of tuning values, see caret::train(). |
savePredictions |
Indicates whether hold-out predictions for each inner
CV fold should be saved for ROC curves, accuracy etc.; see
caret::trainControl. Default is "final". |
outer_train_predict |
Logical whether to save predictions on the outer training folds, so that performance on them can be calculated. |
finalCV |
Logical whether to perform one last round of CV on the whole
dataset to determine the final model parameters. If set to FALSE, the best
hyperparameters from the outer CV folds are used to fit the final model; if
set to NA, fitting of the final model is skipped altogether, which is faster
if only performance metrics are needed. |
na.option |
Character value specifying how NAs are dealt with. "omit" removes rows (samples) containing NAs; "omitcol" removes predictor columns containing NAs; the default "pass" passes NA handling through to caret. |
verbose |
Logical whether to print messages and show progress |
... |
Arguments passed to caret::train() |
Details
When finalCV = TRUE, the final fit on the whole data is performed first.
This helps flag errors generated by caret such as missing packages.
Parallelisation of the final fit when finalCV = TRUE is performed in caret
using registerDoParallel, since caret itself uses foreach.
Parallelisation is performed on the outer CV folds using parallel::mclapply
by default on unix/mac and parallel::parLapply on windows. mclapply uses
forking, which is faster, but some models use multi-threading, which may
cause issues in some circumstances with forked multicore processing. Setting
multicore_fork to FALSE is slower but can alleviate some caret errors.

If the outer folds are run using parallelisation, then parallelisation in
caret must be off, otherwise an error will be generated. Alternatively, if
you wish to use parallelisation in caret, then parallelisation in
nestcv.train can be fully disabled by leaving cv.cores = 1.
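As a minimal sketch of the two alternative strategies, using the simulated
y2 and x from the Examples below (fold and core counts are illustrative):

## Option 1: parallelise the outer CV folds; caret runs serially
fit1 <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 5,
                     cv.cores = 4)

## Option 2: keep the outer loop serial and parallelise within caret
library(doParallel)
registerDoParallel(cores = 4)  # caret picks this up via foreach
fit2 <- nestcv.train(y2, x, method = "rf",
                     n_outer_folds = 5,
                     cv.cores = 1)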
xgboost models fitted via caret using method = "xgbTree" or "xgbLinear"
invoke OpenMP multithreading on linux/windows by default, which causes
nestcv.train to fail when cv.cores > 1 (nested parallelisation). Mac OS is
unaffected. To prevent this, nestcv.train() sets OpenMP threads to 1 if
cv.cores > 1.
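A sketch of this case (model settings are illustrative); no extra steps are
needed since the OpenMP thread cap is applied internally:

## OpenMP threads capped at 1 automatically when cv.cores > 1
fitx <- nestcv.train(y2, x, method = "xgbTree",
                     n_outer_folds = 3, cv.cores = 2)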
For classification, metric
defaults to using 'logLoss' with the trControl
arguments classProbs = TRUE, summaryFunction = mnLogLoss
, rather than
'Accuracy' which is the default classification metric in caret
. See
caret::trainControl()
. LogLoss is arguably more consistent than Accuracy
for tuning parameters in datasets with small sample size.
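As a sketch, the classification default is roughly equivalent to supplying a
trControl like the one below (the values shown are assumptions mirroring the
defaults described above):

## roughly equivalent to the classification defaults
tc <- caret::trainControl(method = "cv", number = 10,
                          classProbs = TRUE,
                          summaryFunction = caret::mnLogLoss)
fit <- nestcv.train(y2, x, method = "rf",
                    metric = "logLoss", trControl = tc)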
Models can be fitted with a single set of fixed parameters, in which case
trControl
defaults to trainControl(method = "none")
which disables inner
CV as it is unnecessary. See
https://topepo.github.io/caret/model-training-and-tuning.html#fitting-models-without-parameter-tuning
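For example, a single-row tuneGrid fits one fixed parameter set with no
inner CV (the mtry value here is arbitrary; nestcv.train may set this
trControl for you in this situation, as described above):

## single fixed parameter set: inner CV disabled
fit <- nestcv.train(y2, x, method = "rf",
                    trControl = caret::trainControl(method = "none"),
                    tuneGrid = data.frame(mtry = 2))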
Value
An object with S3 class "nestcv.train"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, caret result and number of filtered predictors at each fold. |
outer_folds |
List of indices of outer test folds |
dimx |
dimensions of x |
xsub |
subset of x containing the predictors used in both the outer CV folds and the final model |
y |
original response vector |
yfinal |
final response vector (post-balancing) |
final_fit |
Final fitted caret model using best tune parameters |
final_vars |
Column names of filtered predictors entering final model |
summary_vars |
Summary statistics of filtered predictors |
roc |
ROC AUC for binary classification where available. |
trControl |
the trainControl object used for inner CV tuning |
bestTunes |
best tuned parameters from each outer fold |
finalTune |
final parameters used for final model |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
Author(s)
Myles Lewis
Examples
## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}
## load iris dataset and simulate a binary outcome
data(iris)
x <- iris[, 1:4]
colnames(x) <- c("marker1", "marker2", "marker3", "marker4")
x <- as.data.frame(apply(x, 2, scale))
y2 <- sigmoid(0.5 * x$marker1 + 2 * x$marker2) > runif(nrow(x))
y2 <- factor(y2, labels = c("class1", "class2"))
## Example using random forest with caret
cvrf <- nestcv.train(y2, x, method = "rf",
n_outer_folds = 3,
cv.cores = 2)
summary(cvrf)
## Example of glmnet tuned using caret
## set up small tuning grid for quick execution
## length.out of 20-100 is usually recommended for lambda
## and more alpha values ranging from 0-1
tg <- expand.grid(lambda = exp(seq(log(2e-3), log(1e0), length.out = 5)),
alpha = 1)
ncv <- nestcv.train(y = y2, x = x,
method = "glmnet",
n_outer_folds = 3,
tuneGrid = tg, cv.cores = 2)
summary(ncv)
## plot tuning for outer fold #1
plot(ncv$outer_result[[1]]$fit, xTrans = log)
## plot final ROC curve
plot(ncv$roc)
## plot ROC for left-out inner folds
inroc <- innercv_roc(ncv)
plot(inroc)
## example to show use of custom fold indices for 5 x 5-fold nested CV
library(caret)
y <- iris$Species
out_folds <- createFolds(y, k = 5)
in_folds <- lapply(out_folds, function(i) {
ytrain <- y[-i]
createFolds(ytrain, k = 5)
})
res <- nestcv.train(y, x, method="rf", cv.cores = 2,
pass_outer_folds = TRUE,
inner_folds = in_folds,
outer_folds = out_folds)
summary(res)
res$outer_folds
res$final_fit$control$indexOut # same as outer_folds
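## Further sketch showing embedded filtering, assuming the ttest_filter
## function supplied with nestedcv (nfilter value is illustrative)
fitf <- nestcv.train(y2, x, method = "rf",
                     filterFUN = ttest_filter,
                     filter_options = list(nfilter = 2),
                     n_outer_folds = 3, cv.cores = 2)
summary(fitf)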