outercv {nestedcv} | R Documentation |
Outer cross-validation of selected models
Description
This is a convenience function which uses a single loop of cross-validation to quickly evaluate the performance of specific models (random forest, naive Bayes, lm, glm) with fixed hyperparameters and no tuning. If hyperparameters need to be tuned on the data, full nested CV with an inner CV loop is required (see nestcv.train).
Usage
outercv(y, ...)
## Default S3 method:
outercv(
y,
x,
model,
filterFUN = NULL,
filter_options = NULL,
weights = NULL,
balance = NULL,
balance_options = NULL,
modifyX = NULL,
modifyX_useY = FALSE,
modifyX_options = NULL,
outer_method = c("cv", "LOOCV"),
n_outer_folds = 10,
outer_folds = NULL,
cv.cores = 1,
multicore_fork = (Sys.info()["sysname"] != "Windows"),
predict_type = "prob",
outer_train_predict = FALSE,
returnList = FALSE,
final = TRUE,
na.option = "pass",
verbose = FALSE,
suppressMsg = verbose,
...
)
## S3 method for class 'formula'
outercv(
formula,
data,
model,
outer_method = c("cv", "LOOCV"),
n_outer_folds = 10,
outer_folds = NULL,
cv.cores = 1,
multicore_fork = (Sys.info()["sysname"] != "Windows"),
predict_type = "prob",
outer_train_predict = FALSE,
verbose = FALSE,
suppressMsg = verbose,
...,
na.action = na.fail
)
Arguments
y |
Response vector |
... |
Optional arguments passed to the function specified by model |
x |
Matrix or dataframe of predictors |
model |
Character value or function of the model to be fitted. |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by filterFUN |
weights |
Weights applied to each sample for models which can use
weights. Note |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either "cv" or "LOOCV" |
n_outer_folds |
Number of outer CV folds |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
cv.cores |
Number of cores for parallel processing of the outer loops. |
multicore_fork |
Logical whether to use forked multicore parallel
processing. Forked multicore processing uses |
predict_type |
Only used with binary classification. Calculation of ROC
AUC requires predicted class probabilities from fitted models. Most model
functions use syntax of the form |
outer_train_predict |
Logical whether to save predictions on outer training folds to calculate performance on outer training folds. |
returnList |
Logical whether to return list of results after main outer CV loop without concatenating results. Useful for debugging. |
final |
Logical whether to fit final model. |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
suppressMsg |
Logical whether to suppress messages and printed output from model functions. This is necessary when using forked multicore parallelisation. |
formula |
A formula describing the model to be fitted |
data |
A matrix or data frame containing variables in the model. |
na.action |
Formula S3 method only: a function to specify the action to
be taken if NAs are found. The default action is for the procedure to fail.
An alternative is |
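As a hedged illustration of the outer_folds argument described above, the sketch below builds a list of test-fold indices in base R. The helper make_test_folds is hypothetical and not part of nestedcv. The guarded call to outercv assumes the nestedcv and randomForest packages are installed.

```r
# Hypothetical helper (not part of nestedcv): split rows 1..n into k test folds
make_test_folds <- function(n, k) {
  idx <- sample(seq_len(n))                    # shuffle the row indices
  unname(split(idx, rep_len(seq_len(k), n)))   # list of k index vectors
}

set.seed(1)
folds <- make_test_folds(150, 5)
lengths(folds)  # 30 row indices in each of the 5 test folds

# Assumption: nestedcv and randomForest are installed
if (requireNamespace("nestedcv", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  x <- iris[, 2:4]
  y <- iris$Sepal.Length
  # The same folds can be reused so different models see identical splits
  cvfit <- nestedcv::outercv(y, x, "randomForest", outer_folds = folds)
  summary(cvfit)
}
```

Supplying the same list of folds to several calls is one way to keep comparisons between models on identical train/test splits.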
Details
Some predictive model functions do not have an x & y interface. If the function specified by model requires a formula, x & y will be merged into a dataframe with model() called with a formula equivalent to y ~ . (the response regressed on all other columns).
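The merging step described above can be sketched in base R as follows. This is only an illustration of the idea, not nestedcv's internal code; the object names dat and fit are chosen for the example.

```r
# Sketch only: mimics the described x/y-to-formula fallback
x <- as.data.frame(scale(iris[, 1:3]))  # predictors
y <- iris$Petal.Width                   # continuous response

dat <- data.frame(y = y, x)    # merge response and predictors into one dataframe
fit <- lm(y ~ ., data = dat)   # 'y ~ .' uses every other column as a predictor
names(coef(fit))
```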
The S3 formula method for outercv is not recommended for large data sets. It is intended primarily for comparing the performance of more basic models, e.g. lm(), specified by formulae which, for example, incorporate interactions. NOTE: filtering is not available if outercv is called with a formula - use the x-y interface instead.
An alternative way to evaluate a single model with fixed parameters is to use nestcv.train with tuneGrid set as a single row of a data.frame. The parameters which are needed for a specific model can be identified using caret::modelLookup().
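A minimal sketch of the single-row tuneGrid approach, assuming the caret, nestedcv and randomForest packages are installed. The choice of method "rf" and the value mtry = 2 are illustrative assumptions, not recommendations from this page.

```r
# One row in the grid = one fixed parameter setting, so nothing is tuned
grid <- data.frame(mtry = 2)

# Assumption: caret is installed; lists the tunable parameters for method "rf"
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::modelLookup("rf"))
}

# Assumption: nestedcv, caret and randomForest are installed
if (requireNamespace("nestedcv", quietly = TRUE) &&
    requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  x <- iris[, 2:4]
  y <- iris$Sepal.Length
  fit <- nestedcv::nestcv.train(y, x, method = "rf", tuneGrid = grid)
  summary(fit)
}
```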
Case weights can be passed to model functions which accept them; however, outercv assumes that they are passed to the model via an argument named weights.
Note that in the case of model = "lm", although additional arguments e.g. subset, weights, offset are passed into the model function via "...", the scoping is known to go awry. Avoid using these arguments with model = "lm".
NA handling differs between the default S3 method and the formula S3 method. The na.option argument takes a character string, while the more typical na.action argument takes a function.
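To illustrate what it means for na.action to take a function, the base R sketch below applies two standard NA-handling functions, na.fail and na.omit, directly to a small made-up dataframe. It does not call outercv, and the data values are invented for the example.

```r
# Toy data with one missing response and one missing predictor value
dat <- data.frame(y  = c(1.2, 2.3, NA, 4.1),
                  x1 = c(0.5, NA, 1.5, 2.0))

# na.fail is a function which stops with an error if any NA is present
res <- tryCatch(na.fail(dat), error = function(e) "failed")
res

# na.omit is a function which drops incomplete rows instead
clean <- na.omit(dat)
nrow(clean)  # 2 complete rows remain

# By contrast, na.option in the default method is a character string,
# e.g. na.option = "pass" as shown in the Usage section
```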
Value
An object with S3 class "outercv"
call |
The matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, model result and number of filtered predictors at each fold. |
dimx |
Vector of number of observations and number of predictors |
outer_folds |
List of indices of outer test folds |
final_fit |
Final fitted model on whole data |
final_vars |
Column names of filtered predictors entering final model |
roc |
ROC AUC for binary classification where available. |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
Examples
## Classification example
## sigmoid function
sigmoid <- function(x) {1 / (1 + exp(-x))}
# load iris dataset and simulate a binary outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
x <- dt
y2 <- sigmoid(0.5 * dt$marker1 + 2 * dt$marker2) > runif(nrow(dt))
y2 <- factor(y2)
## Random forest
library(randomForest)
cvfit <- outercv(y2, x, "randomForest")
summary(cvfit)
plot(cvfit$roc)
## Mixture discriminant analysis (MDA)
if (requireNamespace("mda", quietly = TRUE)) {
library(mda)
cvfit <- outercv(y2, x, "mda", predict_type = "posterior")
summary(cvfit)
}
## Example with continuous outcome
y <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)
dt$outcome <- y
## simple linear model - formula interface
cvfit <- outercv(outcome ~ ., data = dt, model = "lm")
summary(cvfit)
## random forest for regression
cvfit <- outercv(y, x, "randomForest")
summary(cvfit)
## example with lm_filter() to reduce input predictors
cvfit <- outercv(y, x, "randomForest", filterFUN = lm_filter,
filter_options = list(nfilter = 2, p_cutoff = NULL))
summary(cvfit)