nestcv.glmnet {nestedcv} | R Documentation |
Nested cross-validation with glmnet
Description
This function enables nested cross-validation (CV) with glmnet, including tuning of the elastic net alpha parameter. It also allows optional embedded filtering of predictors for feature selection, nested within the outer loop of CV. Predictions on the outer test folds are collated and used to estimate error or accuracy. The default is 10x10 nested CV.
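The idea of nesting the filter inside the outer loop can be sketched in base R. This is an illustrative outline only, not the package's implementation: the t-statistic filter, the choice of 3 predictors, and the use of glm in place of glmnet (with no inner tuning loop) are all assumptions made to keep the sketch short.

```r
set.seed(1)
n <- 60; p <- 200                      # illustrative sizes
x <- matrix(rnorm(n * p), n, p)        # noise predictors
y <- rbinom(n, 1, 0.5)                 # binary response unrelated to x

# Outer folds as a list of test-row indices (hypothetical construction)
k <- 5
outer_folds <- split(sample(seq_len(n)), rep_len(seq_len(k), n))

pred <- numeric(n)
for (test in outer_folds) {
  train <- setdiff(seq_len(n), test)
  ytr <- y[train]
  # Filter on training rows only, so test rows never influence selection
  tstat <- apply(x[train, ], 2, function(col)
    abs(t.test(col[ytr == 1], col[ytr == 0])$statistic))
  top <- order(tstat, decreasing = TRUE)[1:3]
  # Stand-in model: plain logistic regression on the selected predictors
  dat <- data.frame(y = ytr, x[train, top, drop = FALSE])
  fit <- glm(y ~ ., data = dat, family = binomial)
  newdat <- data.frame(x[test, top, drop = FALSE])
  pred[test] <- predict(fit, newdata = newdat, type = "response")
}

# Predictions from all outer test folds are collated before scoring
accuracy <- mean((pred > 0.5) == (y == 1))
```

Because the response here is pure noise, an unbiased procedure should give accuracy near 0.5; filtering on the whole dataset before splitting would tend to inflate it.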
Usage
nestcv.glmnet(
  y,
  x,
  family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
  filterFUN = NULL,
  filter_options = NULL,
  balance = NULL,
  balance_options = NULL,
  modifyX = NULL,
  modifyX_useY = FALSE,
  modifyX_options = NULL,
  outer_method = c("cv", "LOOCV"),
  n_outer_folds = 10,
  n_inner_folds = 10,
  outer_folds = NULL,
  pass_outer_folds = FALSE,
  alphaSet = seq(0.1, 1, 0.1),
  min_1se = 0,
  keep = TRUE,
  outer_train_predict = FALSE,
  weights = NULL,
  penalty.factor = rep(1, ncol(x)),
  cv.cores = 1,
  finalCV = TRUE,
  na.option = "omit",
  verbose = FALSE,
  ...
)
Arguments
y |
Response vector or matrix. Matrix is only used for
|
x |
Matrix of predictors. Data frames are coerced to a matrix, as glmnet requires. |
family |
Either a character string representing one of the built-in
families, or else a |
filterFUN |
Filter function, e.g. ttest_filter or relieff_filter.
Any function can be provided and is passed |
filter_options |
List of additional arguments passed to the filter
function specified by |
balance |
Specifies method for dealing with imbalanced class data.
Current options are |
balance_options |
List of additional arguments passed to the balancing function |
modifyX |
Character string specifying the name of a function to modify
|
modifyX_useY |
Logical value whether the |
modifyX_options |
List of additional arguments passed to the |
outer_method |
String of either |
n_outer_folds |
Number of outer CV folds |
n_inner_folds |
Number of inner CV folds |
outer_folds |
Optional list containing indices of test folds for outer
CV. If supplied, |
pass_outer_folds |
Logical indicating whether the same outer folds are
used for fitting of the final model when final CV is applied. Note this can
only be applied when |
alphaSet |
Vector of alphas to be tuned |
min_1se |
Value from 0 to 1 specifying choice of optimal lambda from 0=lambda.min to 1=lambda.1se |
keep |
Logical indicating whether inner CV predictions are retained for
calculating left-out inner CV fold accuracy etc. See argument |
outer_train_predict |
Logical indicating whether to save predictions on the outer training folds so that training-fold performance can be calculated. |
weights |
Weights applied to each sample. Note |
penalty.factor |
Separate penalty factors can be applied to each
coefficient. Can be 0 for some variables, which implies no shrinkage, and
that variable is always included in the model. Default is 1 for all
variables. See glmnet::glmnet. Note this works separately from filtering.
For some |
cv.cores |
Number of cores for parallel processing of the outer loops.
NOTE: this uses |
finalCV |
Logical whether to perform one last round of CV on the whole
dataset to determine the final model parameters. If set to |
na.option |
Character value specifying how |
verbose |
Logical whether to print messages and show progress |
... |
Optional arguments passed to glmnet::cv.glmnet |
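As a rough illustration of two of the arguments above, the following base R sketch builds a list of outer test-fold indices and a penalty factor vector. The sizes, the fold construction via split(), and the choice of unpenalised columns are hypothetical; the exact list format accepted by outer_folds is assumed from its description (a list containing indices of test folds).

```r
set.seed(123)
n <- 30   # number of samples (illustrative)
p <- 8    # number of predictors (illustrative)

# outer_folds: one list element per fold, each holding test-row indices
k <- 5
outer_folds <- split(sample(seq_len(n)), rep_len(seq_len(k), n))

# penalty.factor: 1 for ordinary shrinkage, 0 to leave a predictor unpenalised
unpenalised <- c(1, 2)   # hypothetical column indices to always keep
pf <- rep(1, p)
pf[unpenalised] <- 0
```

These objects could then be supplied as outer_folds = outer_folds and penalty.factor = pf in a call to nestcv.glmnet, with x having p columns and y having n elements.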
Details
glmnet does not tolerate missing values, so na.option = "omit" is the default.
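What omitting missing values amounts to can be sketched in base R with complete.cases. This assumes that "omit" means dropping every sample with a missing value in either the predictors or the response; the toy data are made up and the package's internal handling may differ in detail.

```r
# Toy data: row 2 has a missing predictor, row 3 a missing response
x <- matrix(c(1, NA, 3, 4,
              5, 6, 7, 8), nrow = 4)
y <- c(0, 1, NA, 1)

# Keep only samples that are complete in both x and y
ok <- complete.cases(x, y)
x_clean <- x[ok, , drop = FALSE]
y_clean <- y[ok]
```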
Value
An object with S3 class "nestcv.glmnet"
call |
the matched call |
output |
Predictions on the left-out outer folds |
outer_result |
List object of results from each outer fold containing predictions on left-out outer folds, best lambda, best alpha, fitted glmnet coefficients, list object of inner fitted cv.glmnet and number of filtered predictors at each fold. |
outer_method |
the |
n_inner_folds |
number of inner folds |
outer_folds |
List of indices of outer test folds |
dimx |
dimensions of |
xsub |
subset of |
y |
original response vector |
yfinal |
final response vector (post-balancing) |
final_param |
Final mean best lambda and alpha from each fold |
final_fit |
Final fitted glmnet model |
final_coef |
Final model coefficients and mean expression. Variables with coefficients shrunk to 0 are removed. |
final_vars |
Column names of filtered predictors entering final model. This is useful for subsetting new data for predictions. |
roc |
ROC AUC for binary classification where available. |
summary |
Overall performance summary. Accuracy and balanced accuracy for classification. ROC AUC for binary classification. RMSE for regression. |
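The metrics named in the summary element can be computed by hand as a cross-check. The sketch below uses the standard definitions on made-up vectors; it is not taken from the package source, and the variable names are illustrative.

```r
# Classification: observed and predicted classes (made-up values)
obs  <- factor(c("a", "a", "a", "a", "b", "b"))
pred <- factor(c("a", "a", "a", "b", "b", "a"), levels = levels(obs))

tab <- table(pred, obs)                  # confusion matrix, columns = observed
accuracy <- sum(diag(tab)) / sum(tab)    # proportion correct overall
recall <- diag(tab) / colSums(tab)       # proportion correct within each class
balanced_accuracy <- mean(recall)        # unweighted mean of per-class recall

# Regression: root mean squared error (made-up values)
truth <- c(1, 2, 3)
est   <- c(1, 2, 5)
rmse <- sqrt(mean((truth - est)^2))
```

Balanced accuracy weights each class equally, so it is the more informative figure when classes are imbalanced.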
Author(s)
Myles Lewis
Examples
## Example binary classification problem with P >> n
x <- matrix(rnorm(150 * 2e+04), 150, 2e+04) # predictors
y <- factor(rbinom(150, 1, 0.5)) # binary response
## Partition data into 2/3 training set, 1/3 test set
trainSet <- caret::createDataPartition(y, p = 0.66, list = FALSE)
## t-test filter using whole dataset
filt <- ttest_filter(y, x, nfilter = 100)
filx <- x[, filt]
## Train glmnet on training set only using filtered predictor matrix
library(glmnet)
fit <- cv.glmnet(filx[trainSet, ], y[trainSet], family = "binomial")
plot(fit)
## Predict response on test partition
predy <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "class")
predy <- as.vector(predy)
predyp <- predict(fit, newx = filx[-trainSet, ], s = "lambda.min", type = "response")
predyp <- as.vector(predyp)
output <- data.frame(testy = y[-trainSet], predy = predy, predyp = predyp)
## Results on test partition
## shows bias since univariate filtering was applied to whole dataset
predSummary(output)
## Nested CV
## n_outer_folds reduced to speed up example
fit2 <- nestcv.glmnet(y, x, family = "binomial", alphaSet = 1,
                      n_outer_folds = 3,
                      filterFUN = ttest_filter,
                      filter_options = list(nfilter = 100),
                      cv.cores = 2)
summary(fit2)
plot_lambdas(fit2, showLegend = "bottomright")
## ROC plots
library(pROC)
testroc <- roc(output$testy, output$predyp, direction = "<")
inroc <- innercv_roc(fit2)
plot(fit2$roc)
lines(inroc, col = 'blue')
lines(testroc, col = 'red')
legend('bottomright', legend = c("Nested CV", "Left-out inner CV folds",
                                 "Test partition, non-nested filtering"),
       col = c("black", "blue", "red"), lty = 1, lwd = 2, bty = "n")
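## Sketch of alpha tuning with cv.glmnet
The kind of tuning that alphaSet controls can be sketched directly with glmnet::cv.glmnet. This is a minimal illustration, assuming a small simulated dataset, a reduced set of three alphas, and a shared foldid so that the alphas are compared on identical folds; it is not the package's internal code.

```r
library(glmnet)

set.seed(42)
n <- 100; p <- 20                                   # illustrative sizes
x <- matrix(rnorm(n * p), n, p)
y <- factor(rbinom(n, 1, plogis(x[, 1] - x[, 2])))  # signal in first two columns

# Fix fold membership so every alpha is evaluated on the same splits
foldid <- sample(rep_len(1:5, n))
alphas <- c(0.1, 0.5, 1)

# One cv.glmnet fit per alpha
cv_fits <- lapply(alphas, function(a)
  cv.glmnet(x, y, family = "binomial", alpha = a, foldid = foldid))

# Choose the alpha whose best lambda gives the lowest CV error
cv_err <- sapply(cv_fits, function(f) min(f$cvm))
best <- which.min(cv_err)
best_alpha <- alphas[best]
best_lambda <- cv_fits[[best]]$lambda.min
```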