cross_validate {cvms}

R Documentation

Cross-validate regression models for model selection

Description

Cross-validate one or multiple linear or logistic regression models at once. Supports repeated cross-validation. Returns the results in a tibble for easy comparison, reporting, and further analysis.

See cross_validate_fn() for use with custom model functions.

Usage

cross_validate(
  data,
  formulas,
  family,
  fold_cols = ".folds",
  control = NULL,
  REML = FALSE,
  cutoff = 0.5,
  positive = 2,
  metrics = list(),
  preprocessing = NULL,
  rm_nc = FALSE,
  parallel = FALSE,
  verbose = FALSE,
  link = deprecated(),
  models = deprecated(),
  model_verbose = deprecated()
)

Arguments

data

data.frame.

Must include one or more grouping factors for identifying folds - as made with groupdata2::fold().

formulas

Model formulas as strings. (Character)

E.g. c("y~x", "y~z").

Can contain random effects.

E.g. c("y~x+(1|r)", "y~z+(1|r)").

family

Name of the family. (Character)

Currently supports "gaussian" for linear regression with lm() / lme4::lmer() and "binomial" for binary classification with glm() / lme4::glmer().

See cross_validate_fn() for use with other model functions.

fold_cols

Name(s) of grouping factor(s) for identifying folds. (Character)

Include names of multiple grouping factors for repeated cross-validation.
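As a minimal sketch of repeated cross-validation, the fold columns can be created with the num_fold_cols argument of groupdata2::fold() and then passed by name. The sketch assumes that fold() names the columns ".folds_1", ".folds_2", and so on, and the object names fold_col_names and cv_repeated are illustrative only.

```r
library(cvms)
library(groupdata2)

set.seed(7)

# Create three fold columns for three repetitions
# (assumed to be named .folds_1, .folds_2, .folds_3)
data <- fold(
  participant.scores,
  k = 4,
  cat_col = "diagnosis",
  id_col = "participant",
  num_fold_cols = 3
)

# Names of the grouping factors to pass as fold_cols
fold_col_names <- paste0(".folds_", 1:3)

# One cross-validation per fold column; results are averaged
cv_repeated <- cross_validate(
  data,
  formulas = "score~diagnosis",
  family = "gaussian",
  fold_cols = fold_col_names
)
```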

control

Construct control structures for mixed model fitting (with lme4::lmer() or lme4::glmer()). See lme4::lmerControl and lme4::glmerControl.

N.B. Ignored if fitting lm() or glm() models.
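A control structure for a mixed model might be passed as sketched below. The optimizer and iteration limit are arbitrary illustrative choices, not recommended settings, and the names ctrl and cv_mixed are hypothetical.

```r
library(cvms)
library(groupdata2)
library(lme4)

set.seed(7)

data <- fold(
  participant.scores,
  k = 4,
  cat_col = "diagnosis",
  id_col = "participant"
)

# Control structure for lme4::lmer() (illustrative settings)
ctrl <- lmerControl(
  optimizer = "bobyqa",
  optCtrl = list(maxfun = 1e5)
)

# The formula has a random effect, so a mixed model is fitted
# and the control structure is used
cv_mixed <- cross_validate(
  data,
  formulas = "score~diagnosis+(1|session)",
  family = "gaussian",
  control = ctrl
)
```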

REML

Whether to use Restricted Maximum Likelihood. (Logical)

cutoff

Threshold for predicted classes. (Numeric)

N.B. Binomial models only.

positive

Level from dependent variable to predict. Either as character (preferable) or level index (1 or 2 - alphabetically).

E.g. if we have the levels "cat" and "dog" and we want "dog" to be the positive class, we can either provide "dog" or 2, as alphabetically, "dog" comes after "cat".

Note: For reproducibility, it's preferable to specify the name directly, as different locales may sort the levels differently.

Used when calculating confusion matrix metrics and creating ROC curves.

The Process column in the output can be used to verify this setting.

N.B. Only affects evaluation metrics, not the model training or returned predictions.

N.B. Binomial models only.
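The interplay between cutoff and positive can be illustrated in base R. This is a sketch of the idea only, not the package's internal code: the probabilities are made up, and whether a probability exactly equal to the cutoff is assigned to the second level is an assumption.

```r
# Levels of the dependent variable, sorted alphabetically
target_levels <- sort(c("dog", "cat"))  # "cat" "dog"

# Made-up predicted probabilities of the second level ("dog")
probabilities <- c(0.10, 0.45, 0.50, 0.80)
cutoff <- 0.5

# Assumption: probabilities strictly above the cutoff
# are assigned to the second level
predicted_class <- ifelse(
  probabilities > cutoff,
  target_levels[2],
  target_levels[1]
)

# A level index given as `positive` resolves to a name
# through the alphabetical order of the levels
positive_index <- 2
positive_name <- target_levels[positive_index]
```

Because the index depends on how the levels are sorted, passing the name ("dog") avoids any dependence on the locale.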

metrics

list for enabling/disabling metrics.

E.g. list("RMSE" = FALSE) would remove RMSE from the results, and list("Accuracy" = TRUE) would add the regular Accuracy metric to the classification results. Default values (TRUE/FALSE) will be used for the remaining available metrics.

You can enable/disable all metrics at once by including "all" = TRUE/FALSE in the list. This is done prior to enabling/disabling individual metrics, which is why list("all" = FALSE, "RMSE" = TRUE) would return only the RMSE metric.

The list can be created with gaussian_metrics() or binomial_metrics().

Also accepts the string "all".
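The two list forms described above might be used as follows. This is a sketch using the participant.scores data from the examples; the object names cv_rmse_only and cv_with_accuracy are illustrative.

```r
library(cvms)
library(groupdata2)

set.seed(7)

data <- fold(
  participant.scores,
  k = 4,
  cat_col = "diagnosis",
  id_col = "participant"
)

# Disable all metrics, then re-enable only RMSE
cv_rmse_only <- cross_validate(
  data,
  formulas = "score~diagnosis",
  family = "gaussian",
  metrics = list("all" = FALSE, "RMSE" = TRUE)
)

# Keep the default metrics and add the regular Accuracy
cv_with_accuracy <- cross_validate(
  data,
  formulas = "diagnosis~score",
  family = "binomial",
  metrics = list("Accuracy" = TRUE)
)
```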

preprocessing

Name of preprocessing to apply.

Available preprocessings are:

Name	Description
"standardize"	Centers and scales the numeric predictors.
"range"	Normalizes the numeric predictors to the `0`-`1` range. Values outside the min/max range in the test fold are truncated to `0`/`1`.
"scale"	Scales the numeric predictors to have a standard deviation of one.
"center"	Centers the numeric predictors to have a mean of zero.

The preprocessing parameters (mean, SD, etc.) are extracted from the training folds and applied to both the training folds and the test fold. They are returned in the Preprocess column for inspection.

N.B. The preprocessings should not affect the results to a noticeable degree, although "range" might due to the truncation.
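The principle of extracting parameters from the training folds and applying them to the test fold can be sketched in base R. The numbers are made up, and the use of the sample standard deviation is an assumption; the package's own implementation may differ in detail.

```r
# Made-up values of one numeric predictor
train <- c(2, 4, 6, 8)
test <- c(1, 5, 10)

# "standardize": parameters come from the training data only
train_mean <- mean(train)
train_sd <- sd(train)
test_standardized <- (test - train_mean) / train_sd

# "range": min/max come from the training data only
train_min <- min(train)
train_max <- max(train)
test_range <- (test - train_min) / (train_max - train_min)

# Test values outside the training range are truncated to 0/1
test_range <- pmin(pmax(test_range, 0), 1)
```

Here the test values 1 and 10 lie outside the training range of 2 to 8, so they end up at 0 and 1 after truncation. This is the information loss that can make "range" affect the results.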

rm_nc

Remove non-converged models from output. (Logical)

parallel

Whether to cross-validate the list of models in parallel. (Logical)

Remember to register a parallel backend first. E.g. with doParallel::registerDoParallel.

verbose

Whether to message process information like the number of model instances to fit and which model function was applied. (Logical)

link, models, model_verbose

Deprecated.

Details

Packages used:

Models

Gaussian: stats::lm, lme4::lmer

Binomial: stats::glm, lme4::glmer

Results

Shared

AIC : stats::AIC

AICc : MuMIn::AICc

BIC : stats::BIC

Gaussian

r2m : MuMIn::r.squaredGLMM

r2c : MuMIn::r.squaredGLMM

Binomial

ROC and AUC: pROC::roc

Value

tibble with results for each model.

Shared across families

A nested tibble with coefficients of the models from all iterations.

Number of total folds.

Number of fold columns.

Count of convergence warnings. Consider discarding models that did not converge on all iterations. Note: you might still see results, but these should be taken with a grain of salt!

Count of other warnings. These are warnings without keywords such as "convergence".

Count of Singular Fit messages. See lme4::isSingular for more information.

Nested tibble with the warnings and messages caught for each model.

A nested Process information object with information about the evaluation.

Name of dependent variable.

Names of fixed effects.

Names of random effects, if any.

Nested tibble with preprocessing parameters, if any.

Gaussian Results

Average RMSE, MAE, NRMSE(IQR), RRSE, RAE, RMSLE, AIC, AICc, and BIC of all the iterations*, omitting potential NAs from non-converged iterations. Note that the Information Criterion metrics (AIC, AICc, and BIC) are also averages.

See the additional metrics (disabled by default) at ?gaussian_metrics.

A nested tibble with the predictions and targets.

A nested tibble with the non-averaged results from all iterations.

* In repeated cross-validation, the metrics are first averaged for each fold column (repetition) and then averaged again.
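The two-step averaging for repeated cross-validation can be sketched in base R with made-up RMSE values for two fold columns of four folds each. The data frame and its column names are hypothetical and do not mirror the package's internal structures.

```r
# Made-up per-fold RMSE values for two repetitions
fold_results <- data.frame(
  fold_column = rep(c(".folds_1", ".folds_2"), each = 4),
  fold = rep(1:4, times = 2),
  RMSE = c(10, 12, 14, 16, 11, 13, 15, 21)
)

# Step 1: average within each fold column (repetition)
per_repetition <- aggregate(
  RMSE ~ fold_column,
  data = fold_results,
  FUN = mean
)

# Step 2: average the per-repetition averages
overall_rmse <- mean(per_repetition$RMSE)
```

With these values the per-repetition averages are 13 and 15, and the overall RMSE is 14.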

Binomial Results

Based on the collected predictions from the test folds*, a confusion matrix and a ROC curve are created to get the following:

ROC:

AUC, Lower CI, and Upper CI

Confusion Matrix:

Balanced Accuracy, F1, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Kappa, Detection Rate, Detection Prevalence, Prevalence, and MCC (Matthews correlation coefficient).

See the additional metrics (disabled by default) at ?binomial_metrics.

Also includes:

A nested tibble with predictions, predicted classes (depends on cutoff), and the targets. Note that the predictions are not necessarily of the specified positive class, but of the model's positive class (second level of dependent variable, alphabetically).

The pROC::roc ROC curve object(s).

A nested tibble with the confusion matrix/matrices. The Pos_ columns tell you whether a row is a True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN), depending on which level is the "positive" class, i.e. the level you wish to predict.

A nested tibble with the results from all fold columns.

The name of the Positive Class.

* In repeated cross-validation, an evaluation is made per fold column (repetition) and averaged.
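How the confusion matrix counts depend on the positive class can be sketched in base R. The targets and predicted classes are made up, and only three of the listed metrics are computed, using their standard definitions rather than the package's internal code.

```r
# Made-up targets and predicted classes collected from the test folds
targets   <- c("dog", "dog", "dog", "cat", "cat", "cat", "cat", "dog")
predicted <- c("dog", "dog", "cat", "cat", "cat", "dog", "cat", "dog")
positive <- "dog"

# Confusion matrix counts relative to the positive class
tp <- sum(predicted == positive & targets == positive)
fn <- sum(predicted != positive & targets == positive)
fp <- sum(predicted == positive & targets != positive)
tn <- sum(predicted != positive & targets != positive)

# Standard definitions of three of the listed metrics
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2
```

Setting positive to "cat" instead would swap the roles of TP and TN and of FP and FN, and thereby swap sensitivity and specificity, while the balanced accuracy would stay the same.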

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Benjamin Hugh Zachariae

Examples


# Attach packages
library(cvms)
library(groupdata2) # fold()
library(dplyr) # %>% arrange()

# Data is part of cvms
data <- participant.scores

# Set seed for reproducibility
set.seed(7)

# Fold data
data <- fold(
  data,
  k = 4,
  cat_col = "diagnosis",
  id_col = "participant"
) %>%
  arrange(.folds)

#
# Cross-validate a single model
#

# Gaussian
cross_validate(
  data,
  formulas = "score~diagnosis",
  family = "gaussian",
  REML = FALSE
)

# Binomial
cross_validate(
  data,
  formulas = "diagnosis~score",
  family = "binomial"
)

#
# Cross-validate multiple models
#

formulas <- c(
  "score~diagnosis+(1|session)",
  "score~age+(1|session)"
)

cross_validate(
  data,
  formulas = formulas,
  family = "gaussian",
  REML = FALSE
)

#
# Use parallelization
#

# Attach doParallel and register four cores
# Uncomment:
# library(doParallel)
# registerDoParallel(4)

# Cross-validate a list of model formulas in parallel
# Make sure to uncomment the parallel argument
cross_validate(
  data,
  formulas = formulas,
  family = "gaussian"
  #, parallel = TRUE  # Uncomment
)

[Package cvms version 1.6.1 Index]
