cv {cv}

R Documentation

Cross-Validate Regression Models

Description

cv() is a parallelized generic k-fold (including n-fold, i.e., leave-one-out) cross-validation function, with a default method, specific methods for linear and generalized-linear models that can be much more computationally efficient, and a method for robust linear models. There are also cv() methods for mixed-effects models, for model-selection procedures, and for several models fit to the same data, which are documented separately.

Usage

cv(model, data, criterion, k, reps = 1L, seed, ...)

## Default S3 method:
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  criterion.name = deparse(substitute(criterion)),
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  ncores = 1L,
  type = "response",
  start = FALSE,
  model.function,
  ...
)

## S3 method for class 'lm'
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  method = c("auto", "hatvalues", "Woodbury", "naive"),
  ncores = 1L,
  ...
)

## S3 method for class 'glm'
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  method = c("exact", "hatvalues", "Woodbury"),
  ncores = 1L,
  start = FALSE,
  ...
)

## S3 method for class 'rlm'
cv(model, data, criterion, k, reps = 1L, seed, ...)

## S3 method for class 'cv'
print(x, digits = getOption("digits"), ...)

## S3 method for class 'cvList'
print(x, ...)

## S3 method for class 'cv'
as.data.frame(
  x,
  row.names = NULL,
  optional = TRUE,
  rows = c("cv", "folds"),
  columns = c("criteria", "coefficients"),
  ...
)

## S3 method for class 'cvList'
as.data.frame(x, row.names = NULL, optional = TRUE, ...)

## S3 method for class 'cvDataFrame'
print(x, digits = getOption("digits") - 2L, ...)

## S3 method for class 'cvDataFrame'
summary(
  object,
  formula,
  subset = NULL,
  fun = mean,
  include = c("cv", "folds", "all"),
  ...
)

Arguments

`model`	a regression model object (see Details).
`data`	data frame to which the model was fit (not usually necessary).
`criterion`	cross-validation criterion ("cost" or lack-of-fit) function of form `f(y, yhat)` where `y` is the observed values of the response and `yhat` the predicted values; the default is `mse` (the mean-squared error).
`k`	perform k-fold cross-validation (default is `10`); `k` may be a number or `"loo"` or `"n"` for n-fold (leave-one-out) cross-validation.
`reps`	number of times to replicate k-fold CV (default is `1`).
`seed`	seed for R's random-number generator; optional: if not supplied, a random seed is selected and saved; not needed for n-fold cross-validation.
`...`	to match generic; passed to `predict()` for the default `cv()` method; passed to the `Tapply()` function in the car package for `summary.cvDataFrame()`.
`criterion.name`	a character string giving the name of the CV criterion function in the returned `"cv"` object (not usually needed).
`details`	if `TRUE` (the default if the number of folds `k <= 10`), save detailed information about the value of the CV criterion for the cases in each fold and the regression coefficients with that fold deleted.
`confint`	if `TRUE` (the default if the number of cases is 400 or greater), compute a confidence interval for the bias-corrected CV criterion, if the criterion is the average of casewise components.
`level`	confidence level (default `0.95`).
`ncores`	number of cores to use for parallel computations (default is `1`, i.e., computations aren't done in parallel).
`type`	for the default method, value to be passed to the `type` argument of `predict()`; the default is `type="response"`, which is appropriate, e.g., for a `"glm"` model and may be recognized or ignored by `predict()` methods for other model classes.
`start`	if `TRUE` (the default is `FALSE`), the `start` argument to `update()` is set to the vector of regression coefficients for the model fit to the full data, possibly making the CV updates faster, e.g., for a GLM.
`model.function`	a regression function, typically for a new `cv()` method that calls `cv.default()` via `NextMethod()`, residing in a package that's not a declared dependency of the cv package, e.g., `nnet::multinom`. It's usually not necessary to specify `model.function` to make `cv.default()` work.
`method`	computational method to apply to a linear (i.e., `"lm"`) model or to a generalized linear (i.e., `"glm"`) model. See Details for an explanation of the available options.
`x`	a `"cv"`, `"cvList"`, or `"cvDataFrame"` object to be printed or coerced to a data frame.
`digits`	significant digits for printing, default taken from the `"digits"` option.
`row.names`	optional row names for the result, defaults to `NULL`.
`optional`	to match the `as.data.frame()` generic function; if `FALSE` (the default is `TRUE`), then the names of the columns of the returned data frame, including the names of coefficients, are coerced to syntactically correct names.
`rows`	the rows of the resulting data frame to retain: setting `rows="cv"` retains rows pertaining to the overall CV result (marked as "`fold 0`"); setting `rows="folds"` retains rows pertaining to individual folds 1 through k; the default is `rows = c("cv", "folds")`, which retains all rows.
`columns`	the columns of the resulting data frame to retain: setting `columns="criteria"` retains columns pertaining to CV criteria; setting `columns="coefficients"` retains columns pertaining to model coefficients (broadly construed); the default is `columns = c("criteria", "coefficients")`, which retains both. The columns `"model"`, `"rep"`, and `"fold"`, if present, are always retained.
`object`	an object inheriting from `"cvDataFrame"` to summarize.
`formula`	of the form `some.criterion ~ classifying.variable(s)` (see examples).
`subset`	a subsetting expression; the default (`NULL`) is not to subset the `"cvDataFrame"` object.
`fun`	summary function to apply, defaulting to `mean`.
`include`	which rows of the `"cvDataFrame"` to include in the summary. One of `"cv"` (the default), rows representing the overall CV results; `"folds"`, rows for individual folds; `"all"`, all rows (generally not sensible).
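
As an illustration of the `criterion` argument described above, a cost function only needs to accept the observed and predicted responses, in that order, and return a single number. The name `mae` below is hypothetical (it is not claimed to be supplied by the cv package), the built-in `mtcars` data are used purely for convenience, and the call to `cv()` is guarded so that the sketch still runs where the package is not installed.

```r
# Hypothetical criterion in the f(y, yhat) form: mean absolute error
mae <- function(y, yhat) mean(abs(y - yhat))

# Quick check on toy values: absolute errors are 0, 0, and 2
toy <- mae(c(1, 2, 3), c(1, 2, 5))

m <- lm(mpg ~ wt, data = mtcars)
if (requireNamespace("cv", quietly = TRUE)) {
  # The criterion is passed as a function, as in the Usage section
  print(cv::cv(m, criterion = mae, k = 5L, seed = 123))
}
```

Because `mae` is a plain function of `y` and `yhat`, the same pattern should apply to any other scalar lack-of-fit measure.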

Details

The default cv() method uses update() to refit the model to each fold, and should work if there are appropriate update() and predict() methods, and if the default method for GetResponse() works or if a GetResponse() method is supplied.
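
The refit-per-fold strategy just described can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `naive_cv` is a hypothetical helper, it assumes no cases were dropped for missing data, and it applies the criterion to the pooled held-out predictions.

```r
# Naive k-fold CV by refitting with update(); a sketch, not the package's code
naive_cv <- function(model, data, k = 10L, seed = 1L,
                     criterion = function(y, yhat) mean((y - yhat)^2)) {
  set.seed(seed)
  n <- nrow(data)
  # Randomly assign each case to one of k folds of nearly equal size
  fold <- sample(rep_len(seq_len(k), n))
  # Observed response (assumes no rows were dropped for NAs)
  y <- model.response(model.frame(model))
  yhat <- numeric(n)
  for (j in seq_len(k)) {
    held_out <- fold == j
    # Refit with fold j deleted, then predict the held-out cases
    fit_j <- update(model, data = data[!held_out, , drop = FALSE])
    yhat[held_out] <- predict(fit_j,
                              newdata = data[held_out, , drop = FALSE],
                              type = "response")
  }
  criterion(y, yhat)
}

m <- lm(mpg ~ wt + hp, data = mtcars)
naive_cv(m, mtcars, k = 5L)
```

The sketch relies only on `update()` and `predict()` methods being available for the model class, which is the same requirement stated above for the default method.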

The "lm" and "glm" methods can use much faster computational algorithms, as selected by the method argument. The linear-model method accommodates weighted linear models.

For both classes of models, in the leave-one-out (n-fold) case, fitted values for the folds can be computed from the hat-values via method="hatvalues" without refitting the model; this method is exact for LMs and approximate for GLMs.
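
For a linear model, the hat-value shortcut rests on the standard identity that the leave-one-out prediction error for case i equals e_i / (1 - h_i), where e_i is the ordinary residual and h_i the hat-value. The following base-R sketch (using the built-in `mtcars` data as a stand-in, and not taken from the package's source) checks the identity against explicit refitting.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)

# LOO prediction error for case i is e_i / (1 - h_i): no refitting needed
e <- residuals(m)
h <- hatvalues(m)
loo_mse_hat <- mean((e / (1 - h))^2)

# Brute-force check: delete each case in turn, refit, and predict it
loo_err <- vapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  unname(mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))
}, numeric(1))
loo_mse_refit <- mean(loo_err^2)

# The two computations agree up to rounding error
all.equal(loo_mse_hat, loo_mse_refit)
```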

Again for both classes of models, when more than one case is omitted in each fold, fitted values may be obtained without refitting the model by exploiting the Woodbury matrix identity via method="Woodbury". As with method="hatvalues", this method is exact for LMs and approximate for GLMs.
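
To make the Woodbury approach concrete for a linear model: if S indexes the cases in a fold, the coefficients with that fold deleted are b - (X'X)^{-1} X_S' (I - H_SS)^{-1} e_S, where H_SS is the block of the hat matrix for the fold and e_S the corresponding full-data residuals. The base-R sketch below verifies this for one fold; it illustrates the identity under these assumptions (unweighted model, built-in `mtcars` data, arbitrary seed) and is not the package's code.

```r
m <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(m)
y <- mtcars$mpg
b <- coef(m)
XtX_inv <- solve(crossprod(X))      # (X'X)^{-1}, computed once from the full data

set.seed(42)
k <- 5L
fold <- sample(rep_len(seq_len(k), nrow(X)))

S <- which(fold == 1L)               # cases in the first fold
X_S <- X[S, , drop = FALSE]
e_S <- y[S] - drop(X_S %*% b)        # full-data residuals for the fold
H_SS <- X_S %*% XtX_inv %*% t(X_S)   # hat-matrix block for the fold

# Coefficients with the fold deleted, via the Woodbury update
b_del <- b - drop(XtX_inv %*% t(X_S) %*% solve(diag(length(S)) - H_SS, e_S))

# Check against an explicit refit without the fold
b_refit <- coef(lm(mpg ~ wt + hp, data = mtcars[-S, ]))
all.equal(unname(b_del), unname(b_refit))
```

Only a small linear system of the fold's size is solved for each fold, which is why this route can be much cheaper than refitting.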

The default for linear models is method="auto", which is equivalent to method="hatvalues" for n-fold cross-validation and method="Woodbury" otherwise; method="naive" refits the model via update() and is generally much slower. The default for generalized linear models is method="exact", which employs update(). This default is conservative, and it is usually safe to use method="hatvalues" for n-fold CV or method="Woodbury" for k-fold CV.

There is also a method for robust linear models fit by rlm() in the MASS package (to avoid inheriting the "lm" method for which the default "auto" computational method would be inappropriate).

For additional details, see the "Cross-validating regression models" vignette (vignette("cv", package="cv")).

cv() is designed to be extensible to other classes of regression models; see the "Extending the cv package" vignette (vignette("cv-extend", package="cv")).

Value

The cv() methods return an object of class "cv", with the CV criterion ("CV crit"), the bias-adjusted CV criterion ("adj CV crit"), the criterion for the model applied to the full data ("full crit"), the confidence interval and level for the bias-adjusted CV criterion ("confint"), the number of folds ("k"), and the seed for R's random-number generator ("seed"). If details=TRUE, then the returned object will also include a "details" component, which is a list of two elements: "criterion", containing the CV criterion computed for the cases in each fold; and "coefficients", regression coefficients computed for the model with each fold deleted. Some methods may return a subset of these components and may add additional information. If reps > 1, then an object of class "cvList" is returned, which is literally a list of "cv" objects.

Methods (by class)

cv(default): "default" method.
cv(lm): "lm" method.
cv(glm): "glm" method.
cv(rlm): "rlm" method (to avoid inheriting the "lm" method).

Methods (by generic)

print(cv): print() method for "cv" objects.
as.data.frame(cv): as.data.frame() method for "cv" objects.

Functions

print(cvList): print() method for "cvList" objects.
as.data.frame(cvList): as.data.frame() method for "cvList" objects.
print(cvDataFrame): print() method for "cvDataFrame" objects.
summary(cvDataFrame): summary() method for "cvDataFrame" objects.

Examples

data("Auto", package="ISLR2")
m.auto <- lm(mpg ~ horsepower, data=Auto)
cv(m.auto, k="loo") # leave-one-out CV
(cv.auto <- cv(m.auto, seed=1234)) # default 10-fold CV
compareFolds(cv.auto) # compare results across folds
(cv.auto.reps <- cv(m.auto, seed=1234, reps=3)) # 3 replications
D.auto.reps <- as.data.frame(cv.auto.reps)
head(D.auto.reps)
summary(D.auto.reps, mse ~ rep + fold, include="folds")
summary(D.auto.reps, mse ~ rep + fold, include = "folds",
        subset = fold <= 5) # first 5 folds
summary(D.auto.reps, mse ~ rep, include="folds") # mean over folds
summary(D.auto.reps, mse ~ rep, fun=sd, include="folds") # SD over folds

data("Mroz", package="carData")
m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
cv(m.mroz, criterion=BayesRule, seed=123)

data("Duncan", package="carData")
m.lm <- lm(prestige ~ income + education, data=Duncan)
m.rlm <- MASS::rlm(prestige ~ income + education,
                   data=Duncan)
cv(m.lm, k="loo", method="Woodbury") # "lm" method
cv(m.rlm, k="loo") # "rlm" method

[Package cv version 2.0.0 Index]