cv {cv}    R Documentation

Cross-Validate Regression Models

Description

cv() is a parallelized generic k-fold (including n-fold, i.e., leave-one-out) cross-validation function, with a default method, specific methods for linear and generalized-linear models that can be much more computationally efficient, and a method for robust linear models. There are also cv() methods for mixed-effects models, for model-selection procedures, and for several models fit to the same data, which are documented separately.

Usage

cv(model, data, criterion, k, reps = 1L, seed, ...)

## Default S3 method:
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  criterion.name = deparse(substitute(criterion)),
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  ncores = 1L,
  type = "response",
  start = FALSE,
  model.function,
  ...
)

## S3 method for class 'lm'
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  method = c("auto", "hatvalues", "Woodbury", "naive"),
  ncores = 1L,
  ...
)

## S3 method for class 'glm'
cv(
  model,
  data = insight::get_data(model),
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  details = k <= 10L,
  confint = n >= 400L,
  level = 0.95,
  method = c("exact", "hatvalues", "Woodbury"),
  ncores = 1L,
  start = FALSE,
  ...
)

## S3 method for class 'rlm'
cv(model, data, criterion, k, reps = 1L, seed, ...)

## S3 method for class 'cv'
print(x, digits = getOption("digits"), ...)

## S3 method for class 'cvList'
print(x, ...)

## S3 method for class 'cv'
as.data.frame(
  x,
  row.names = NULL,
  optional = TRUE,
  rows = c("cv", "folds"),
  columns = c("criteria", "coefficients"),
  ...
)

## S3 method for class 'cvList'
as.data.frame(x, row.names = NULL, optional = TRUE, ...)

## S3 method for class 'cvDataFrame'
print(x, digits = getOption("digits") - 2L, ...)

## S3 method for class 'cvDataFrame'
summary(
  object,
  formula,
  subset = NULL,
  fun = mean,
  include = c("cv", "folds", "all"),
  ...
)

Arguments

model

a regression model object (see Details).

data

data frame to which the model was fit (not usually necessary).

criterion

cross-validation criterion ("cost" or lack-of-fit) function of form f(y, yhat) where y is the observed values of the response and yhat the predicted values; the default is mse (the mean-squared error).

k

perform k-fold cross-validation (default is 10); k may be a number or "loo" or "n" for n-fold (leave-one-out) cross-validation.

reps

number of times to replicate k-fold CV (default is 1).

seed

seed for R's random-number generator; optional: if not supplied, a random seed will be selected and saved; not needed for n-fold cross-validation.

...

to match generic; passed to predict() for the default cv() method; passed to the Tapply() function in the car package for summary.cvDataFrame().

criterion.name

a character string giving the name of the CV criterion function in the returned "cv" object (not usually needed).

details

if TRUE (the default if the number of folds k <= 10), save detailed information about the value of the CV criterion for the cases in each fold and the regression coefficients with that fold deleted.

confint

if TRUE (the default if the number of cases is 400 or greater), compute a confidence interval for the bias-corrected CV criterion, if the criterion is the average of casewise components.

level

confidence level (default 0.95).

ncores

number of cores to use for parallel computations (default is 1, i.e., computations aren't done in parallel).

type

for the default method, value to be passed to the type argument of predict(); the default is type="response", which is appropriate, e.g., for a "glm" model and may be recognized or ignored by predict() methods for other model classes.

start

if TRUE (the default is FALSE), the start argument to update() is set to the vector of regression coefficients for the model fit to the full data, possibly making the CV updates faster, e.g., for a GLM.

model.function

a regression function, typically for a new cv() method that calls cv.default() via NextMethod(), residing in a package that's not a declared dependency of the cv package, e.g., nnet::multinom. It's usually not necessary to specify model.function to make cv.default() work.

method

computational method to apply to a linear (i.e., "lm") model or to a generalized linear (i.e., "glm") model. See Details for an explanation of the available options.

x

a "cv", "cvList", or "cvDataFrame" object to be printed or coerced to a data frame.

digits

significant digits for printing, default taken from the "digits" option.

row.names

optional row names for the result, defaults to NULL.

optional

to match the as.data.frame() generic function; if FALSE (the default is TRUE), then the names of the columns of the returned data frame, including the names of coefficients, are coerced to syntactically correct names.

rows

the rows of the resulting data frame to retain: setting rows="cv" retains rows pertaining to the overall CV result (marked as "fold 0"); setting rows="folds" retains rows pertaining to individual folds 1 through k; the default is rows = c("cv", "folds"), which retains all rows.

columns

the columns of the resulting data frame to retain: setting columns="criteria" retains columns pertaining to CV criteria; setting columns="coefficients" retains columns pertaining to model coefficients (broadly construed); the default is columns = c("criteria", "coefficients"), which retains both. The columns "model", "rep", and "fold", if present, are always retained.

object

an object inheriting from "cvDataFrame" to summarize.

formula

of the form some.criterion ~ classifying.variable(s) (see examples).

subset

a subsetting expression; the default (NULL) is not to subset the "cvDataFrame" object.

fun

summary function to apply, defaulting to mean.

include

which rows of the "cvDataFrame" to include in the summary. One of "cv" (the default), rows representing the overall CV results; "folds", rows for individual folds; "all", all rows (generally not sensible).
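As an illustration of the criterion argument described above, the following is a minimal sketch of a user-written cost function of the form f(y, yhat). The name mae and the use of the built-in mtcars data are assumptions made for the example; they are not part of the cv package.

```r
# Hypothetical criterion (not supplied by this help page): mean absolute error.
# It has the documented form f(y, yhat): observed values first, predictions second.
mae <- function(y, yhat) mean(abs(y - yhat))

# Evaluate it directly on a model fit to the full data
m <- lm(mpg ~ hp, data = mtcars)
mae(mtcars$mpg, fitted(m))

# Given the documented signature, it would presumably be passed as
#   cv(m, criterion = mae)
# (left as a comment so that the sketch runs without the cv package).
```

Because mae is a simple average of casewise absolute errors, it is the kind of criterion for which the confint argument is described as applicable.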

Details

The default cv() method uses update() to refit the model to each fold, and should work if there are appropriate update() and predict() methods, and if the default method for GetResponse() works or if a GetResponse() method is supplied.
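The refit-per-fold strategy described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: naive_cv is a hypothetical helper, the response is extracted with model.response() rather than GetResponse(), and the criterion is fixed at the mean-squared error.

```r
# Hypothetical helper illustrating k-fold CV via update() and predict();
# not the code used by cv.default().
naive_cv <- function(model, data, k = 10L, seed = 1234L) {
  set.seed(seed)
  n <- nrow(data)
  folds <- sample(rep_len(seq_len(k), n))      # random assignment of cases to folds
  y <- model.response(model.frame(model))      # observed response
  yhat <- numeric(n)
  for (j in seq_len(k)) {
    held_out <- folds == j
    train <- data[!held_out, , drop = FALSE]
    fit_j <- update(model, data = train)       # refit with fold j deleted
    yhat[held_out] <- predict(fit_j,
                              newdata = data[held_out, , drop = FALSE],
                              type = "response")
  }
  mean((y - yhat)^2)                           # MSE over the held-out predictions
}

m <- lm(mpg ~ hp, data = mtcars)
naive_cv(m, mtcars, k = 5L)
```

The sketch needs only update() and predict() methods for the model class, which is why this approach is general but slow: the model is refit once per fold.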

The "lm" and "glm" methods can use much faster computational algorithms, as selected by the method argument. The linear-model method accommodates weighted linear models.

For both classes of models, in the leave-one-out (n-fold) case, fitted values for the folds can be computed from the hat-values via method="hatvalues" without refitting the model; this method is exact for LMs and approximate for GLMs.
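For a linear model, the hat-value shortcut rests on the standard identity that the leave-one-out residual for case i equals e_i / (1 - h_i), where e_i is the ordinary residual and h_i the hat-value. The sketch below checks that identity against brute-force refitting; the mtcars data and the variable names are assumptions for the example, and the code is not taken from the package.

```r
m <- lm(mpg ~ hp + wt, data = mtcars)

# Leave-one-out residuals from one fit: e_i / (1 - h_i)
loo_resid <- residuals(m) / (1 - hatvalues(m))
mse_loo <- mean(loo_resid^2)

# Brute-force check: refit the model n times, each time omitting one case
loo_naive <- sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ hp + wt, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
})

all.equal(unname(loo_resid), unname(loo_naive))
```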

Again for both classes of models, when more than one case is omitted in each fold, fitted values may be obtained without refitting the model by exploiting the Woodbury matrix identity via method="Woodbury". As with method="hatvalues", this method is exact for LMs and approximate for GLMs.
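To make the Woodbury idea concrete for a linear model, the following sketch updates (X'X)^{-1} for the deletion of one fold and compares the resulting coefficients with an explicit refit. It is an illustration of the identity under assumed inputs (the mtcars data and an arbitrary fold made up of the first six cases), not necessarily the computation the package performs internally.

```r
m <- lm(mpg ~ hp + wt, data = mtcars)
X <- model.matrix(m)
y <- mtcars$mpg

A   <- solve(crossprod(X))   # (X'X)^{-1}, computed once from the full data
Xty <- crossprod(X, y)       # X'y

fold <- 1:6                  # hypothetical fold: the first six cases
Xj <- X[fold, , drop = FALSE]
yj <- y[fold]

# Woodbury identity for deleting the rows Xj:
#   (X'X - Xj'Xj)^{-1} = A + A Xj' (I - Xj A Xj')^{-1} Xj A
H_jj    <- Xj %*% A %*% t(Xj)
A_minus <- A + A %*% t(Xj) %*% solve(diag(length(fold)) - H_jj) %*% Xj %*% A

# Coefficients with the fold deleted, and predictions for the held-out cases
b_minus <- drop(A_minus %*% (Xty - crossprod(Xj, yj)))
yhat_j  <- drop(Xj %*% b_minus)

# Check against refitting without the fold
fit_refit <- lm(mpg ~ hp + wt, data = mtcars[-fold, ])
all.equal(b_minus, coef(fit_refit), tolerance = 1e-6)
```

Only a matrix of order equal to the fold size is inverted for each fold, which is the source of the speed advantage over refitting.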

The default for linear models is method="auto", which is equivalent to method="hatvalues" for n-fold cross-validation and method="Woodbury" otherwise; method="naive" refits the model via update() and is generally much slower. The default for generalized linear models is method="exact", which employs update(). This default is conservative, and it is usually safe to use method="hatvalues" for n-fold CV or method="Woodbury" for k-fold CV.

There is also a method for robust linear models fit by rlm() in the MASS package (to avoid inheriting the "lm" method for which the default "auto" computational method would be inappropriate).

For additional details, see the "Cross-validating regression models" vignette (vignette("cv", package="cv")).

cv() is designed to be extensible to other classes of regression models; see the "Extending the cv package" vignette (vignette("cv-extend", package="cv")).

Value

The cv() methods return an object of class "cv", with the CV criterion ("CV crit"), the bias-adjusted CV criterion ("adj CV crit"), the criterion for the model applied to the full data ("full crit"), the confidence interval and level for the bias-adjusted CV criterion ("confint"), the number of folds ("k"), and the seed for R's random-number generator ("seed"). If details=TRUE, then the returned object will also include a "details" component, which is a list of two elements: "criterion", containing the CV criterion computed for the cases in each fold; and "coefficients", regression coefficients computed for the model with each fold deleted. Some methods may return a subset of these components and may add additional information. If reps > 1, then an object of class "cvList" is returned, which is literally a list of "cv" objects.

See Also

cv.merMod, cv.function, cv.modList.

Examples

data("Auto", package="ISLR2")
m.auto <- lm(mpg ~ horsepower, data=Auto)
cv(m.auto, k="loo")
(cv.auto <- cv(m.auto, seed=1234))
compareFolds(cv.auto)
(cv.auto.reps <- cv(m.auto, seed=1234, reps=3))
D.auto.reps <- as.data.frame(cv.auto.reps)
head(D.auto.reps)
summary(D.auto.reps, mse ~ rep + fold, include="folds")
summary(D.auto.reps, mse ~ rep + fold, include = "folds",
        subset = fold <= 5) # first 5 folds
summary(D.auto.reps, mse ~ rep, include="folds")
summary(D.auto.reps, mse ~ rep, fun=sd, include="folds")

data("Mroz", package="carData")
m.mroz <- glm(lfp ~ ., data=Mroz, family=binomial)
cv(m.mroz, criterion=BayesRule, seed=123)

data("Duncan", package="carData")
m.lm <- lm(prestige ~ income + education, data=Duncan)
m.rlm <- MASS::rlm(prestige ~ income + education,
                   data=Duncan)
cv(m.lm, k="loo", method="Woodbury")
cv(m.rlm, k="loo")


[Package cv version 2.0.0 Index]