cv.function {cv}R Documentation

Cross-Validate a Model-Selection Procedure

Description

The cv() "function" method is a general function to cross-validate a model-selection procedure, such as the following: selectStepAIC() is a procedure that applies the stepAIC() model-selection function in the MASS package; selectTrans() is a procedure for selecting predictor and response transformations in regression, which uses the powerTransform() function in the car package; selectTransAndStepAIC() combines predictor and response transformations with predictor selection; and selectModelList() uses cross-validation to select a model from a list of models created by models() and employs (recursive) cross-validation to assess the predictive accuracy of this procedure.

Usage

## S3 method for class ''function''
cv(
  model,
  data,
  criterion = mse,
  k = 10L,
  reps = 1L,
  seed = NULL,
  working.model = NULL,
  y.expression = NULL,
  confint = n >= 400L,
  level = 0.95,
  details = k <= 10L,
  save.model = FALSE,
  ncores = 1L,
  ...
)

selectStepAIC(
  data,
  indices,
  model,
  criterion = mse,
  AIC = TRUE,
  details = TRUE,
  save.model = FALSE,
  ...
)

selectTrans(
  data,
  indices,
  details = TRUE,
  save.model = FALSE,
  model,
  criterion = mse,
  predictors,
  response,
  family = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  rounded = TRUE,
  ...
)

selectTransStepAIC(
  data,
  indices,
  details = TRUE,
  save.model = FALSE,
  model,
  criterion = mse,
  predictors,
  response,
  family = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  family.y = c("bcPower", "bcnPower", "yjPower", "basicPower"),
  rounded = TRUE,
  AIC = TRUE,
  ...
)

selectModelList(
  data,
  indices,
  model,
  criterion = mse,
  k = 10L,
  k.recurse = k,
  details = k <= 10L,
  save.model = FALSE,
  seed = FALSE,
  quietly = TRUE,
  ...
)

compareFolds(object, digits = 3, ...)

## S3 method for class 'cvSelect'
coef(object, average, NAs = 0, ...)

Arguments

model

a regression model object fit to data, or for the cv() "function" method, a model-selection procedure function (see Details).

data

full data frame for model selection.

criterion

a CV criterion ("cost" or lack-of-fit) function.

k

perform k-fold cross-validation (default is 10); k may be a number or "loo" or "n" for n-fold (leave-one-out) cross-validation.

reps

number of times to replicate k-fold CV (default is 1)

seed

for R's random number generator; not used for n-fold cross-validation. If not explicitly set, a seed is randomly generated and saved to make the results reproducible. In some cases, for internal use only, seed is set to FALSE to suppress automatically setting the seed.

working.model

a regression model object fit to data, typically to begin a model-selection process; for use with selectModelList(), a list of competing models created by models().

y.expression

normally the response variable is found from the model or working.model argument; but if, for a particular selection procedure, the model or working.model argument is absent, or if the response can't be inferred from the model, the response can be specified by an expression, such as expression(log(income)), to be evaluated within the data set provided by the data argument.

confint

if TRUE (the default if the number of cases is 400 or greater), compute a confidence interval for the bias-corrected CV criterion, if the criterion is the average of casewise components.

level

confidence level (default 0.95).

details

if TRUE, save detailed information about the value of the CV criterion for the cases in each fold and the regression coefficients (and possibly other information) with that fold deleted; default is TRUE if k is 10 or smaller, FALSE otherwise.

save.model

save the model that's selected using the full data set (default, FALSE).

ncores

number of cores to use for parallel computations (default is 1, i.e., computations aren't done in parallel)

...

for cvSelect() and the cv() "function" method, arguments to be passed to procedure(); for selectStepAIC() and selectTransStepAIC(), arguments to be passed to stepAIC().

indices

indices of cases in data defining the current fold.

AIC

if TRUE (the default) use the AIC as the model-selection criterion; if FALSE, use the BIC. The k argument to stepAIC() is set accordingly (note that this is distinct from the number of folds k).

predictors

character vector of names of the predictors in the model to transform; if missing, no predictors will be transformed.

response

name of the response variable; if missing, the response won't be transformed.

family

transformation family for the predictors, one of "bcPower", "bcnPower", "yjPower", "basicPower", with "bcPower" as the default. These are the names of transformation functions in the car package; see bcPower().

family.y

transformation family for the response, with "bcPower" as the default.

rounded

if TRUE (the default) use nicely rounded versions of the estimated transformation parameters (see bcPower()).

k.recurse

the number of folds for recursive CV; defaults to the value of k; may be specified as "loo" or "n" as well as an integer.

quietly

if TRUE (the default), simple messages (for example about the value to which the random-number generator seed is set), but not warnings or errors, are suppressed.

object

an object of class "cvSelect".

digits

significant digits for printing coefficients (default 3).

average

if supplied, a function, such as mean or median, to use us in averaging estimates across folds; if missing, the estimates for each fold are returned.

NAs

values to substitute for NAs in calculating averaged estimates; the default, 0, is appropriate, e.g., for regression coefficients; the value 1 might be appropriate for power-transformation estimates.

Details

The model-selection function supplied as the procedure (for cvSelect()) or model (for cv()) argument should accept the following arguments:

data

set to the data argument to cvSelect() or cv().

indices

the indices of the rows of data defining the current fold; if missing, the model-selection procedure is applied to the full data.

other arguments

to be passed via ... from cvSelect() or cv().

procedure() or model() should return a list with the following named elements: fit.i, the vector of predicted values for the cases in the current fold computed from the model omitting these cases; crit.all.i, the CV criterion computed for all of the cases using the model omitting the current fold; and (optionally) coefficients, parameter estimates from the model computed omitting the current fold.

When the indices argument is missing, procedure() returns the cross-validation criterion for all of the cases based on the model fit to all of the cases.

For examples of model-selection functions for the procedure argument, see the code for selectStepAIC(), selectTrans(), and selectTransAndStepAIC().

For additional information, see the "Cross-validating model selection" vignette (vignette("cv-select", package="cv")) and the "Extending the cv package" vignette (vignette("cv-extend", package="cv")).

Value

An object of class "cvSelect", inheriting from class "cv", with the CV criterion ("CV crit"), the bias-adjusted CV criterion ("adj CV crit"), the criterion for the model applied to the full data ("full crit"), the confidence interval and level for the bias-adjusted CV criterion ("confint"), the number of folds ("k"), the seed for R's random-number generator ("seed"), and (optionally) a list of coefficients (or, in the case of selectTrans(), estimated transformation parameters, and in the case of selectTransAndStepAIC(), both regression coefficients and transformation parameters) for the selected models for each fold ("coefficients"). If reps > 1, then an object of class c("cvSelectList", "cvList") is returned, which is literally a list of c("cvSelect", "cv") objects.

Functions

See Also

stepAIC, bcPower, powerTransform, cv.

Examples

data("Auto", package="ISLR2")
m.auto <- lm(mpg ~ . - name - origin, data=Auto)
cv(selectStepAIC, Auto, seed=123, working.model=m.auto)
cv(selectStepAIC, Auto, seed=123, working.model=m.auto,
         AIC=FALSE, k=5, reps=3) # via BIC
data("Prestige", package="carData")
m.pres <- lm(prestige ~ income + education + women,
             data=Prestige)
cvt <- cv(selectTrans, data=Prestige, working.model=m.pres, seed=123,
          predictors=c("income", "education", "women"),
          response="prestige", family="yjPower")
cvt
compareFolds(cvt)
coef(cvt, average=median, NAs=1) # NAs not really needed here
cv(m.pres, seed=123)
Auto$year <- as.factor(Auto$year)
Auto$origin <- factor(Auto$origin,
                      labels=c("America", "Europe", "Japan"))
rownames(Auto) <- make.names(Auto$name, unique=TRUE)
Auto$name <- NULL
m.auto <- lm(mpg ~ . , data=Auto)
cvs <- cv(selectTransStepAIC, data=Auto, seed=76692, working.model=m.auto,
          criterion=medAbsErr,
          predictors=c("cylinders", "displacement", "horsepower",
                       "weight", "acceleration"),
          response="mpg", AIC=FALSE)
cvs
compareFolds(cvs)
data("Duncan", package="carData")
m1 <- lm(prestige ~ income + education, data=Duncan)
m2 <- lm(prestige ~ income + education + type, data=Duncan)
m3 <- lm(prestige ~ (income + education)*type, data=Duncan)
cv(selectModelList, data=Duncan, seed=5962,
   working.model=models(m1, m2, m3)) # recursive CV


[Package cv version 2.0.0 Index]