cvFit {cvTools} | R Documentation |
Cross-validation for model evaluation
Description
Estimate the prediction error of a model via (repeated) K
-fold
cross-validation. It is thereby possible to supply an object returned by a
model fitting function, a model fitting function itself, or an unevaluated
function call to a model fitting function.
Usage
cvFit(object, ...)
## Default S3 method:
cvFit(
object,
data = NULL,
x = NULL,
y,
cost = rmspe,
K = 5,
R = 1,
foldType = c("random", "consecutive", "interleaved"),
grouping = NULL,
folds = NULL,
names = NULL,
predictArgs = list(),
costArgs = list(),
envir = parent.frame(),
seed = NULL,
...
)
## S3 method for class ''function''
cvFit(
object,
formula,
data = NULL,
x = NULL,
y,
args = list(),
cost = rmspe,
K = 5,
R = 1,
foldType = c("random", "consecutive", "interleaved"),
grouping = NULL,
folds = NULL,
names = NULL,
predictArgs = list(),
costArgs = list(),
envir = parent.frame(),
seed = NULL,
...
)
## S3 method for class 'call'
cvFit(
object,
data = NULL,
x = NULL,
y,
cost = rmspe,
K = 5,
R = 1,
foldType = c("random", "consecutive", "interleaved"),
grouping = NULL,
folds = NULL,
names = NULL,
predictArgs = list(),
costArgs = list(),
envir = parent.frame(),
seed = NULL,
...
)
Arguments
object |
the fitted model for which to estimate the prediction error,
a function for fitting a model, or an unevaluated function call for fitting
a model (see |
... |
additional arguments to be passed down. |
data |
a data frame containing the variables required for fitting the
models. This is typically used if the model in the function call is
described by a |
x |
a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments. |
y |
a numeric vector or matrix containing the response. |
cost |
a cost function measuring prediction loss. It should expect
the observed values of the response to be passed as the first argument and
the predicted values as the second argument, and must return either a
non-negative scalar value, or a list with the first component containing
the prediction error and the second component containing the standard
error. The default is to use the root mean squared prediction error
(see |
K |
an integer giving the number of folds into which the data should
be split (the default is five). Keep in mind that this should be chosen
such that all folds are of approximately equal size. Setting |
R |
an integer giving the number of replications for repeated
|
foldType |
a character string specifying the type of folds to be
generated. Possible values are |
grouping |
a factor specifying groups of observations. If supplied, the data are split according to the groups rather than individual observations such that all observations within a group belong to the same fold. |
folds |
an object of class |
names |
an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”). |
predictArgs |
a list of additional arguments to be passed to the
|
costArgs |
a list of additional arguments to be passed to the
prediction loss function |
envir |
the |
seed |
optional initial seed for the random number generator (see
|
formula |
a |
args |
a list of additional arguments to be passed to the model fitting function. |
Details
(Repeated) K
-fold cross-validation is performed in the following
way. The data are first split into K
previously obtained blocks of
approximately equal size. Each of the K
data blocks is left out once
to fit the model, and predictions are computed for the observations in the
left-out block with the predict
method of the fitted
model. Thus a prediction is obtained for each observation.
The response variable and the obtained predictions for all observations are
then passed to the prediction loss function cost
to estimate the
prediction error. For repeated cross-validation, this process is replicated
and the estimated prediction errors from all replications as well as their
average are included in the returned object.
Furthermore, if the response is a vector but the
predict
method of the fitted models returns a matrix,
the prediction error is computed for each column. A typical use case for
this behavior would be if the predict
method returns
predictions from an initial model fit and stepwise improvements thereof.
If formula
or data
are supplied, all variables required for
fitting the models are added as one argument to the function call, which is
the typical behavior of model fitting functions with a
formula
interface. In this case, the accepted values
for names
depend on the method. For the function
method, a
character vector of length two should supplied, with the first element
specifying the argument name for the formula and the second element
specifying the argument name for the data (the default is to use
c("formula", "data")
). Note that names for both arguments should be
supplied even if only one is actually used. For the other methods, which do
not have a formula
argument, a character string specifying the
argument name for the data should be supplied (the default is to use
"data"
).
If x
is supplied, on the other hand, the predictor matrix and the
response are added as separate arguments to the function call. In this
case, names
should be a character vector of length two, with the
first element specifying the argument name for the predictor matrix and the
second element specifying the argument name for the response (the default is
to use c("x", "y")
). It should be noted that the formula
or
data
arguments take precedence over x
.
Value
An object of class "cv"
with the following components:
n |
an integer giving the number of observations or groups. |
K |
an integer giving the number of folds. |
R |
an integer giving the number of replications. |
cv |
a numeric vector containing the respective estimated prediction errors. For repeated cross-validation, those are average values over all replications. |
se |
a numeric vector containing the respective estimated standard errors of the prediction loss. |
reps |
a numeric matrix in which each column contains the respective estimated prediction errors from all replications. This is only returned for repeated cross-validation. |
seed |
the seed of the random number generator before cross-validation was performed. |
call |
the matched function call. |
Author(s)
Andreas Alfons
See Also
cvTool
, cvSelect
,
cvTuning
, cvFolds
, cost
Examples
library("robustbase")
data("coleman")
## via model fit
# fit an MM regression model
fit <- lmrob(Y ~ ., data=coleman)
# perform cross-validation
cvFit(fit, data = coleman, y = coleman$Y, cost = rtmspe,
K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)
## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
cvFit(lmrob, formula = Y ~ ., data = coleman, cost = rtmspe,
K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)
## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvFit(call, data = coleman, y = coleman$Y, cost = rtmspe,
K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)