R: Cross-Validating a Spatial Linear Model Fitted by 'georob'

cv.georob {georob}

R Documentation

Cross-Validating a Spatial Linear Model Fitted by `georob`

Description

This function assesses the goodness-of-fit of a spatial linear model by K-fold cross-validation. In more detail, the model is re-fitted K times by robust (or Gaussian) (RE)ML, excluding each time 1/Kth of the data. The re-fitted models are used to compute robust (or customary) external Kriging predictions for the omitted observations. If the response variable is log-transformed then the Kriging predictions can be optionally transformed back to the original scale of the measurements. S3methods for evaluating and plotting diagnostic summaries of the cross-validation errors are described for the function validate.predictions.

Usage

## S3 method for class 'georob'
cv(object, formula = NULL, subset = NULL,
    method = c("block", "random"), nset = 10L, seed = NULL,
    sets = NULL, duplicates.in.same.set = TRUE, re.estimate = TRUE,
    param = object[["variogram.object"]][[1]][["param"]],
    fit.param = object[["variogram.object"]][[1]][["fit.param"]],
    aniso = object[["variogram.object"]][[1]][["aniso"]],
    fit.aniso = object[["variogram.object"]][[1]][["fit.aniso"]],
    variogram.object = NULL,
    use.fitted.param = TRUE, return.fit = FALSE,
    reduced.output = TRUE, lgn = FALSE,
    mfl.action = c("offset", "stop"),
    ncores = min(nset, parallel::detectCores()), verbose = 0, ...)

Arguments

`object`	an object of class of `"georob"`, see `georobObject`.
`formula`	an optional formula for the regression model passed by `update` to `georob`.
`subset`	an optional vector specifying a subset of observations to be used in the fitting process.
`method`	a character keyword, controlling whether subsets are formed by partitioning data set into contiguous spatial `block`s by `kmeans` (default) or `random`ly. Ignored if `sets` is non-`NULL`.
`nset`	a positive integer defining the number `K` of subsets into which the data set is partitioned (default: `nset = 10`). Ignored if `sets` is non-`NULL`.
`seed`	an optional integer seed to initialize random number generation, see `set.seed`. Ignored if `sets` is non-`NULL`.
`sets`	an optional vector of the same length as the response vector of the fitted model and with positive integers taking values in `(1,2,\ldots,K)`, defining in this way the `K` subsets into which the data set is split. If `sets = NULL` (default) the partition is randomly generated by `kmeans` or `runif` (using possibly `seed`).
`duplicates.in.same.set`	a logical scalar controlling whether replicated observations at a given location are assigned to the same subset when partitioning the data (default `TRUE`).
`re.estimate`	a logical scalar controlling whether the model is re-fitted to the reduced data sets before computing the Kriging predictions (`TRUE`, default) or whether the model passed in `object` is used to compute the predictions for the omitted observations, see Details.
`param`	a named numeric vector or a matrix or data frame with initial values of variogram parameters passed by `update` to `georob`. If `param` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[1]][["param"]])` columns with initial values of variogram parameters for the `nset` cross-validation sets, and `colnames(param)` must match `names(object[["variogram.object"]][[1]][["param"]])`.
`fit.param`	a named logical vector or a matrix or data frame defining which variogram parameters should be adjusted by `update`. If `fit.param` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[1]][["fit.param"]])` columns with variogram parameter fitting flags for the `nset` cross-validation sets, and `colnames(param)` must match `names(object[["variogram.object"]][[1]][["fit.param"]])`.
`aniso`	a named numeric vector or a matrix or data frame with initial values of anisotropy parameters passed by `update` to `georob`. If `aniso` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[1]][["aniso"]])` columns with initial values of anisotropy parameters for the `nset` cross-validation sets, and `colnames(aniso)` must match `names(object[["variogram.object"]][[1]][["aniso"]])`.
`fit.aniso`	a named logical vector or a matrix or data frame defining which anisotropy parameters should be adjusted by `update`. If `fit.aniso` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[1]][["fit.aniso"]])` columns with anisotropy parameter fitting flags for the `nset` cross-validation sets, and `colnames(param)` must match `names(object[["variogram.object"]][[1]][["fit.aniso"]])`.
`variogram.object`	an optional list that gives initial values for fitting a possibly nested variogram model for the cross-validation sets. Each component is a list with the following components: `param`: an optional named numeric vector or a matrix or data frame with initial values of variogram parameters passed by `update` to `georob`. If `param` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[i]][["param"]])` columns with initial values of variogram parameters for the `nset` cross-validation sets (`i` is the `i`th variogram structure), and `colnames(param)` must match `names(object[["variogram.object"]][[i]][["param"]])`. `fit.param`: an optional named logical vector or a matrix or data frame defining which variogram parameters should be adjusted by `update`. If `fit.param` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[i]][["fit.param"]])` columns with variogram parameter fitting flags for the `nset` cross-validation sets (`i` is the `i`th variogram structure), and `colnames(param)` must match `names(object[["variogram.object"]][[i]][["fit.param"]])`. `aniso`: an optional named numeric vector or a matrix or data frame with initial values of anisotropy parameters passed by `update` to `georob`. If `aniso` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[i]][["aniso"]])` columns with initial values of anisotropy parameters for the `nset` cross-validation sets (`i` is the `i`th variogram structure), and `colnames(aniso)` must match `names(object[["variogram.object"]][[i]][["aniso"]])`. `fit.aniso`: an optional named logical vector or a matrix or data frame defining which anisotropy parameters should be adjusted by `update`. If `fit.aniso` is a matrix (or a data frame) then it must have `nset` rows and `length(object[["variogram.object"]][[i]][["fit.aniso"]])` columns with anisotropy parameter fitting flags for the `nset` cross-validation sets(`i` is the `i`th variogram structure), and `colnames(param)` must match `names(object[["variogram.object"]][[i]][["fit.aniso"]])`.
`use.fitted.param`	a logical scalar controlling whether fitted values of `param` (and `aniso` are used as initial values when variogram parameters are fitted for the cross-validation sets (default `TRUE`).
`return.fit`	a logical scalar controlling whether information about the fit should be returned when re-estimating the model with the reduced data sets (default `FALSE`).
`reduced.output`	a logical scalar controlling whether the complete fitted model objects, fitted to the reduced data sets, are returned (`FALSE`) or only some components (`TRUE`, default, see Value). Ignored if `return.fit = FALSE`.
`lgn`	a logical scalar controlling whether Kriging predictions of a log-transformed response should be transformed back to the original scale of the measurements (default `FALSE`).
`mfl.action`	a character keyword controlling what is done when some levels of factor(s) are not present in any of the subsets used to fit the model. The function either stops (`"stop"`) or treats the respective factors as model offset (`"offset"`, default).
`ncores`	a positive integer controlling how many cores are used for parallelized computations, see Details.
`verbose`	a positive integer controlling logging of diagnostic messages to the console during model fitting. Passed by `update` to `georob`.
`...`	additional arguments passed by `update` to `georob`, see Details.

Details

Note that the data frame passed as data argument to georob must exist in the user workspace when calling cv.georob.

cv.georob uses the packages parallel and snowfall for parallelized computations. By default, the function uses K CPUs but not more than are physically available (as returned by detectCores).

cv.georob uses the function update to re-estimated the model with the reduced data sets. Therefore, any argument accepted by georob except data can be changed when re-fitting the model. Some of them (e.g. formula, subset, etc.) are explicit arguments of cv.georob, but also the remaining ones can be passed by ... to the function.

Practitioners in geostatistics commonly cross-validate a fitted model without re-estimating the model parameters with the reduced data sets. This is clearly an unsound practice (see Hastie et al., 2009, sec. 7.10). Therefore, the argument re.estimate should always be set to TRUE. The alternative is provided only for historic reasons.

Value

The method cv.georob returns an object of class cv.georob, which is a list with the two components pred and fit.

pred is a data frame with the coordinates and the cross-validation prediction results with the following variables:

`subset`	an integer vector defining to which of the `K` subsets an observation was assigned.
`data`	the values of the (possibly log-transformed) response.
`pred`	the Kriging predictions.
`se`	the Kriging standard errors.

If lgn = TRUE then pred has the additional variables:

`lgn.data`	the untransformed response.
`lgn.pred`	the unbiased back-transformed predictions of a log-transformed response.
`lgn.se`	the Kriging standard errors of the back-transformed predictions of a log-transformed response.

The second component fit contains either the full outputs of georob, fitted for the K reduced data sets (reduced.output = FALSE), or K lists with the components tuning.psi, converged,
convergence.code, gradient, variogram.object, coefficients along with the standard errors of \widehat{\boldsymbol{\beta}}, see georobObject.

Author(s)

Andreas Papritz papritz@retired.ethz.ch.

References

Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning; Data Mining, Inference and Prediction, Springer, New York, doi:10.1007/978-0-387-84858-7

Examples

## define number of cores for parallel computations
if(interactive()) ncpu <- 10L else ncpu <- 1L

data(meuse)

r.logzn <- georob(log(zinc) ~ sqrt(dist), data = meuse, locations = ~ x + y,
  variogram.model = "RMexp",
  param = c(variance = 0.15, nugget = 0.05, scale = 200),
  tuning.psi = 1000)

if(interactive()){
  ## example is run only in interactive session because cpu times exceeds 5 s
  r.logzn.cv.1 <- cv(r.logzn, seed = 1, lgn = TRUE, ncores = 1, verbose = 1)

  r.logzn.cv.2 <- cv(r.logzn, formula = .~. + ffreq, seed = 1, lgn = TRUE,
      ncores = ncpu)

  plot(r.logzn.cv.1, type = "bs")
  plot(r.logzn.cv.2, type = "bs", add = TRUE, col = "red")

  legend("topright", lty = 1, col = c("black", "red"), bty = "n",
      legend = c("log(Zn) ~ sqrt(dist)", "log(Zn) ~ sqrt(dist) + ffreq"))
}

[Package georob version 0.3-19 Index]

Cross-Validating a Spatial Linear Model Fitted by georob