sperrorest {sperrorest}    R Documentation
Perform spatial error estimation and variable importance assessment
Description
sperrorest is a flexible interface for multiple types of parallelized spatial and non-spatial cross-validation and bootstrap error estimation, as well as parallelized permutation-based assessment of spatial variable importance.
Usage
sperrorest(
formula,
data,
coords = c("x", "y"),
model_fun,
model_args = list(),
pred_fun = NULL,
pred_args = list(),
smp_fun = partition_cv,
smp_args = list(),
train_fun = NULL,
train_param = NULL,
test_fun = NULL,
test_param = NULL,
err_fun = err_default,
imp_variables = NULL,
imp_permutations = 1000,
imp_sample_from = c("test", "train", "all"),
importance = !is.null(imp_variables),
distance = FALSE,
do_gc = 1,
progress = "all",
benchmark = FALSE,
mode_rep = c("future", "sequential", "loop"),
mode_fold = c("sequential", "future", "loop"),
verbose = 0
)
Arguments
formula
A formula specifying the variables used by the model. Only simple formulas without interactions or nonlinear terms should be used.

data
a data.frame containing the predictor and response variables.

coords
vector of length 2 defining the variables in data that contain the x and y coordinates of sample locations.

model_fun
Function that fits a predictive model, such as glm or rpart. The first argument must be a formula and the second a data.frame with the learning sample.

model_args
Arguments to be passed to model_fun (in addition to the formula and data arguments, which are provided by sperrorest).

pred_fun
Prediction function for a fitted model object created by model_fun. Must accept at least two arguments: the fitted model object and a data.frame newdata with the data on which to predict the outcome.

pred_args
(optional) Arguments to pred_fun (in addition to the fitted model object and the newdata argument, which are provided by sperrorest).

smp_fun
A function for sampling training and test sets from data, e.g. partition_cv (the default) for non-spatial cross-validation or partition_kmeans for spatial cross-validation.

smp_args
(optional) Arguments to be passed to smp_fun.

train_fun
(optional) A function for resampling or subsampling the training sample in order to achieve, e.g., uniform sample sizes on all training sets, or to maintain a certain ratio of positives and negatives in training sets; e.g. resample_uniform or resample_strat_uniform.

train_param
(optional) Arguments to be passed to train_fun.

test_fun
(optional) Like train_fun, but applied to the test sets.

test_param
(optional) Arguments to be passed to test_fun.

err_fun
A function that calculates selected error measures from the known responses in data and the model predictions delivered by pred_fun; defaults to err_default.

imp_variables
(optional; used if importance = TRUE) Variables for which permutation-based variable importance assessment is performed.

imp_permutations
(optional; used if importance = TRUE) Number of permutations used for variable importance assessment (default: 1000).

imp_sample_from
(default: "test") specifies whether the permuted values are drawn from the test set, the training set ("train"), or the entire sample ("all").

importance
logical (default: !is.null(imp_variables)): perform permutation-based variable importance assessment?

distance
logical (default: FALSE): if TRUE, calculate mean nearest-neighbour distances from test samples to training samples.

do_gc
numeric (default: 1): defines the frequency of memory garbage collection by calling gc; if < 1, no garbage collection; if >= 1, gc is called after each repetition; if >= 2, also after each fold.

progress
character (default: "all"): "all" shows progress at the repetition and fold level, "rep" at the repetition level only; FALSE disables progress output.

benchmark
(optional) logical (default: FALSE): if TRUE, collect benchmarking information and return it in the benchmark component of the result.

mode_rep, mode_fold
character (defaults: "future" and "sequential", respectively): specifies whether repetitions and folds are processed sequentially ("sequential"), in parallel using the future framework ("future"), or in a plain for loop ("loop"). See section 'Parallelization'.

verbose
Controls the amount of information printed while processing. Defaults to 0 (no output).
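As an illustration of the err_fun interface, a custom error function can be supplied instead of err_default. This is a minimal sketch (not part of the package; it assumes the same two-argument observed/predicted signature as err_default and returns a named list of measures):

```r
# Minimal sketch of a custom err_fun (assumption: like err_default, it
# receives the observed and the predicted values and returns a named list
# of error measures).
my_err <- function(obs, pred) {
  list(
    bias = mean(pred - obs),          # mean error
    rmse = sqrt(mean((pred - obs)^2)) # root mean squared error
  )
}
```

Such a function could then be passed via err_fun = my_err.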
Details
Custom predict functions passed to pred_fun that rely on multiple child functions must bundle those child functions inside a single function definition.
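For example, a predict function that needs a helper can define that helper inside its own body instead of relying on the surrounding environment. A sketch (the use of glm as model_fun and the link-to-probability conversion are illustrative assumptions):

```r
# Sketch of a self-contained pred_fun: the helper ("child") function is
# defined inside the predict function rather than in the enclosing
# environment, so the whole predictor is one function object.
mypred <- function(object, newdata) {
  to_prob <- function(x) 1 / (1 + exp(-x)) # child function, defined inside
  to_prob(predict(object, newdata))        # link-scale predictions -> probabilities
}
```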
Value
A list (object of class sperrorest) with (up to) six components:
error_rep: a sperrorestreperror object containing predictive performances at the repetition level
error_fold: a sperroresterror object containing predictive performances at the fold level
represampling: a represampling object
importance: a sperrorestimportance object containing permutation-based variable importances at the fold level
benchmark: a sperrorestbenchmark object containing information on the system the code is running on, starting and finishing times, number of available CPU cores, and runtime performance
package_version: a sperrorestpackageversion object containing information about the sperrorest package version
Parallelization
Running in parallel is supported via package future.
Have a look at vignette("future-1-overview", package = "future")
.
In short: choose a backend and specify the number of workers, then call sperrorest() as usual. Example:
future::plan(future.callr::callr, workers = 2)
sperrorest()
Parallelization at the repetition level is recommended when using repeated cross-validation. If the 'granularity' of parallelized function calls is too fine, overall runtime suffers because the overhead for passing arguments and handling environments becomes too large. Use fold-level parallelization only when the processing time of individual folds is very large and the number of repetitions is small or equal to 1.
Note that nested calls to future are not possible. Therefore, use a sequential sperrorest call for hyperparameter tuning in a nested cross-validation setting.
References
Brenning, A. 2012. Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: the R package 'sperrorest'. 2012 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23-27 July 2012, p. 5372-5375. https://ieeexplore.ieee.org/document/6352393
Brenning, A. 2005. Spatial prediction models for landslide hazards: review, comparison and evaluation. Natural Hazards and Earth System Sciences, 5(6), 853-862. doi:10.5194/nhess-5-853-2005
Brenning, A., S. Long & P. Fieguth. 2012. Detecting rock glacier flow structures using Gabor filters and IKONOS imagery. Remote Sensing of Environment, 125, 227-237. doi:10.1016/j.rse.2012.07.005
Russ, G. & A. Brenning. 2010a. Data mining in precision agriculture: Management of spatial information. In 13th International Conference on Information Processing and Management of Uncertainty, IPMU 2010; Dortmund; 28 June - 2 July 2010. Lecture Notes in Computer Science, 6178 LNAI: 350-359.
Russ, G. & A. Brenning. 2010b. Spatial variable importance assessment for yield prediction in Precision Agriculture. In Advances in Intelligent Data Analysis IX, Proceedings, 9th International Symposium, IDA 2010, Tucson, AZ, USA, 19-21 May 2010. Lecture Notes in Computer Science, 6065 LNCS: 184-195.
Examples
## ------------------------------------------------------------
## Classification tree example using non-spatial partitioning
## ------------------------------------------------------------
# Muenchow et al. (2012), see ?ecuador
library(sperrorest)
library(rpart)
fo <- slides ~ dem + slope + hcurv + vcurv + log.carea + cslope
ctrl <- rpart.control(cp = 0.005) # show the effects of overfitting
fit <- rpart(fo, data = ecuador, control = ctrl)
### Non-spatial cross-validation:
mypred_part <- function(object, newdata) predict(object, newdata)[, 2] # probability of the second response level
nsp_res <- sperrorest(
data = ecuador, formula = fo,
model_fun = rpart,
model_args = list(control = ctrl),
pred_fun = mypred_part,
progress = "all",
smp_fun = partition_cv,
smp_args = list(repetition = 1:2, nfold = 3)
)
summary(nsp_res$error_rep)
summary(nsp_res$error_fold)
summary(nsp_res$represampling)
# plot(nsp_res$represampling, ecuador)
### Spatial cross-validation:
sp_res <- sperrorest(
data = ecuador, formula = fo,
model_fun = rpart,
model_args = list(control = ctrl),
pred_fun = mypred_part,
progress = "all",
smp_fun = partition_kmeans,
smp_args = list(repetition = 1:2, nfold = 3)
)
summary(sp_res$error_rep)
summary(sp_res$error_fold)
summary(sp_res$represampling)
# plot(sp_res$represampling, ecuador)
smry <- data.frame(
nonspat_training = unlist(summary(nsp_res$error_rep,
level = 1
)$train_auroc),
nonspat_test = unlist(summary(nsp_res$error_rep,
level = 1
)$test_auroc),
spatial_training = unlist(summary(sp_res$error_rep,
level = 1
)$train_auroc),
spatial_test = unlist(summary(sp_res$error_rep,
level = 1
)$test_auroc)
)
boxplot(smry,
col = c("red", "red", "red", "green"),
main = "Training vs. test, nonspatial vs. spatial",
ylab = "Area under the ROC curve"
)
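The examples above estimate predictive performance only. Since variable importance is a central feature of sperrorest, an importance assessment could be added along these lines (a sketch reusing fo, ctrl and mypred_part from above; the choice of imp_variables is illustrative, and imp_permutations is kept small only to keep the example fast):

```r
### Spatial cross-validation with permutation-based variable importance:
imp_res <- sperrorest(
  data = ecuador, formula = fo,
  model_fun = rpart,
  model_args = list(control = ctrl),
  pred_fun = mypred_part,
  smp_fun = partition_kmeans,
  smp_args = list(repetition = 1:2, nfold = 3),
  importance = TRUE,
  imp_variables = c("dem", "slope"),
  imp_permutations = 10 # use a larger value (e.g. the default 1000) in practice
)
summary(imp_res$importance)
```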