partition_cv_strat {sperrorest}

R Documentation

Partition the data for a stratified (non-spatial) cross-validation

Description

partition_cv_strat creates a set of sample indices corresponding to cross-validation test and training sets.

Usage

partition_cv_strat(
  data,
  coords = c("x", "y"),
  nfold = 10,
  return_factor = FALSE,
  repetition = 1,
  seed1 = NULL,
  strat
)

Arguments

`data`	`data.frame` containing at least the columns specified by `coords`
`coords`	vector of length 2 defining the variables in `data` that contain the x and y coordinates of sample locations
`nfold`	number of partitions (folds) in `nfold`-fold cross-validation partitioning
`return_factor`	if `FALSE` (default), return a represampling object; if `TRUE` (used internally by other sperrorest functions), return a `list` containing factor vectors (see Value)
`repetition`	numeric vector: cross-validation repetitions to be generated. Note that this is not the number of repetitions, but the indices of these repetitions. E.g., use `repetition = c(1:100)` to obtain (the 'first') 100 repetitions, and `repetition = c(101:200)` to obtain a different set of 100 repetitions.
`seed1`	`seed1+i` is the random seed that will be used by set.seed in repetition `i` (`i` in `repetition`) to initialize the random number generator before sampling from the data set.
`strat`	character: column in `data` containing a factor variable over which the partitioning should be stratified; or factor vector of length `nrow(data)`: variable over which to stratify
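
The interplay of `repetition` and `seed1` can be illustrated with a small base R sketch. It assumes only what the argument descriptions state, namely that `set.seed(seed1 + i)` is called before sampling in repetition `i`. The helper `draw_repetitions()` is hypothetical and not part of sperrorest; it draws a random permutation instead of a full partition.

```r
# Hypothetical helper (not sperrorest code): one random permutation of 1..n
# per repetition index i, drawn after set.seed(seed1 + i).
draw_repetitions <- function(n, repetition, seed1) {
  out <- lapply(repetition, function(i) {
    set.seed(seed1 + i)
    sample(n)
  })
  names(out) <- as.character(repetition)
  out
}

a <- draw_repetitions(n = 20, repetition = 1:3, seed1 = 100)
b <- draw_repetitions(n = 20, repetition = 2:4, seed1 = 100)

# Repetitions sharing an index use the same seed and are reproduced exactly
identical(a[["2"]], b[["2"]])  # TRUE
identical(a[["3"]], b[["3"]])  # TRUE
```

Under this assumption, the repetition index rather than its position in the vector determines the random draw, which is why `repetition = c(101:200)` yields a set of repetitions distinct from `repetition = c(1:100)`.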

Value

A represampling object, see also partition_cv(). partition_cv_strat(), however, stratifies the partitioning with respect to the variable data[, strat]; i.e., cross-validation partitioning is done within each subset data[data[, strat] == i, ] (i in levels(data[, strat])), and the ith folds of all levels are combined into one cross-validation fold.
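
The stratification described above can be sketched in base R. This is a minimal illustration of the idea, not the package's implementation: the helper `stratified_folds()` and the toy factor are assumptions made for this example. Fold labels are assigned separately within each level of the stratification variable, so that fold `i` collects the `i`th part of every level.

```r
# Hypothetical helper (not sperrorest code): returns a fold number for each
# observation, assigning folds separately within each level of `strat`.
stratified_folds <- function(strat, nfold) {
  strat <- as.factor(strat)
  fold <- integer(length(strat))
  for (lev in levels(strat)) {
    rows <- which(strat == lev)
    # Fold labels 1..nfold recycled to the stratum size, then shuffled
    labels <- rep(seq_len(nfold), length.out = length(rows))
    fold[rows] <- labels[sample.int(length(rows))]
  }
  fold
}

set.seed(42)
# Toy stratification variable: 60 "TRUE" and 40 "FALSE" observations
strat <- factor(rep(c("TRUE", "FALSE"), times = c(60, 40)))
fold <- stratified_folds(strat, nfold = 5)

# Every fold contains 12 "TRUE" and 8 "FALSE" observations
table(fold, strat)
```

Because both toy strata are divisible by `nfold`, every fold reproduces the overall class proportions exactly; with stratum sizes that are not divisible by `nfold`, the proportions agree only approximately.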

Examples

data(ecuador)
parti <- partition_cv_strat(ecuador,
  strat = "slides", nfold = 5,
  repetition = 1
)
idx <- parti[["1"]][[1]]$train
mean(ecuador$slides[idx] == "TRUE") / mean(ecuador$slides == "TRUE")
# very close to 1 for every fold, because the class proportions are preserved
# Non-stratified cross-validation:
parti <- partition_cv(ecuador, nfold = 5, repetition = 1)
idx <- parti[["1"]][[1]]$train
mean(ecuador$slides[idx] == "TRUE") / mean(ecuador$slides == "TRUE")
# close to 1 because of large sample size, but with some random variation