cv_spatial {blockCV} | R Documentation |
Use spatial blocks to separate train and test folds
Description
This function creates spatially separated folds based on a specified distance or a number of rows and/or columns.
It assigns blocks to the training and testing folds randomly, systematically or
in a checkerboard pattern. The distance (size) should be in metres,
regardless of the unit of the reference system of
the input data (for more information see the details section). By default,
the function creates blocks according to the extent and shape of the spatial sample data (x,
e.g. the species occurrence). Alternatively, blocks can be created based on r, assuming that the
user has considered the landscape for the given species and case study.
Blocks can also be offset so the origin is not at the outer corner of the rasters.
Instead of providing a distance, the blocks can also be created by specifying a number of rows and/or
columns, dividing the study area into vertical or horizontal bins, as presented in Wenger & Olden (2012)
and Bahn & McGill (2012). Finally, the blocks can be specified by a user-defined spatial polygon layer.
Usage
cv_spatial(
  x,
  column = NULL,
  r = NULL,
  k = 5L,
  hexagon = TRUE,
  flat_top = FALSE,
  size = NULL,
  rows_cols = c(10, 10),
  selection = "random",
  iteration = 100L,
  user_blocks = NULL,
  folds_column = NULL,
  deg_to_metre = 111325,
  biomod2 = TRUE,
  offset = c(0, 0),
  extend = 0,
  seed = NULL,
  progress = TRUE,
  report = TRUE,
  plot = TRUE,
  ...
)
Arguments
x |
a simple features (sf) or SpatialPoints object of spatial sample data (e.g., species data or ground truth sample for image classification). |
column |
character (optional). Indicating the name of the column in which the response variable (e.g. species data as a binary
response i.e. 0s and 1s) is stored to find balanced records in cross-validation folds. If |
r |
a terra SpatRaster object (optional). If provided, its extent will be used to specify the blocks. It also supports stars, raster, or path to a raster file on disk. |
k |
integer value. The number of desired folds for cross-validation. The default is |
hexagon |
logical. Creates hexagonal (default) spatial blocks. If |
flat_top |
logical. Creates hexagonal blocks with flat tops. |
size |
numeric value of the specified range by which blocks are created and training/testing data are separated.
This distance should be in metres. The range could be explored by |
rows_cols |
integer vector. Two integers to define the blocks based on row and
column e.g. |
selection |
type of assignment of blocks into folds. Can be random (default), systematic, checkerboard, or predefined.
The checkerboard does not work with hexagonal and user-defined spatial blocks. If the |
iteration |
integer value. The number of attempts to create folds with balanced records. Only works when |
user_blocks |
an sf or SpatialPolygons object to be used as the blocks (optional). This can be a user defined polygon and it must cover all
the species (response) points. If |
folds_column |
character. Indicating the name of the column (in |
deg_to_metre |
integer. The number of metres per degree, used to convert size to degrees when the data are in a geographic coordinate system. See the details section for more information. |
biomod2 |
logical. Creates a matrix of folds that can be directly used in the biomod2 package as a CV.user.table for cross-validation. |
offset |
two numbers between 0 and 1 to shift the blocks by that proportion of the block size.
This option only works when |
extend |
numeric; This parameter specifies the percentage by which the map's extent is expanded to increase the size of the square spatial blocks, ensuring that all points fall within a block. The value should be a numeric between 0 and 5. |
seed |
integer; a random seed for reproducibility (although an external seed should also work). |
progress |
logical; whether to show a progress bar for random fold selection. |
report |
logical; whether to print the report of the records per fold. |
plot |
logical; whether to plot the final blocks with fold numbers in ggplot.
You can re-create this with |
... |
additional option for |
Details
To maintain consistency, all functions in this package use meters as their unit of
measurement. However, when the input map has a geographic coordinate system (in decimal degrees),
the block size is calculated by dividing the size parameter by deg_to_metre (which
defaults to 111325 meters, the approximate length of one degree of longitude at the Equator).
In reality, this value varies by a factor of the cosine of the latitude. So, an alternative sensible
value could be cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325.
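As a rough illustration of the adjustment described above, the following base R sketch compares the default conversion rate with a latitude-adjusted one and shows the block size in degrees that each implies. The latitude range and block size are made-up values for illustration; with real data the latitudes would come from sf::st_bbox(x)[c(2, 4)].

```r
# Hypothetical latitude range of the sample data (decimal degrees);
# with an sf object this would be sf::st_bbox(x)[c(2, 4)].
lat_range <- c(-40, -30)

# Illustrative block size in metres.
size <- 450000

# Default conversion rate used by deg_to_metre (metres per degree).
default_rate <- 111325

# Latitude-adjusted rate, following the expression given in the text.
adjusted_rate <- cos(mean(lat_range) * pi / 180) * default_rate

# Block size in degrees implied by each conversion rate.
size_deg_default <- size / default_rate
size_deg_adjusted <- size / adjusted_rate

print(c(default = size_deg_default, adjusted = size_deg_adjusted))
```

Away from the Equator the adjusted rate is smaller than the default, so the same size in metres corresponds to a larger block in degrees. The adjusted value can be supplied through the deg_to_metre argument.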
The offset can be used to change the spatial position of the blocks. It can also be used to
assess the sensitivity of analysis results to shifting in the blocking arrangements.
These options are available when size is defined. By default the region is
located in the middle of the blocks and by setting the offsets, the blocks will shift.
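One possible way to probe this sensitivity is to repeat the blocking with several offsets and compare the records per fold. The sketch below reuses the example data shipped with the package; the particular offsets, the seed, and the choice of square blocks (hexagon = FALSE) are assumptions made for illustration, not recommendations.

```r
library(blockCV)

# Example presence-absence data shipped with blockCV.
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)

# Illustrative offsets, each a proportion of the block size in x and y.
offsets <- list(c(0, 0), c(0.5, 0), c(0, 0.5), c(0.5, 0.5))

# Repeat the blocking for each offset and keep the per-fold record table.
records_by_offset <- lapply(offsets, function(off) {
  cv_spatial(
    x = pa_data,
    column = "occ",
    size = 450000,
    k = 5,
    hexagon = FALSE,      # assumption: square blocks for the offset comparison
    selection = "random",
    iteration = 50,
    offset = off,
    seed = 42,            # arbitrary seed so the runs are repeatable
    progress = FALSE,
    report = FALSE,
    plot = FALSE
  )$records
})

# One table of training/testing counts per offset.
records_by_offset
```

Large differences between the tables would suggest that the fold composition depends noticeably on where the block grid happens to fall.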
Roberts et al. (2017) suggest that blocks should be substantially bigger than the range of spatial
autocorrelation (in model residuals) to obtain realistic error estimates, while a buffer with the size of
the spatial autocorrelation range would result in a good estimation of error. This is because of the so-called
edge effect (O'Sullivan & Unwin, 2010), whereby points located on the edges of the blocks of opposite sets are
not separated spatially. Blocking with a buffering strategy overcomes this issue (see cv_buffer).
Value
An object of class S3. A list of objects including:
folds_list - a list containing the folds. Each fold has two vectors with the training (first) and testing (second) indices
folds_ids - a vector of values indicating the number of the fold for each observation (each number corresponds to the same point in species data)
biomod_table - a matrix with the folds to be used in biomod2 package
k - number of the folds
size - input size, if not null
column - the name of the column if provided
blocks - spatial polygon of the blocks
records - a table with the number of points in each category of training and testing
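To show how the folds_list structure described above might be consumed, here is a minimal base R sketch. The folds_list and data frame below are hand-made toy stand-ins that only mimic the documented layout (training indices first, testing indices second); they are not real cv_spatial output.

```r
# Toy stand-in for the folds_list element (NOT real cv_spatial output):
# each fold holds the training indices first and the testing indices second.
folds_list <- list(
  list(c(3L, 4L, 5L, 6L), c(1L, 2L)),
  list(c(1L, 2L, 5L, 6L), c(3L, 4L)),
  list(c(1L, 2L, 3L, 4L), c(5L, 6L))
)

# Toy data with one row per observation (made-up values).
dat <- data.frame(
  occ = c(1, 0, 1, 1, 0, 0),
  covar = c(0.2, 0.5, 0.9, 0.7, 0.1, 0.4)
)

# Split the data for each fold and record the number of rows in each part.
fold_sizes <- t(sapply(folds_list, function(fold) {
  train <- dat[fold[[1]], ]  # first vector: training rows
  test <- dat[fold[[2]], ]   # second vector: testing rows
  c(train = nrow(train), test = nrow(test))
}))

print(fold_sizes)
```

In a real workflow the model fitting and evaluation would replace the row counting inside the loop.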
References
Bahn, V., & McGill, B. J. (2012). Testing the predictive performance of distribution models. Oikos, 122(3), 321-331.
O'Sullivan, D., & Unwin, D. J. (2010). Geographic Information Analysis, 2nd ed. John Wiley & Sons.
Roberts et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40, 913-929.
Wenger, S. J., & Olden, J. D. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol., 3, 260-267.
See Also
cv_buffer and cv_cluster; cv_spatial_autocor and cv_block_size for selecting block size.
For CV.user.table see BIOMOD_Modeling in the biomod2 package.
Examples
library(blockCV)
# import presence-absence species data
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
# make an sf object from data.frame
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
# hexagonal spatial blocking by specified size and random assignment
sb1 <- cv_spatial(x = pa_data,
                  column = "occ",
                  size = 450000,
                  k = 5,
                  selection = "random",
                  iteration = 50)
# spatial blocking by row/column and systematic fold assignment
sb2 <- cv_spatial(x = pa_data,
                  column = "occ",
                  rows_cols = c(8, 10),
                  k = 5,
                  hexagon = FALSE,
                  selection = "systematic")