R: Spatial Clustering Cross-Validation

spatial_clustering_cv {spatialsample}

R Documentation

Spatial Clustering Cross-Validation

Description

Spatial clustering cross-validation splits the data into V disjoint groups by clustering points based on their spatial coordinates. Each resample uses V-1 of the folds/clusters as the analysis set and the remaining fold/cluster as the assessment set.

Usage

spatial_clustering_cv(
  data,
  v = 10,
  cluster_function = c("kmeans", "hclust"),
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1,
  distance_function = function(x) as.dist(sf::st_distance(x))
)

Arguments

`data`	An `sf` object (often from `sf::read_sf()` or `sf::st_as_sf()`) to split into folds.
`v`	The number of partitions of the data set.
`cluster_function`	Which function should be used for clustering? Options are either `"kmeans"` (to use `stats::kmeans()`) or `"hclust"` (to use `stats::hclust()`). You can also provide your own function; see Details.
`radius`	Numeric: points within this distance of the initially-selected test points will be assigned to the assessment set. If `NULL`, no radius is applied.
`buffer`	Numeric: points within this distance of any point in the test set (after `radius` is applied) will be assigned to neither the analysis nor the assessment set. If `NULL`, no buffer is applied.
`...`	Extra arguments passed on to `stats::kmeans()` or `stats::hclust()`.
`repeats`	The number of times to repeat the clustered partitioning.
`distance_function`	Which function should be used for distance calculations? Defaults to `sf::st_distance()`, with the output matrix converted to a `stats::dist()` object. You can also provide your own function; see Details.

Details

Clusters are created based on the distances between observations in data, as computed by distance_function. Each cluster is used as a fold for cross-validation. Depending on how the data are distributed spatially, the folds may not contain equal numbers of observations.
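The idea of turning clusters into folds, and the uneven fold sizes that can result, can be sketched in base R. This is only an illustration on made-up planar coordinates; it is not the package's implementation, and the names coords, fold_id, and splits are hypothetical.

```r
set.seed(42)

# Made-up planar coordinates: one dense group, one medium group, one sparse group
coords <- rbind(
  cbind(rnorm(30, mean = 0,  sd = 0.5), rnorm(30, mean = 0,  sd = 0.5)),
  cbind(rnorm(15, mean = 10, sd = 0.5), rnorm(15, mean = 10, sd = 0.5)),
  cbind(rnorm(5,  mean = 20, sd = 0.5), rnorm(5,  mean = 0,  sd = 0.5))
)

# Pairwise distances, then hierarchical clustering cut into 3 clusters
dists <- stats::dist(coords)
fold_id <- stats::cutree(stats::hclust(dists), k = 3)

# Fold sizes follow the spatial distribution, so they need not be equal
table(fold_id)

# Each cluster serves once as the assessment set; the rest form the analysis set
splits <- lapply(sort(unique(fold_id)), function(k) {
  list(
    analysis = which(fold_id != k),
    assessment = which(fold_id == k)
  )
})
lengths(lapply(splits, `[[`, "assessment"))
```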

You can optionally provide a custom function to distance_function. The function should take an sf object and return a stats::dist() object with distances between data points.
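As a minimal sketch of such a function, the helper below measures distances between feature centroids rather than between feature edges. The name centroid_distance and the synthetic points are assumptions made for illustration; for point data the centroids are the points themselves, so the choice only matters for lines and polygons.

```r
# Hypothetical helper (not part of spatialsample): between-centroid distances
centroid_distance <- function(x) {
  centroids <- sf::st_centroid(sf::st_geometry(x))
  stats::as.dist(sf::st_distance(centroids))
}

# Only run the usage example when the required packages are installed
if (requireNamespace("sf", quietly = TRUE) &&
    requireNamespace("spatialsample", quietly = TRUE)) {
  set.seed(123)
  # Synthetic longitude/latitude points, for illustration only
  pts <- data.frame(
    lon = runif(40, -77.1, -76.9),
    lat = runif(40, 38.8, 39.0)
  )
  pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

  centroid_folds <- spatialsample::spatial_clustering_cv(
    pts_sf,
    v = 4,
    distance_function = centroid_distance
  )
}
```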

You can optionally provide a custom function to cluster_function. The function must take three arguments:

dists, a stats::dist() object with distances between data points
v, a length-1 numeric for the number of folds to create
..., to pass any additional named arguments to your function

The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
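A custom clustering function that meets this contract might look like the sketch below, which uses average-linkage hierarchical clustering. The name average_linkage_clusters is hypothetical, and the function accepts but ignores any extra arguments passed through the dots.

```r
# Hypothetical helper (not part of spatialsample): average-linkage clustering.
# Takes a dist object, the number of folds, and ... (unused here), and returns
# one cluster label per observation.
average_linkage_clusters <- function(dists, v, ...) {
  tree <- stats::hclust(dists, method = "average")
  stats::cutree(tree, k = v)
}

# Standalone check on a small base-R distance object
set.seed(1)
toy_dists <- stats::dist(matrix(runif(40), ncol = 2))
toy_clusters <- average_linkage_clusters(toy_dists, v = 3)
length(toy_clusters)

# Only run the usage example when the required packages are installed
if (requireNamespace("sf", quietly = TRUE) &&
    requireNamespace("spatialsample", quietly = TRUE)) {
  # Synthetic longitude/latitude points, for illustration only
  pts <- data.frame(
    lon = runif(40, -77.1, -76.9),
    lat = runif(40, 38.8, 39.0)
  )
  pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

  custom_folds <- spatialsample::spatial_clustering_cv(
    pts_sf,
    v = 3,
    cluster_function = average_linkage_clusters
  )
}
```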

Value

A tibble with classes spatial_clustering_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id. Resamples created from non-sf objects will not have the spatial_rset class.
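The returned object can be inspected with the usual rsample accessors. The sketch below assumes the sf, spatialsample, and rsample packages are installed and uses synthetic points; the object names are illustrative.

```r
# Only run when the required packages are installed
if (requireNamespace("sf", quietly = TRUE) &&
    requireNamespace("spatialsample", quietly = TRUE) &&
    requireNamespace("rsample", quietly = TRUE)) {
  set.seed(123)
  # Synthetic longitude/latitude points, for illustration only
  pts <- data.frame(
    lon = runif(60, -77.1, -76.9),
    lat = runif(60, 38.8, 39.0),
    value = rnorm(60)
  )
  pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

  folds <- spatialsample::spatial_clustering_cv(pts_sf, v = 4)
  class(folds)

  # Each element of the splits column is one analysis/assessment split
  first_split <- folds$splits[[1]]
  n_analysis <- nrow(rsample::analysis(first_split))
  n_assessment <- nrow(rsample::assessment(first_split))

  # With no buffer, the two sets together cover every row of the data
  n_analysis + n_assessment == nrow(pts_sf)

  # Assessment-set size for every fold
  vapply(folds$splits, function(s) nrow(rsample::assessment(s)), integer(1))
}
```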

Changes in spatialsample 0.3.0

As of spatialsample version 0.3.0, this function no longer accepts non-sf objects as the data argument. To cluster non-spatial data, consider using rsample::clustering_cv().

Also as of version 0.3.0, this function calculates edge-to-edge distances for non-point geometries, in line with the rest of the package. Earlier versions relied upon between-centroid distances.

References

A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.

Examples


library(spatialsample)

data(Smithsonian, package = "modeldata")

smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  # Set CRS to WGS84
  crs = 4326
)

# When providing sf objects, coords are inferred automatically
spatial_clustering_cv(smithsonian_sf, v = 5)

# Can use hclust instead:
spatial_clustering_cv(smithsonian_sf, v = 5, cluster_function = "hclust")

[Package spatialsample version 0.5.1]