spatial_clustering_cv {spatialsample} | R Documentation |
Spatial Clustering Cross-Validation
Description
Spatial clustering cross-validation splits the data into V disjoint groups by clustering points based on their spatial coordinates. Each resample uses V-1 of the folds/clusters as the analysis set, while the assessment set contains the remaining fold/cluster.
Usage
spatial_clustering_cv(
  data,
  v = 10,
  cluster_function = c("kmeans", "hclust"),
  radius = NULL,
  buffer = NULL,
  ...,
  repeats = 1,
  distance_function = function(x) as.dist(sf::st_distance(x))
)
Arguments
data |
An sf object to split into folds. |
v |
The number of partitions of the data set. |
cluster_function |
Which function should be used for clustering?
Options are either "kmeans" or "hclust"; a custom function can also be supplied (see Details). |
radius |
Numeric: points within this distance of the initially-selected
test points will be assigned to the assessment set. If |
buffer |
Numeric: points within this distance of any point in the
test set (after |
... |
Extra arguments passed on to the clustering function. |
repeats |
The number of times to repeat the clustered partitioning. |
distance_function |
Which function should be used for distance
calculations? Defaults to sf::st_distance(), with the result converted to a stats::dist() object by as.dist(). |
Details
Clusters are created based on the distances between observations if data is an sf object. Each cluster is used as a fold for cross-validation. Depending on how the data are distributed spatially, there may not be an equal number of observations in each fold.
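As a rough illustration of why fold sizes can differ, the sketch below clusters simulated coordinates with base R and builds the analysis and assessment index sets by hand. It is not how spatial_clustering_cv() is implemented; the simulated points, the choice of stats::hclust(), and the names coords, clusters, and folds are assumptions made for this example.

```r
# Simulated planar coordinates (hypothetical data, not from the package)
set.seed(1)
coords <- cbind(x = runif(40), y = runif(40))

# Cluster on pairwise distances and cut the tree into v = 4 groups
v <- 4
clusters <- stats::cutree(stats::hclust(stats::dist(coords)), k = v)

# Cluster sizes depend on the spatial layout and are generally unequal
fold_sizes <- table(clusters)
print(fold_sizes)

# Each fold holds out one cluster for assessment and keeps the rest for analysis
folds <- lapply(seq_len(v), function(k) {
  list(
    analysis = which(clusters != k),
    assessment = which(clusters == k)
  )
})
```

Every observation appears in exactly one assessment set, so the assessment sets partition the data.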
You can optionally provide a custom function to distance_function. The function should take an sf object and return a stats::dist() object with distances between data points.
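A minimal sketch of such a function follows. It assumes point geometries in a projected (planar) coordinate system, where plain Euclidean distance on the raw coordinates is meaningful; the name coord_distance and the three test points are hypothetical.

```r
library(sf)

# Hypothetical custom distance function: takes an sf object of points and
# returns a stats::dist() object of Euclidean distances between coordinates.
# Assumes planar coordinates; not appropriate for longitude/latitude data.
coord_distance <- function(x) {
  stats::dist(sf::st_coordinates(x))
}

# Three illustrative points with no CRS set
pts <- st_as_sf(
  data.frame(x = c(0, 3, 0), y = c(0, 4, 1)),
  coords = c("x", "y")
)

d <- coord_distance(pts)
print(d)
```

Under the interface described above, the function would be supplied as distance_function = coord_distance.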
You can optionally provide a custom function to cluster_function. The function must take three arguments:
- dists, a stats::dist() object with distances between data points
- v, a length-1 numeric for the number of folds to create
- ..., to pass any additional named arguments to your function
The function should return a vector of cluster assignments of length nrow(data), with each element of the vector corresponding to the matching row of the data frame.
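One way such a function might look is sketched below, using hierarchical clustering from base R. The name hclust_cluster and the simulated points are hypothetical, and forwarding ... to stats::hclust() is a design choice for this example rather than a requirement of the package.

```r
# Hypothetical custom clustering function with the three required arguments
hclust_cluster <- function(dists, v, ...) {
  # Extra named arguments (for example method = "average") go to stats::hclust()
  tree <- stats::hclust(dists, ...)
  # Return one integer cluster label per observation
  stats::cutree(tree, k = v)
}

# Check the contract on simulated coordinates (illustrative data only)
set.seed(123)
pts <- cbind(x = runif(30), y = runif(30))
assignments <- hclust_cluster(stats::dist(pts), v = 4, method = "average")
print(table(assignments))
```

The result has one entry per row of the input, which is the shape described above. The function would then be supplied as cluster_function = hclust_cluster.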
Value
A tibble with classes spatial_clustering_cv, spatial_rset, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable id.
Resamples created from non-sf objects will not have the spatial_rset class.
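The returned object can be inspected like other rsample resamples. The sketch below assumes the split column is named splits, as in other rset objects, and uses rsample::analysis() and rsample::assessment() to pull out the two sets of the first fold; the seed and the variable names are arbitrary.

```r
library(spatialsample)

data(Smithsonian, package = "modeldata")
smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  crs = 4326
)

# k-means starts from random centers, so fix the seed for reproducibility
set.seed(123)
folds <- spatial_clustering_cv(smithsonian_sf, v = 5)

# Extract the analysis and assessment sets of the first fold
first_split <- folds$splits[[1]]
n_analysis <- nrow(rsample::analysis(first_split))
n_assessment <- nrow(rsample::assessment(first_split))
print(c(analysis = n_analysis, assessment = n_assessment))
```

With no radius or buffer applied, the two counts should add up to the number of rows in the input.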
Changes in spatialsample 0.3.0
As of spatialsample version 0.3.0, this function no longer accepts non-sf objects as arguments to data. In order to perform clustering with non-spatial data, consider using rsample::clustering_cv().
Also as of version 0.3.0, this function now calculates edge-to-edge distance for non-point geometries, in line with the rest of the package. Earlier versions relied upon between-centroid distances.
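The difference between the two distance definitions can be seen with a small sf sketch. The two unit squares and the helper unit_square are hypothetical and carry no CRS, so the distances are plain planar numbers.

```r
library(sf)

# Hypothetical helper: a unit square whose lower-left corner is at (x0, 0)
unit_square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two squares separated by a gap along the x axis
polys <- st_sfc(unit_square(0), unit_square(3))

# Edge-to-edge distance: shortest distance between the two polygons
edge_dist <- st_distance(polys)[1, 2]

# Centroid distance: distance between the polygon centers
centroid_dist <- st_distance(st_centroid(polys))[1, 2]

print(c(edge = edge_dist, centroid = centroid_dist))
```

For large or irregular polygons the two measures can diverge substantially, which may change which observations end up clustered together.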
References
A. Brenning, "Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest," 2012 IEEE International Geoscience and Remote Sensing Symposium, Munich, 2012, pp. 5372-5375, doi: 10.1109/IGARSS.2012.6352393.
Examples
data(Smithsonian, package = "modeldata")
smithsonian_sf <- sf::st_as_sf(
  Smithsonian,
  coords = c("longitude", "latitude"),
  # Set CRS to WGS84
  crs = 4326
)
# When providing sf objects, coords are inferred automatically
spatial_clustering_cv(smithsonian_sf, v = 5)
# Can use hclust instead:
spatial_clustering_cv(smithsonian_sf, v = 5, cluster_function = "hclust")