| clustr {divvy} | R Documentation |
Cluster localities within regions of nearest neighbours
Description
Spatially subsample a dataset based on minimum spanning trees connecting points within regions of set extent, with optional rarefaction to a site quota.
Usage
clustr(
  dat,
  xy,
  iter,
  nSite = NULL,
  distMax,
  nMin = 3,
  crs = "epsg:4326",
  output = "locs"
)
Arguments
dat |
A |
xy |
A vector of two elements, specifying the name or numeric position
of columns in |
iter |
The number of spatial subsamples to return |
nSite |
The quota of unique locations to include in each subsample. |
distMax |
Numeric value for maximum diameter (km) allowed across locations in a subsample |
nMin |
Numeric value for the minimum number of sites to be included in
every returned subsample. If nSite is supplied, nMin is ignored. |
crs |
Coordinate reference system as a GDAL text string, EPSG code,
or object of class |
output |
Whether the returned data should be two columns of
subsample site coordinates (output = 'locs') or all columns of dat for the rows at the subsampled locations (output = 'full'). |
Details
Lagomarcino and Miller (2012) developed an iterative approach of aggregating
localities to build clusters based on convex hulls, inspired by species-area
curve analysis (Scheiner 2003). Close et al. (2017, 2020) refined the approach and
changed the proximity metric from minimum convex hull area to minimum spanning
tree length. The present implementation adapts code from Close et al. (2020)
to add an option for site rarefaction after cluster construction and to grow
trees from random starting points iter times (instead of iterating
deterministically and exhaustively over every unique location).
The function takes a single location as a starting (seed) point; the seed
and its nearest neighbour initiate a spatial cluster. The distance between
the two points is the first branch in a minimum spanning tree for the cluster.
The location that has the shortest distance to any points already within the
cluster is grouped in next, and its distance (branch) is added to the sum
tree length. This iterative process continues until the largest distance
between any two points in the cluster would exceed distMax km.
In the rare case that multiple candidate points are tied for minimum distance
from the cluster, one point is selected at random as the next to include.
Any tree with fewer than nMin points is prohibited.
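The growth procedure described above can be sketched in base R. This is an illustration only, not the package's internal code: grow_cluster() is a hypothetical helper, and it assumes planar coordinates already expressed in km, whereas clustr() measures distances according to the supplied crs.

```r
# Illustrative sketch only (hypothetical helper, not divvy internals).
# Assumes coords holds planar x/y positions in km.
grow_cluster <- function(coords, seed, distMax) {
  d <- as.matrix(dist(coords))   # pairwise distances between all sites
  inClust <- seed                # row indices currently in the cluster
  treeLen <- 0                   # running sum of branch lengths
  repeat {
    out <- setdiff(seq_len(nrow(coords)), inClust)
    if (length(out) == 0) break
    # distance from each outside site to its nearest cluster member
    gap <- apply(d[out, inClust, drop = FALSE], 1, min)
    tied <- out[gap == min(gap)]
    # break ties at random; sample.int() avoids sample()'s special
    # handling of a length-one numeric vector
    nxt <- tied[sample.int(length(tied), 1)]
    # stop if adding the site would push the cluster diameter past distMax
    if (max(d[nxt, inClust]) > distMax) break
    inClust <- c(inClust, nxt)
    treeLen <- treeLen + min(gap)
  }
  list(sites = inClust, treeLength = treeLen)
}

# four sites on a line; the last lies too far away to join
coords <- data.frame(x = c(0, 1, 2, 10), y = 0)
res <- grow_cluster(coords, seed = 1, distMax = 3)
res$sites       # rows 1, 2, 3
res$treeLength  # 2
```

Under this sketch, a caller would discard any result with fewer than nMin entries in sites.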
If nSite is supplied, the nMin argument is ignored,
and any tree with fewer than nSite points is prohibited.
After building a tree as described above, a random set of nSite points
within the cluster is taken (without replacement).
The nSite argument makes clustr() comparable with cookies()
in that it spatially standardises both extent and area/locality number.
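As a minimal sketch of the rarefaction step, assuming a cluster is represented by a vector of row indices, the quota could be applied as follows. The helper rarefy_cluster() and the index values are hypothetical and are not part of divvy.

```r
# Hypothetical helper: rarefy a completed cluster to a site quota.
rarefy_cluster <- function(sites, nSite) {
  if (length(sites) < nSite) return(NULL)  # too few sites: cluster prohibited
  sites[sample.int(length(sites), nSite)]  # random draw without replacement
}

set.seed(1)
kept <- rarefy_cluster(c(3, 7, 8, 12, 15), nSite = 4)  # four of the five sites
small <- rarefy_cluster(c(3, 7), nSite = 4)            # NULL: below the quota
```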
clustr() is optimised on the assumption that iter
is much larger than the number of unique localities. Internal code first
calculates the full minimum spanning tree at every viable starting point
before it then samples those trees (i.e. resamples and optionally rarefies)
for the specified number of iterations. This sequence means the total
run-time increases only marginally even as iter increases greatly.
However, if there are a large number of sites, particularly a large number
of densely-spaced sites, the calculations will be slow even for a
small number of iterations.
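The build-once, sample-many sequence can be illustrated with a stand-in for the precomputed trees. In this sketch the trees list holds hypothetical row indices, one vector per viable starting point. Only the cheap sampling step is repeated iter times, which is why run-time grows slowly with iter.

```r
# Stand-in for the expensive step: one precomputed cluster per viable seed
# (hypothetical row indices, not output from divvy).
trees <- list(c(1, 2, 3, 4), c(2, 3, 4, 5, 6), c(7, 8, 9, 10))

iter <- 1000
nSite <- 3
set.seed(42)

# Cheap step, repeated iter times: pick a precomputed tree, then rarefy it.
subsamples <- replicate(iter, {
  tr <- trees[[sample.int(length(trees), 1)]]
  tr[sample.int(length(tr), nSite)]
}, simplify = FALSE)

length(subsamples)  # equals iter
```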
Value
A list of length iter. Each element is a data.frame
(or matrix, if dat is a matrix and output = 'full').
If nSite is supplied, each element contains nSite observations.
If output = 'locs' (default), only the coordinates of subsampling
locations are returned.
If output = 'full', all dat columns are returned for the
rows associated with the subsampled locations.
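Because the return value is a plain list, base R tools are enough to summarise or stack the subsamples. The sketch below uses a hand-made stand-in list (subs) in place of real clustr() output, so the coordinate values are illustrative only.

```r
# Stand-in for clustr() output with output = 'locs': a list of data.frames.
subs <- list(
  data.frame(x = c(140, 141), y = c(-20, -21)),
  data.frame(x = c(143, 144, 145), y = c(-23, -24, -25))
)

# number of sites in each subsample
nPerIter <- vapply(subs, nrow, integer(1))

# stack all subsamples into one data.frame, labelled by iteration
stacked <- do.call(
  rbind,
  Map(function(df, i) cbind(iter = i, df), subs, seq_along(subs))
)
```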
References
Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115–121. doi:10.1016/j.cub.2019.10.065.
Close RA, Benson RB, Upchurch P, Butler RJ (2017). “Controlling for the species–area effect supports constrained long-term Mesozoic terrestrial vertebrate diversification.” Nature Communications, 8(1), 1–11. doi:10.1038/ncomms15381.
Close RA, Benson RB, Saupe EE, Clapham ME, Butler RJ (2020). “The spatial structure of Phanerozoic marine animal diversity.” Science, 368(6489), 420–424. doi:10.1126/science.aay8309.
Lagomarcino AJ, Miller AI (2012). “The relationship between genus richness and geographic area in Late Cretaceous marine biotas: Epicontinental sea versus open-ocean-facing settings.” PLoS ONE, 7(8), e40472. doi:10.1371/journal.pone.0040472.
Scheiner SM (2003). “Six types of species–area curves.” Global Ecology and Biogeography, 12(6), 441–447. doi:10.1046/j.1466-822X.2003.00061.x.
See Also
Examples
# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)
# sample 5 sets of 4 locations no more than 400 km across
clustr(dat = pts, xy = 1:2, iter = 5,
       nSite = 4, distMax = 400)