clustr {divvy} | R Documentation |
Cluster localities within regions of nearest neighbours
Description
Spatially subsample a dataset based on minimum spanning trees connecting points within regions of set extent, with optional rarefaction to a site quota.
Usage
clustr(
dat,
xy,
iter,
nSite = NULL,
distMax,
nMin = 3,
crs = "epsg:4326",
output = "locs"
)
Arguments
dat |
A |
xy |
A vector of two elements, specifying the name or numeric position
of columns in |
iter |
The number of spatial subsamples to return |
nSite |
The quota of unique locations to include in each subsample. |
distMax |
Numeric value for maximum diameter (km) allowed across locations in a subsample |
nMin |
Numeric value for the minimum number of sites to be included in
every returned subsample. If |
crs |
Coordinate reference system as a GDAL text string, EPSG code,
or object of class |
output |
Whether the returned data should be two columns of
subsample site coordinates ( |
Details
Lagomarcino and Miller (2012) developed an iterative approach of aggregating
localities to build clusters based on convex hulls, inspired by species-area
curve analysis (Scheiner 2003). Close et al. (2017, 2020) refined the approach and
changed the proximity metric from minimum convex hull area to minimum spanning
tree length. The present implementation adapts code from Close et al. (2020)
to add an option for site rarefaction after cluster construction and to grow
trees at random starting points iter number of times (instead of a
deterministic, exhaustive iteration at every unique location).
The function takes a single location as a starting (seed) point; the seed
and its nearest neighbour initiate a spatial cluster. The distance between
the two points is the first branch in a minimum spanning tree for the cluster.
The location that has the shortest distance to any points already within the
cluster is grouped in next, and its distance (branch) is added to the sum
tree length. This iterative process continues until the largest distance
between any two points in the cluster would exceed distMax km.
In the rare case multiple candidate points are tied for minimum distance
from the cluster, one point is selected at random as the next to include.
Any tree with fewer than nMin points is prohibited.
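The cluster-growing steps above can be sketched in base R. This is a minimal illustration and not the package's internal code: haversine_km(), dist_matrix() and grow_cluster() are hypothetical helpers, and distances assume a spherical Earth of radius 6371 km, which may differ from how clustr() measures distance.

```r
# Great-circle distance (km) between lon/lat points given in degrees.
# Assumption: spherical Earth, radius 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Symmetric matrix of pairwise distances for a two-column (lon, lat) table.
dist_matrix <- function(xy) {
  xy <- as.matrix(xy)
  idx <- seq_len(nrow(xy))
  outer(idx, idx, function(i, j) {
    haversine_km(xy[i, 1], xy[i, 2], xy[j, 1], xy[j, 2])
  })
}

# Grow one cluster from a seed (hypothetical helper).
# d: pairwise distance matrix; seed: row index; dist_max: diameter limit (km).
grow_cluster <- function(d, seed, dist_max) {
  in_clust <- seed
  tree_len <- 0
  repeat {
    out <- setdiff(seq_len(nrow(d)), in_clust)
    if (length(out) == 0) break
    # shortest distance from each outside point to any cluster member
    gap <- apply(d[out, in_clust, drop = FALSE], 1, min)
    cand <- out[gap == min(gap)]
    # break ties at random; avoid sample() on a length-one vector
    nxt <- if (length(cand) > 1) sample(cand, 1) else cand
    # stop if the new point would push the cluster diameter past dist_max
    if (max(d[nxt, in_clust]) > dist_max) break
    in_clust <- c(in_clust, nxt)
    tree_len <- tree_len + min(gap)  # add the new branch to the tree length
  }
  list(sites = in_clust, tree_length = tree_len)
}

# Usage: 10 points along a line, seed at the first point, 400 km limit
pts <- data.frame(x = seq(140, 145, length.out = 10),
                  y = seq(-20, -25, length.out = 10))
d <- dist_matrix(pts)
res <- grow_cluster(d, seed = 1, dist_max = 400)
res$sites
```

A cluster returned by this sketch would still need to be checked against nMin before use, since a seed far from every other point yields a cluster of one.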
If nSite is supplied, the nMin argument is ignored,
and any tree with fewer than nSite points is prohibited.
After building a tree as described above, a random set of nSite points
within the cluster is taken (without replacement).
The nSite argument makes clustr() comparable with cookies()
in that it spatially standardises both extent and area/locality number.
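The rarefaction step can be illustrated with a short base R sketch. rarefy_sites() is a hypothetical helper written for this illustration, not a function exported by the package; it takes the row indices of one cluster and draws a quota of them without replacement.

```r
# Draw n_site members from a cluster without replacement (hypothetical helper).
# sites: vector of site indices in one cluster; n_site: site quota.
rarefy_sites <- function(sites, n_site) {
  # a cluster smaller than the quota cannot be rarefied, so it is rejected
  if (length(sites) < n_site) return(NULL)
  # index via sample.int() so a length-one numeric 'sites' is handled safely
  sites[sample.int(length(sites), n_site)]
}

# Usage: a cluster of five sites rarefied to a quota of four
set.seed(1)
cluster <- c(3, 7, 9, 12, 15)
kept <- rarefy_sites(cluster, n_site = 4)
too_small <- rarefy_sites(1:3, n_site = 4)  # NULL: fewer sites than the quota
```

Indexing through sample.int() rather than calling sample(sites, n_site) directly sidesteps the base R behaviour where sample() treats a single number x as the range 1:x.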
The performance of clustr() is optimised on the assumption that iter
is much larger than the number of unique localities. Internal code first
calculates the full minimum spanning tree at every viable starting point
before it then samples those trees (i.e. resamples and optionally rarefies)
for the specified number of iterations. This sequence means the total
run-time increases only marginally even as iter increases greatly.
However, if there are a large number of sites, particularly a large number
of densely-spaced sites, the calculations will be slow even for a
small number of iterations.
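The reason run-time depends weakly on iter can be seen in a small sketch of the build-once, sample-many pattern. This is an illustration under stated assumptions and not the package's code: sample_clusters() is a hypothetical helper, and the clusters are supplied as a precomputed list of site-index vectors (one per starting point) rather than being built from coordinates.

```r
# Resample precomputed clusters (hypothetical helper).
# clusters: list of site-index vectors, one per starting point
# iter: number of subsamples; n_site: optional quota; n_min: minimum size
sample_clusters <- function(clusters, iter, n_site = NULL, n_min = 3) {
  # when a quota is given it replaces n_min as the size requirement
  quota <- if (is.null(n_site)) n_min else n_site
  ok <- clusters[lengths(clusters) >= quota]
  if (length(ok) == 0) stop("no cluster meets the size requirement")
  # cheap step: pick a viable cluster at random for each iteration
  picks <- sample.int(length(ok), iter, replace = TRUE)
  lapply(picks, function(i) {
    sites <- ok[[i]]
    if (is.null(n_site)) sites else sites[sample.int(length(sites), n_site)]
  })
}

# Usage: three precomputed clusters, one of which is too small to be used
set.seed(42)
clusters <- list(1:5, 2:6, c(9, 10))
subs <- sample_clusters(clusters, iter = 1000, n_site = 4)
```

The expensive work (building the clusters) happens once per starting point, while each additional iteration costs only a random draw, so raising iter from 10 to 1000 adds little time in this sketch.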
Value
A list of length iter. Each element is a data.frame
(or matrix, if dat is a matrix and output = 'full').
If nSite is supplied, each element contains nSite observations.
If output = 'locs' (default), only the coordinates of subsampling
locations are returned.
If output = 'full', all dat columns are returned for the
rows associated with the subsampled locations.
References
Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115-121. doi:10.1016/j.cub.2019.10.065.
Close RA, Benson RB, Upchurch P, Butler RJ (2017). “Controlling for the species–area effect supports constrained long-term Mesozoic terrestrial vertebrate diversification.” Nature Communications, 8(1), 1-11. doi:10.1038/ncomms15381.
Close RA, Benson RB, Saupe EE, Clapham ME, Butler RJ (2020). “The spatial structure of Phanerozoic marine animal diversity.” Science, 368(6489), 420-424. doi:10.1126/science.aay8309.
Lagomarcino AJ, Miller AI (2012). “The relationship between genus richness and geographic area in Late Cretaceous marine biotas: Epicontinental sea versus open-ocean-facing settings.” PloS One, 7(8), e40472. doi:10.1371/journal.pone.0040472.
Scheiner SM (2003). “Six types of species–area curves.” Global Ecology and Biogeography, 12(6), 441-447. doi:10.1046/j.1466-822X.2003.00061.x.
See Also
Examples
# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)
# sample 5 sets of 4 locations no more than 400 km across
clustr(dat = pts, xy = 1:2, iter = 5,
nSite = 4, distMax = 400)