R: Cluster localities within regions of nearest neighbours

clustr {divvy}

R Documentation

Cluster localities within regions of nearest neighbours

Description

Spatially subsample a dataset based on minimum spanning trees connecting points within regions of set extent, with optional rarefaction to a site quota.

Usage

clustr(
  dat,
  xy,
  iter,
  nSite = NULL,
  distMax,
  nMin = 3,
  crs = "epsg:4326",
  output = "locs"
)

Arguments

`dat`	A `data.frame` or `matrix` containing the coordinate columns `xy` and any associated variables, e.g. taxon names.
`xy`	A vector of two elements, specifying the name or numeric position of columns in `dat` containing coordinates, e.g. longitude and latitude. Coordinates for any shared sampling sites should be identical, and where sites are raster cells, coordinates are usually expected to be cell centroids.
`iter`	The number of spatial subsamples to return.
`nSite`	The quota of unique locations to include in each subsample.
`distMax`	Numeric value for the maximum diameter (km) allowed across locations in a subsample.
`nMin`	Numeric value for the minimum number of sites to be included in every returned subsample. If `nSite` is supplied, `nMin` is ignored.
`crs`	Coordinate reference system as a GDAL text string, EPSG code, or object of class `crs`. Default is latitude-longitude (`EPSG:4326`).
`output`	Whether the returned data should be two columns of subsample site coordinates (`output = 'locs'`) or the subset of rows from `dat` associated with those coordinates (`output = 'full'`).

Details

Lagomarcino and Miller (2012) developed an iterative approach of aggregating localities to build clusters based on convex hulls, inspired by species-area curve analysis (Scheiner 2003). Close et al. (2017, 2020) refined the approach and changed the proximity metric from minimum convex hull area to minimum spanning tree length. The present implementation adapts code from Close et al. (2020) in two ways: it adds an option for site rarefaction after cluster construction, and it grows trees from random starting points `iter` times (instead of iterating deterministically and exhaustively over every unique location).

The function takes a single location as a starting (seed) point; the seed and its nearest neighbour initiate a spatial cluster. The distance between the two points is the first branch in a minimum spanning tree for the cluster. The location with the shortest distance to any point already within the cluster is added next, and its distance (branch) is added to the total tree length. This iterative process continues until the largest distance between any two points in the cluster would exceed `distMax` km. In the rare case that multiple candidate points are tied for minimum distance from the cluster, one point is selected at random as the next to include. Any tree with fewer than `nMin` points is prohibited.
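
The growth procedure described above can be sketched in base R. This is an illustration only, not the package's internal code: the helper name `grow_cluster()` and its arguments are hypothetical, and it assumes planar coordinates with Euclidean distances, whereas `clustr()` works with a coordinate reference system and a diameter in km.

```r
# Hypothetical helper, not part of divvy: grow one cluster from a seed row.
# `xy` is a two-column numeric matrix of planar coordinates (assumption),
# and `dist_max` is in the same units as the coordinates.
grow_cluster <- function(xy, seed, dist_max) {
  d <- as.matrix(dist(xy))            # pairwise Euclidean distances
  in_clust <- seed                    # row indices currently in the cluster
  tree_len <- 0                       # running sum of branch lengths
  repeat {
    out <- setdiff(seq_len(nrow(xy)), in_clust)
    if (length(out) == 0) break
    # shortest link from each outside point to any point in the cluster
    link <- apply(d[out, in_clust, drop = FALSE], 1, min)
    cands <- out[link == min(link)]
    # break ties at random (guarding against sample()'s length-one behaviour)
    nxt <- if (length(cands) > 1) sample(cands, 1) else cands
    # stop if adding the point would push the cluster diameter past dist_max
    if (max(d[nxt, in_clust]) > dist_max) break
    in_clust <- c(in_clust, nxt)
    tree_len <- tree_len + min(link)
  }
  list(sites = in_clust, tree_length = tree_len)
}

# five sites on a line; the fourth would make the cluster 3 units across
xy <- cbind(x = c(0, 1, 2, 3, 10), y = 0)
res <- grow_cluster(xy, seed = 1, dist_max = 2.5)
```

In this toy case the cluster stops growing once the next nearest site would exceed the diameter limit. A caller would then discard the result if it held fewer than the minimum number of sites.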

If `nSite` is supplied, the `nMin` argument is ignored, and any tree with fewer than `nSite` points is prohibited. After building a tree as described above, a random set of `nSite` points within the cluster is taken (without replacement). The `nSite` argument makes `clustr()` comparable with `cookies()` in that it spatially standardises both extent and area/locality number.
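
The rarefaction step can be illustrated with a short base R sketch. The helper `rarefy_cluster()` is hypothetical and only mirrors the behaviour described in the paragraph above; it is not the package's implementation.

```r
# Hypothetical helper, not part of divvy: rarefy one cluster to a site quota.
# `sites` holds the identifiers (e.g. row indices) of the sites in a cluster.
rarefy_cluster <- function(sites, n_site) {
  # a cluster smaller than the quota is prohibited
  if (length(sites) < n_site) return(NULL)
  # draw positions rather than values, so a length-one `sites` is handled safely
  sites[sample.int(length(sites), n_site)]   # without replacement by default
}

set.seed(1)
kept <- rarefy_cluster(c(3, 7, 9, 12, 20), n_site = 4)
too_small <- rarefy_cluster(1:3, n_site = 4)
```

Every retained subsample then holds exactly the quota of sites, drawn from within a cluster whose extent is already capped.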

The function `clustr()` is optimised for the case where `iter` is much larger than the number of unique localities. Internal code first calculates the full minimum spanning tree at every viable starting point, and only then samples those trees (i.e. resamples and optionally rarefies) for the specified number of iterations. This sequence means the total run-time increases only marginally even as `iter` increases greatly. However, if there are a large number of sites, particularly a large number of densely-spaced sites, the calculations will be slow even for a small number of iterations.
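
The "build once, sample many times" sequence can be sketched as follows. The function `sample_precomputed()` and the stand-in `toy_build()` are hypothetical names, and the toy cluster rule (all sites within 2 units of the seed on a line) is a deliberate simplification rather than a minimum spanning tree.

```r
# Hypothetical illustration, not divvy's internals.
sample_precomputed <- function(n_seed, build_cluster, iter, n_min = 3) {
  # expensive step: one cluster per candidate seed, computed a single time
  clusters <- lapply(seq_len(n_seed), build_cluster)
  # discard seeds whose cluster is too small to be viable
  viable <- clusters[lengths(clusters) >= n_min]
  if (length(viable) == 0) stop("no viable clusters")
  # cheap step: pick a precomputed cluster at random for each iteration
  viable[sample.int(length(viable), iter, replace = TRUE)]
}

# toy stand-in for cluster construction (assumption, not an MST)
x <- c(0, 1, 2, 3, 10)
toy_build <- function(seed) which(abs(x - x[seed]) <= 2)

out <- sample_precomputed(length(x), toy_build, iter = 1000)
```

Because the per-seed work happens once, raising `iter` from 10 to 1000 here only adds more draws from the stored list, which is why the cost is dominated by the number and density of sites rather than by the iteration count.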

Value

A list of length `iter`. Each element is a `data.frame` (or `matrix`, if `dat` is a `matrix` and `output = 'full'`). If `nSite` is supplied, each element contains `nSite` observations. If `output = 'locs'` (default), only the coordinates of subsampling locations are returned. If `output = 'full'`, all `dat` columns are returned for the rows associated with the subsampled locations.

References

Antell GT, Kiessling W, Aberhan M, Saupe EE (2020). “Marine biodiversity and geographic distributions are independent on large scales.” Current Biology, 30(1), 115-121. doi:10.1016/j.cub.2019.10.065.

Close RA, Benson RB, Upchurch P, Butler RJ (2017). “Controlling for the species–area effect supports constrained long-term Mesozoic terrestrial vertebrate diversification.” Nature Communications, 8(1), 1–11. doi:10.1038/ncomms15381.

Close RA, Benson RB, Saupe EE, Clapham ME, Butler RJ (2020). “The spatial structure of Phanerozoic marine animal diversity.” Science, 368(6489), 420-424. doi:10.1126/science.aay8309.

Lagomarcino AJ, Miller AI (2012). “The relationship between genus richness and geographic area in Late Cretaceous marine biotas: Epicontinental sea versus open-ocean-facing settings.” PLoS ONE, 7(8), e40472. doi:10.1371/journal.pone.0040472.

Scheiner SM (2003). “Six types of species–area curves.” Global Ecology and Biogeography, 12(6), 441-447. doi:10.1046/j.1466-822X.2003.00061.x.

Examples

# generate occurrences: 10 lat-long points in modern Australia
n <- 10
x <- seq(from = 140, to = 145, length.out = n)
y <- seq(from = -20, to = -25, length.out = n)
pts <- data.frame(x, y)

# sample 5 sets of 4 locations no more than 400km across
clustr(dat = pts, xy = 1:2, iter = 5,
       nSite = 4, distMax = 400)

[Package divvy version 1.0.0 Index]