geoFold {enmSdmX}    R Documentation

Assign geographically-distinct k-folds

Description

This function generates geographically-distinct cross-validation folds, or "geo-folds" ("g-folds" for short). Points are grouped by proximity to one another. Folds can be forced to have at least a minimum number of points in them. Results are deterministic (i.e., the same every time for the same data).

More specifically, g-folds are created using this process:

The potential downside of this approach is that the last fold is assigned all remaining points, so it will be the largest. One way to avoid gross imbalance is to choose minIn so that it divides the points into nearly equally-sized groups (for example, floor(n / k) for n points and k folds).
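The distance-based grouping described above can be sketched in base R. This is a minimal illustration only, not the code used by geoFold: it assumes toy planar coordinates and Euclidean distances, and it omits the steps that enforce minIn. The object names (xy, tree, folds, foldSizes) are hypothetical.

```r
# Toy planar coordinates: three well-separated groups of 10 points each
# (hypothetical data, for illustration only)
set.seed(1)
xy <- rbind(
	cbind(rnorm(10, 0, 0.3), rnorm(10, 0, 0.3)),
	cbind(rnorm(10, 10, 0.3), rnorm(10, 0, 0.3)),
	cbind(rnorm(10, 5, 0.3), rnorm(10, 10, 0.3))
)

# Pairwise distances (Euclidean here, which is an assumption of this sketch)
d <- dist(xy)

# Build a dendrogram by hierarchical clustering, then cut it into k groups
k <- 3
tree <- hclust(d, method = 'complete')
folds <- cutree(tree, k = k)

# Number of points assigned to each group
foldSizes <- table(folds)
print(foldSizes)
```

Because the toy groups are far apart relative to their spread, each group ends up in its own fold. Real occurrence data are rarely this tidy, which is why a minimum fold size may be needed.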

Usage

geoFold(x, k, minIn = 1, longLat = 1:2, method = "complete", ...)

Arguments

x

A "spatial points" object of class SpatVector, sf, data.frame, or matrix. If x is a data.frame or matrix, then the points will be assumed to have the WGS84 coordinate system (i.e., unprojected).

k

Numeric: Number of folds to create.

minIn

Numeric: Minimum number of points required to be in a fold.

longLat

Character or integer vector: This is ignored if x is a SpatVector or sf object. However, if x is a data.frame or matrix, then this should be a character or integer vector specifying the columns in x corresponding to longitude and latitude (in that order). For example, c('long', 'lat') or c(1, 2). The default is to assume that the first two columns in x represent coordinates.

method

Character: Method used by hclust to cluster points. By default, this is 'complete', but in some cases this may result in strange clustering (especially when there is a large number of points). The 'single' method (or others) may give more reasonable results in these cases.
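To see how much the linkage method can matter, the following base R sketch clusters the same points with two hclust methods and compares group sizes. It is an illustration under stated assumptions (hypothetical points placed along a line, Euclidean distances) and does not call geoFold itself.

```r
# Hypothetical points along a line, with gaps that widen from left to right
x <- c(0, 1, 2.1, 3.3, 4.6, 6, 7.5, 9.1, 10.8, 14)
xy <- cbind(x, 0)
d <- dist(xy)

# Cut the dendrogram from each linkage method into the same number of groups
k <- 3
singleFolds <- cutree(hclust(d, method = 'single'), k = k)
completeFolds <- cutree(hclust(d, method = 'complete'), k = k)

# Compare the number of points per group
table(singleFolds)
table(completeFolds)
```

For these points, 'single' linkage yields groups of 8, 1, and 1 points, whereas 'complete' linkage yields groups of 4, 4, and 2. Neither is correct in general, so it is worth mapping the folds when trying a different method.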

...

Additional arguments (unused)

Details

Note that in general it is probably mathematically impossible to cluster points in 2-dimensional space into k groups, each with at least minIn points, in a manner that seems "reasonable" to the eye in all cases. In experimentation, "unreasonable" results often appear when the number of groups is high.
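A simple necessary condition can be checked before choosing k and minIn: k folds with at least minIn points each require at least k * minIn points in total. The helper below is hypothetical (it is not part of enmSdmX) and says nothing about whether the resulting folds will look reasonable on a map.

```r
# Hypothetical helper (not part of enmSdmX): can n points fill k folds
# with at least minIn points in each?
canFillFolds <- function(n, k, minIn) {
	k * minIn <= n
}

n <- 100
k <- 3
minIn <- floor(n / k) # largest minimum fold size that can be satisfied

canFillFolds(n, k, minIn)     # TRUE: 3 * 33 = 99 points needed
canFillFolds(n, k, minIn + 1) # FALSE: 3 * 34 = 102 points needed
```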

Value

A vector of integers the same length as the number of points in x. Each integer indicates which fold a point in x belongs to.
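A fold vector of this form can be turned into training and testing index sets for cross-validation. The sketch below uses a hypothetical vector of fold assignments rather than real output from geoFold, and the names folds and splits are illustrative.

```r
# Hypothetical fold assignments for 9 points, in the form geoFold returns
folds <- c(1, 1, 2, 3, 2, 1, 3, 3, 2)
k <- max(folds)

# For each fold, hold out its points for testing and train on the rest
splits <- lapply(seq_len(k), function(i) {
	list(
		test = which(folds == i),
		train = which(folds != i)
	)
})

splits[[1]]$test  # 1 2 6
splits[[1]]$train # 3 4 5 7 8 9
```

Each element of splits holds row indices, so they can be used directly to subset the rows of x when fitting and evaluating a model.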

See Also

geoFoldContrast

Examples

library(enmSdmX)
library(sf)
library(terra)

# lemur occurrence data
data(mad0)
data(lemurs)
crs <- getCRS('WGS84')
ll <- c('longitude', 'latitude')

# use occurrences of all species... easier to see on map
occs <- st_as_sf(lemurs, coords = ll, crs = crs)

# create 100 background points
mad0 <- vect(mad0)
bg <- spatSample(mad0, 100)

### assign 3 folds to occurrences and to background sites
k <- 3
minIn <- floor(nrow(occs) / k) # maximally spread between folds

presFolds <- geoFold(occs, k = k, minIn = minIn)
bgFolds <- geoFoldContrast(bg, pres = occs, presFolds = presFolds)

# number of sites per fold
table(presFolds)
table(bgFolds)

# map
plot(mad0, border = 'gray', main = paste(k, 'geo-folds'))
plot(bg, pch = 3, col = bgFolds + 1, add = TRUE)
plot(st_geometry(occs), pch = 20 + presFolds, bg = presFolds + 1, add = TRUE)

legend(
	'bottomright',
	legend = c(
		'presence fold 1',
		'presence fold 2',
		'presence fold 3',
		'background fold 1',
		'background fold 2',
		'background fold 3'
	),
	# one symbol, outline color, and fill per legend entry (6 entries)
	pch = c(21, 22, 23, 3, 3, 3),
	col = c(rep('black', 3), 2, 3, 4),
	pt.bg = c(2, 3, 4, NA, NA, NA)
)

[Package enmSdmX version 1.1.6 Index]