geoFold {enmSdmX} | R Documentation |
Assign geographically-distinct k-folds
Description
This function generates geographically-distinct cross-validation folds, or "geo-folds" ("g-folds" for short). Points are grouped by proximity to one another. Folds can be forced to have at least a minimum number of points in them. Results are deterministic (i.e., the same every time for the same data).
More specifically, g-folds are created using this process:
To start, all pairwise distances between points are calculated. These are used in a clustering algorithm to create a dendrogram of relationships by distance. The dendrogram is then "cut" so it has k groups (folds). If each fold has at least the minimum desired number of points (minIn), then the process stops and fold assignments are returned. However, if at least one fold has fewer than the desired number of points, a series of steps is executed.
First, the fold with a centroid that is farthest from all others is selected. If it has sufficient points, then the next-most distant fold is selected, and so on.
Once a fold is identified that has fewer than the desired number of points, it is grown by adding to it the points closest to its centroid, one at a time. Each time a point is added, the fold centroid is recalculated. The fold is grown until it has the desired number of points. Call this "fold #1". From here on, these points are considered "assigned" and are not eligible for re-assignment.
The remaining "unassigned" points are then clustered again, but this time into k - 1 folds. Again, the most distant group that has fewer than the desired number of points is identified. This fold is then grown as before, using only unassigned points, and becomes "fold #2." The process repeats iteratively until there are k folds assigned, each with at least the desired number of points.
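The first pass of the process described above can be sketched in base R. This is an illustration only, not the enmSdmX source: the coordinates are made up, distances are plain Euclidean (the package may compute them differently), `growFold()` is a hypothetical helper, and the fold to grow is chosen as the smallest one rather than the one with the most distant centroid.

```r
# Illustrative sketch only; not the enmSdmX implementation.
set.seed(1)
xy <- cbind(x = runif(30), y = runif(30)) # made-up planar coordinates
k <- 3
minIn <- 5

# pairwise distances -> dendrogram -> cut into k groups
d <- dist(xy) # Euclidean distance (an assumption)
tree <- hclust(d, method = "complete")
folds <- cutree(tree, k = k)
sizes <- table(folds)

# if every fold already has at least minIn points, the process would stop here
allOk <- all(sizes >= minIn)

# hypothetical helper: grow a fold by repeatedly adding the non-member
# point closest to the fold's current centroid
growFold <- function(xy, members, minIn) {
	while (sum(members) < minIn && any(!members)) {
		centroid <- colMeans(xy[members, , drop = FALSE])
		dists <- sqrt((xy[, 1] - centroid[1])^2 + (xy[, 2] - centroid[2])^2)
		dists[members] <- Inf # points already in the fold cannot be re-added
		members[which.min(dists)] <- TRUE # centroid is recomputed on the next pass
	}
	members
}

# simplification: grow the smallest fold (the text selects by centroid distance)
smallest <- as.integer(names(sizes)[which.min(sizes)])
grown <- growFold(xy, folds == smallest, minIn)
```

In the full procedure, the points flagged in `grown` would then be set aside as "assigned" and the remaining points re-clustered into k - 1 groups.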
The potential downside of this approach is that the last fold is assigned the remainder of the points, so it will be the largest. One way to avoid gross imbalance is to select the value of minIn such that it divides the points into nearly equally-sized groups.
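A small arithmetic sketch shows why the choice of minIn bounds the imbalance. Because every fold must hold at least minIn points, no single fold can hold more than n - (k - 1) * minIn points. The function name `maxFoldSize()` and the point counts below are hypothetical and used only for illustration.

```r
# Upper bound on the size of any one fold, given that each of the
# other k - 1 folds holds at least minIn points.
# maxFoldSize() is a hypothetical helper, not part of enmSdmX.
maxFoldSize <- function(n, k, minIn) {
	n - (k - 1) * minIn
}

n <- 100 # hypothetical number of points
k <- 3

loose <- maxFoldSize(n, k, minIn = 1) # weak constraint: one fold could hold nearly everything
tight <- maxFoldSize(n, k, minIn = floor(n / k)) # folds forced to be nearly equal
```

Setting minIn to floor(n / k), as in the example at the end of this page, gives the tightest bound that is still feasible.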
Usage
geoFold(x, k, minIn = 1, longLat = 1:2, method = "complete", ...)
Arguments
x |
A "spatial points" object of class |
k |
Numeric: Number of folds to create. |
minIn |
Numeric: Minimum number of points required to be in a fold. |
longLat |
Character or integer vector: This is ignored if |
method |
Character: Method used by |
... |
Additional arguments (unused) |
Details
Note that in general it is probably mathematically impossible to cluster points in 2-dimensional space into k groups, each with at least minIn points, in a manner that seems "reasonable" to the eye in all cases. In experimentation, "unreasonable" results often appear when the number of groups is high.
Value
A vector of integers the same length as the number of points in x. Each integer indicates which fold a point in x belongs to.
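As a minimal sketch of how such a vector might be used, the code below turns fold assignments into training and testing row indices for cross-validation. The `folds` vector is made up for illustration and stands in for the output of geoFold().

```r
# Hypothetical fold assignments for 8 points (stand-in for geoFold() output)
folds <- c(1, 1, 2, 3, 2, 3, 1, 2)

# For each fold, hold it out for testing and train on the rest
splits <- lapply(sort(unique(folds)), function(i) {
	list(
		fold = i,
		train = which(folds != i), # rows used to calibrate the model
		test = which(folds == i) # rows withheld for evaluation
	)
})
```

Each element of `splits` holds row indices that can be used to subset the original points object.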
See Also
Examples
library(enmSdmX)
library(sf)
library(terra)
# lemur occurrence data
data(mad0)
data(lemurs)
crs <- getCRS('WGS84')
ll <- c('longitude', 'latitude')
# use occurrences of all species... easier to see on map
occs <- st_as_sf(lemurs, coords = ll, crs = crs)
# create 100 background points
mad0 <- vect(mad0)
bg <- spatSample(mad0, 100)
### assign 3 folds to occurrences and to background sites
k <- 3
minIn <- floor(nrow(occs) / k) # maximally spread between folds
presFolds <- geoFold(occs, k = k, minIn = minIn)
bgFolds <- geoFoldContrast(bg, pres = occs, presFolds = presFolds)
# number of sites per fold
table(presFolds)
table(bgFolds)
# map
plot(mad0, border = 'gray', main = paste(k, 'geo-folds'))
plot(bg, pch = 3, col = bgFolds + 1, add = TRUE)
plot(st_geometry(occs), pch = 20 + presFolds, bg = presFolds + 1, add = TRUE)
legend(
	'bottomright',
	legend = c(
		'presence fold 1',
		'presence fold 2',
		'presence fold 3',
		'background fold 1',
		'background fold 2',
		'background fold 3'
	),
	# one symbol, color, and fill per legend entry (6 entries)
	pch = c(21, 22, 23, 3, 3, 3),
	col = c(rep('black', 3), 2, 3, 4),
	pt.bg = c(2, 3, 4, NA, NA, NA)
)