R: Detect outliers in a set of geographical coordinates

outliers.detect {gecko}

R Documentation

Detect outliers in a set of geographical coordinates

Description

This function generates pseudo-absences from an input data.frame of latitude and longitude coordinates, using environmental data. It then uses both presences and pseudo-absences to train an SVM model that flags possible outliers for a given species.

Usage

outliers.detect(
  longlat,
  training = NULL,
  hi_res = TRUE,
  crop = FALSE,
  threshold = 0.05,
  method = "all"
)

Arguments

`longlat`	data.frame. Two columns containing longitude and latitude, describing the locations of a species. These locations may contain outliers.
`training`	data.frame. With the same formatting as `longlat`, indicating only known locations where a target species occurs. Used exclusively as training data for method 'svm'.
`hi_res`	logical. Specifies whether 1 km resolution environmental data should be used. If `FALSE`, 10 km resolution data is used instead.
`crop`	logical. Indicates whether environmental data should be cropped to an extent similar to that of `longlat` and `training`. Useful to avoid long processing times at higher resolutions.
`threshold`	numeric. Threshold for classifying outliers in methods `"geo"` and `"env"`. E.g., under the default of 0.05, points whose average distance to the other points is greater than the 95th percentile of the average distances of all points will be classified as outliers.
`method`	A string specifying the outlier detection method. `"geo"` calculates the Euclidean distance between point coordinates and classifies as outliers those beyond the cutoff set by `threshold`. `"env"` performs the same calculation but uses the environmental data extracted at those points instead. `"svm"` uses the dataset given to `longlat` and its corresponding extracted environmental data to train a support vector machine model that then predicts outliers. The default is `"all"`.
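
The distance-based rule described for `"geo"` and `threshold` can be illustrated with a minimal base R sketch. This is not gecko's internal code: the helper name `flag_geo_outliers` is hypothetical, and the choice to average each point's distances over all other points is an assumption made for illustration.

```r
# Hypothetical helper, not part of gecko: flags points whose mean Euclidean
# distance to all other points exceeds the (1 - threshold) quantile of those
# mean distances.
flag_geo_outliers <- function(longlat, threshold = 0.05) {
  d <- as.matrix(dist(longlat))             # pairwise Euclidean distances
  mean_dist <- rowSums(d) / (nrow(d) - 1)   # average, excluding the zero self-distance
  cutoff <- quantile(mean_dist, probs = 1 - threshold)
  mean_dist > cutoff
}

set.seed(1)
# Twenty clustered records plus one deliberately distant record
pts <- data.frame(X = runif(20, -17.1, -17.05), Y = runif(20, 32.73, 32.76))
pts <- rbind(pts, data.frame(X = -8.6, Y = 41.15))

flags <- flag_geo_outliers(pts, threshold = 0.05)
which(flags)  # row indices of the flagged records
```

A quantile-based rule of this kind always flags roughly the top `threshold` share of points, so it ranks records by isolation rather than testing them against a fixed distance.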

Details

The environmental data used is WorldClim, which requires a long download; see gecko::gecko.setDir(). This function is heavily based on the methods described in Liu et al. (2017). There the authors describe SVM_pdSDM, a pseudo-SDM method similar to a two-class presence-only SVM that is capable of using pseudo-absence points, implemented with the ksvm function in the R package kernlab. It is suggested that, for each set of "n" occurrence records, "2 * n" pseudo-absence points are generated. When using this function, keep in mind works that highlight its limitations, such as Meynard et al. (2019). See the References section.
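
The "2 * n" ratio can be illustrated with a short base R sketch that builds a labelled training set. It is only a stand-in under stated assumptions: the helper `sample_pseudo_absences` is hypothetical, and it draws points uniformly from the bounding box of the presences, whereas the function described here derives pseudo-absences from environmental data.

```r
# Hypothetical helper, not part of gecko: draws ratio * n random points from
# the bounding box of the presence records (a simplification for illustration).
sample_pseudo_absences <- function(presences, ratio = 2) {
  n <- nrow(presences) * ratio
  data.frame(
    X = runif(n, min(presences$X), max(presences$X)),
    Y = runif(n, min(presences$Y), max(presences$Y))
  )
}

set.seed(42)
presences <- data.frame(X = runif(10, -17.1, -17.05), Y = runif(10, 32.73, 32.76))
absences <- sample_pseudo_absences(presences)

# Label presences as 1 and pseudo-absences as 0 to form a two-class training set
training_set <- rbind(
  cbind(presences, presence = 1),
  cbind(absences, presence = 0)
)
table(training_set$presence)
```

A two-class data set of this shape is what a classifier such as kernlab's ksvm would then be trained on.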

Value

A list if method = "all", indicating for each point whether it was classified as an outlier (TRUE or FALSE), along with the confusion matrix for the training data. If method = "geo" or method = "env", a data.frame is returned.

References

Liu, C., White, M. and Newell, G. (2017) ‘Detecting outliers in species distribution data’, Journal of Biogeography, 45(1), pp. 164–176. doi:10.1111/jbi.13122.

Meynard, C.N., Kaplan, D.M. and Leroy, B. (2019) ‘Detecting outliers in species distribution data: Some caveats and clarifications on a virtual species study’, Journal of Biogeography, 46(9), pp. 2141–2144. doi:10.1111/jbi.13626.

Examples

## Not run: 
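# Records to screen for outliers
new_occurences = gecko.data("records")
# Ten simulated known locations, used as training data
old_occurences = data.frame(X = runif(10, -17.1, -17.05), Y = runif(10, 32.73, 32.76))
# method defaults to "all"
outliers.detect(new_occurences, old_occurences)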

## End(Not run)

[Package gecko version 1.0.0 Index]