outliers.detect {gecko}    R Documentation

Detect outliers in a set of geographical coordinates

Description

This function generates pseudo-absences from an input data.frame of latitude and longitude coordinates by using environmental data. It then uses both presences and pseudo-absences to train an SVM model that flags possible outliers for a given species.

Usage

outliers.detect(
  longlat,
  training = NULL,
  hi_res = TRUE,
  crop = FALSE,
  threshold = 0.05,
  method = "all"
)

Arguments

longlat

data.frame. Two columns containing longitude and latitude, in that order, describing the locations of a species, which may contain outliers.

training

data.frame. Same format as longlat, containing only known locations where the target species occurs. Used exclusively as training data for method "svm".

hi_res

logical. Specifies whether 1 km resolution environmental data should be used. If FALSE, 10 km resolution data are used instead.

crop

logical. Indicates whether environmental data should be cropped to an extent similar to that covered by longlat and training. Useful to avoid long processing times at higher resolutions.

threshold

numeric. Threshold for classifying outliers in methods "geo" and "env". E.g., under the default of 0.05, points whose average distance is greater than the 95th percentile of the average distances of all points are classified as outliers.

method

A string specifying the outlier detection method. "geo" calculates the Euclidean distance between point coordinates and classifies as outliers those beyond the cutoff set by threshold. "env" performs the same calculation but uses the environmental data extracted at those points instead. "svm" uses the dataset given to longlat and its corresponding extracted environmental data to train a support vector machine model that then predicts outliers.
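
The quantile rule described for "geo" and threshold can be sketched in base R as follows. This is only an illustration under stated assumptions, not the gecko implementation: the helper name flag_geo_outliers and the columns avg_dist and outlier are hypothetical, and each point's average is taken over its distances to all other points.

```r
# Illustrative sketch only; flag_geo_outliers is a hypothetical helper,
# not part of gecko.
flag_geo_outliers <- function(longlat, threshold = 0.05) {
  # Pairwise Euclidean distances between all coordinate pairs
  d <- as.matrix(dist(longlat))
  # Average distance from each point to every other point
  # (the zero self-distance on the diagonal is excluded from the mean)
  avg_dist <- rowSums(d) / (nrow(d) - 1)
  # Cutoff at the (1 - threshold) quantile of the average distances
  cutoff <- quantile(avg_dist, probs = 1 - threshold)
  data.frame(longlat, avg_dist = avg_dist, outlier = avg_dist > cutoff)
}

set.seed(1)
# Twenty clustered points plus one deliberately distant point
pts <- data.frame(
  X = c(runif(20, -17.1, -17.05), 10),
  Y = c(runif(20, 32.73, 32.76), 50)
)
res <- flag_geo_outliers(pts)
res[res$outlier, ]
```

Applying the same function to a table of environmental values extracted at each point, instead of coordinates, would approximate the idea behind "env".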

Details

The environmental data used are from WorldClim and require a long download; see gecko::gecko.setDir(). This function is heavily based on the methods described in Liu et al. (2017). There, the authors describe SVM_pdSDM, a pseudo-SDM method similar to a two-class presence-only SVM that is capable of using pseudo-absence points, implemented with the ksvm function in the R package kernlab. It is suggested that, for each set of "n" occurrence records, "2 * n" pseudo-absence points are generated. When using it, keep in mind works highlighting its limitations, such as Meynard et al. (2019). See the References section.
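
The "2 * n" ratio of pseudo-absences to occurrence records can be illustrated with a minimal base R sketch. It rests on a loud assumption: pseudo-absences are sampled uniformly inside the bounding box of the presences, whereas gecko derives them using environmental data. The helper sample_pseudo_absences and the presence column are hypothetical names used only for this example.

```r
# Illustrative sketch only; sample_pseudo_absences is a hypothetical helper.
# Assumption: uniform sampling within the presences' bounding box, which is
# NOT how gecko generates pseudo-absences (it uses environmental data).
sample_pseudo_absences <- function(presences, ratio = 2) {
  n <- nrow(presences)
  data.frame(
    X = runif(ratio * n, min(presences[[1]]), max(presences[[1]])),
    Y = runif(ratio * n, min(presences[[2]]), max(presences[[2]]))
  )
}

set.seed(42)
presences <- data.frame(X = runif(10, -17.1, -17.05),
                        Y = runif(10, 32.73, 32.76))
pseudo_abs <- sample_pseudo_absences(presences)

# Label both classes so they could be passed to a two-class classifier
training_set <- rbind(
  cbind(presences, presence = 1),
  cbind(pseudo_abs, presence = 0)
)
table(training_set$presence)
```

A labelled table of this shape is the kind of input a two-class classifier such as an SVM expects, with the extracted environmental variables taking the place of the raw coordinates.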

Value

list if method = "all", containing, for each point, whether it was classified as an outlier (TRUE or FALSE), along with the confusion matrix for the training data. If method = "geo" or method = "env", a data.frame is returned.

References

Liu, C., White, M. and Newell, G. (2017) ‘Detecting outliers in species distribution data’, Journal of Biogeography, 45(1), pp. 164–176. doi:10.1111/jbi.13122.

Meynard, C.N., Kaplan, D.M. and Leroy, B. (2019) ‘Detecting outliers in species distribution data: Some caveats and clarifications on a virtual species study’, Journal of Biogeography, 46(9), pp. 2141–2144. doi:10.1111/jbi.13626.

Examples

## Not run: 
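# Records to screen for outliers (example dataset shipped with gecko)
new_occurrences <- gecko.data("records")
# Known locations of the species: X = longitude, Y = latitude
old_occurrences <- data.frame(X = runif(10, -17.1, -17.05), Y = runif(10, 32.73, 32.76))
# longlat = new_occurrences, training = old_occurrences
outliers.detect(new_occurrences, old_occurrences)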

## End(Not run)

[Package gecko version 1.0.0 Index]