clean_dataset {CoordinateCleaner} | R Documentation |
Coordinate Cleaning using Dataset Properties
Description
Tests for problems associated with coordinate conversions and rounding, based on dataset properties. Includes tests to identify contributing datasets with potential errors in converting ddmm to dd.dd, and periodicity in the data decimals indicating rounding or a raster basis linked to low coordinate precision. Specifically:
ddmm tests for erroneous conversion from a degree minute format (ddmm) to a decimal degree (dd.dd) format
periodicity tests for periodicity in the data, which can indicate imprecise coordinates due to rounding or rasterization.
Usage
clean_dataset(
x,
lon = "decimalLongitude",
lat = "decimalLatitude",
ds = "dataset",
tests = c("ddmm", "periodicity"),
value = "dataset",
verbose = TRUE,
...
)
Arguments
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ds |
a character string. The column with the dataset of each record. In
case |
tests |
a vector of character strings, indicating which tests to run. See details for all tests available. Default = c("ddmm", "periodicity") |
value |
a character string. Defining the output value. See value. Default = “dataset”. |
verbose |
logical. If TRUE reports the name of the test and the number of records flagged. |
... |
additional arguments to be passed to |
Details
These tests use the statistical distribution of coordinates and their decimals within datasets of geographic distribution records to identify datasets with potential errors or biases. Three potential error sources can be identified.
The ddmm flag tests for the particular pattern that emerges if geographical coordinates in a degree minute annotation are transferred into decimal degrees by simply replacing the degree symbol with the decimal point. This kind of problem has been observed in older datasets first recorded on paper using typewriters, where, for example, a floating point was used as the symbol for degrees. The function uses a binomial test to check if more records than expected have decimals below 0.6 (which is the maximum that can be obtained in minutes, as one degree has 60 minutes) and if the number of these records is higher than those above 0.59 by a certain proportion.
The periodicity test uses rate estimation in a Poisson process to estimate if there is periodicity in the decimals of a dataset (as would be expected from, for example, rounding or data that were collected in a raster format) and if there is a disproportionately large number of records with the decimal 0 (full degrees), which indicates rounding and thus low precision. The default values were empirically optimized with GBIF data, but should probably be adapted.
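The two patterns described above can be made concrete with a small base R sketch. It is an illustration under stated assumptions, not the code that clean_dataset runs: the helper names decimal_part, below_06_pvalue and zero_share are hypothetical, the data are simulated, and a plain binom.test on the share of decimals below 0.6 stands in for the package's own test and thresholds.

```r
# Illustrative sketch only; this is not the code CoordinateCleaner runs.
set.seed(1)
n <- 1000

# Fractional part of a coordinate, ignoring the sign (hypothetical helper)
decimal_part <- function(v) abs(v) - floor(abs(v))

# (1) ddmm pattern
# Genuine decimal degrees: decimals spread over [0, 1)
unbiased <- runif(n, min = -42, max = -40)

# Degree-minute values typed as decimals (41 deg 35 min becomes 41.35),
# so the decimal part never reaches 0.60
deg <- sample(40:41, n, replace = TRUE)
mins <- sample(0:59, n, replace = TRUE)
ddmm_error <- -(deg + mins / 100)

# One-sided binomial test: are more decimals below 0.6 than the 60 percent
# expected if decimals were uniformly distributed? (hypothetical helper)
below_06_pvalue <- function(v) {
  d <- decimal_part(v)
  binom.test(sum(d < 0.6), length(d), p = 0.6,
             alternative = "greater")$p.value
}

p_unbiased <- below_06_pvalue(unbiased)
p_ddmm <- below_06_pvalue(ddmm_error)

# (2) rounding pattern
# 30 percent of the records rounded to full degrees, the rest left untouched
rounded <- c(round(runif(300, min = -42, max = -40), 0),
             runif(700, min = -42, max = -40))

# Share of records that sit exactly on a full degree (hypothetical helper)
zero_share <- function(v) mean(decimal_part(v) == 0)

share_unbiased <- zero_share(unbiased)
share_rounded <- zero_share(rounded)
```

In the simulated ddmm_error data every decimal falls below 0.6, so p_ddmm is vanishingly small, and share_rounded is far above share_unbiased. Real datasets mix clean and affected records, which is why the sensitivity settings matter.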
Value
Depending on the ‘value’ argument:
- “dataset”: a data.frame with the test summary statistics for each dataset in x.
- “clean”: a data.frame containing only records from datasets in x that passed the tests.
- “flagged”: a logical vector of the same length as rows in x, with TRUE = test passed and FALSE = test failed/potentially problematic.
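How the three output forms relate to each other can be sketched in base R. This is a minimal illustration that does not call the package: the named vector passed is a hypothetical stand-in for the per-dataset verdict, and the records and their coordinates are invented for the example.

```r
# Hypothetical dataset-level verdicts (TRUE = passed); a stand-in for the
# per-dataset summary, not actual clean_dataset output
passed <- c(clean = TRUE, biased = FALSE)

# Invented records belonging to the two datasets
records <- data.frame(
  dataset = c("clean", "biased", "clean", "biased"),
  decimalLongitude = c(-41.37, -41.00, -40.82, -42.00),
  decimalLatitude = c(-11.52, -12.00, -10.14, -11.00),
  stringsAsFactors = FALSE
)

# Record-level view: each record inherits the verdict of its dataset
flagged <- unname(passed[records$dataset])

# Cleaned view: keep only records from datasets that passed
cleaned <- records[flagged, ]
```

The point of the sketch is that the tests operate on whole datasets, so a record is kept or dropped according to the dataset it belongs to, not according to its own coordinates.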
Note
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
See Also
Other Wrapper functions: clean_coordinates(), clean_fossils()
Examples
#Create test dataset
clean <- data.frame(dataset = rep("clean", 1000),
decimalLongitude = runif(min = -43, max = -40, n = 1000),
decimalLatitude = runif(min = -13, max = -10, n = 1000))
bias.long <- c(round(runif(min = -42, max = -40, n = 500), 1),
round(runif(min = -42, max = -40, n = 300), 0),
runif(min = -42, max = -40, n = 200))
bias.lat <- c(round(runif(min = -12, max = -10, n = 500), 1),
round(runif(min = -12, max = -10, n = 300), 0),
runif(min = -12, max = -10, n = 200))
bias <- data.frame(dataset = rep("biased", 1000),
decimalLongitude = bias.long,
decimalLatitude = bias.lat)
test <- rbind(clean, bias)
## Not run:
#run clean_dataset
flags <- clean_dataset(test)
#check problems
#clean
hist(test[test$dataset == rownames(flags[flags$summary,]), "decimalLongitude"])
#biased
hist(test[test$dataset == rownames(flags[!flags$summary,]), "decimalLongitude"])
## End(Not run)