cd_ddmm {CoordinateCleaner} | R Documentation |
Identify Datasets with a Degree Conversion Error
Description
This test flags datasets where a significant fraction of records has been subject to a common degree minute to decimal degree conversion error, where the degree sign is recognized as decimal delimiter.
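The conversion error described above can be illustrated with a minimal base R sketch. The helper names `dm_to_decimal` and `dm_misread` are hypothetical and not part of CoordinateCleaner; they only contrast a correct degree-minute conversion with one where the degree sign is read as the decimal delimiter.

```r
# Hypothetical helpers for illustration; neither is part of CoordinateCleaner.

# Correct conversion: minutes are sixtieths of a degree.
dm_to_decimal <- function(deg, minutes) deg + minutes / 60

# Faulty conversion: the degree sign is read as the decimal delimiter,
# so the minutes are appended as decimals (12 degrees 45 minutes becomes 12.45).
dm_misread <- function(deg, minutes) deg + minutes / 100

deg <- c(12, 48, 103)
minutes <- c(45, 7, 59)

correct <- dm_to_decimal(deg, minutes)
misread <- dm_misread(deg, minutes)
print(data.frame(deg, minutes, correct, misread))

# Minutes range from 0 to 59, so misread decimals never exceed 0.59.
max_misread_decimal <- max(dm_misread(0, 0:59))
print(max_misread_decimal)
```

Because minutes never reach 60, a dataset converted this way contains no coordinate decimals of 0.6 or above, which is the signature the test looks for.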
Usage
cd_ddmm(
x,
lon = "decimalLongitude",
lat = "decimalLatitude",
ds = "dataset",
pvalue = 0.025,
diff = 1,
mat_size = 1000,
min_span = 2,
value = "clean",
verbose = TRUE,
diagnostic = FALSE
)
Arguments
x |
data.frame. Containing geographical coordinates and species names. |
lon |
character string. The column with the longitude coordinates. Default = “decimalLongitude”. |
lat |
character string. The column with the latitude coordinates. Default = “decimalLatitude”. |
ds |
a character string. The column with the dataset of each record. In
case |
pvalue |
numeric. The p-value for the one-sided t-test to flag the test as passed or not. Both pvalue and diff must be met. Default = 0.025. |
diff |
numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1. |
mat_size |
numeric. The size of the matrix for the binomial test. Must be changed in powers of ten (e.g. 100, 1000, 10000). Adapt to dataset size: generally 100 is better for datasets with < 10000 records, and 1000 is better for datasets with 10000 - 1M records. Higher values also work reasonably well for smaller datasets, hence the default of 1000. For large datasets try 10000. |
min_span |
numeric. The minimum geographic extent of datasets to be tested. Default = 2. |
value |
character string. Defining the output value. See value. |
verbose |
logical. If TRUE, reports the name of the test and the number of records flagged. |
diagnostic |
logical. If TRUE, plots the analysis matrix for each dataset. |
Details
If the degree sign is recognized as decimal delimiter during coordinate
conversion, no coordinate decimals above 0.59 (59') are possible. The test
uses a binomial test to check whether a significant proportion of records in
a dataset has been subject to this problem. The test is best adjusted via
the diff argument: the lower diff, the stricter the test. The appropriate
value also scales with dataset size. Empirically, for datasets with < 5,000
unique coordinate records, diff = 0.1 has proven reasonable, flagging most
datasets with >25% problematic records and all datasets with >50% problematic
records. For datasets with between 5,000 and 100,000 unique geographic
records, diff = 0.01 is recommended; for datasets with between 100,000 and
1 M records, diff = 0.001; and so on.
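The rule of thumb for diff can be written down as a small lookup. This is only a sketch: `suggest_diff` is a hypothetical helper, not a CoordinateCleaner function, the handling of the exact boundary values is a choice made here, and the value returned above 1 M records is an assumed continuation of the "and so on" pattern.

```r
# Hypothetical helper, not part of CoordinateCleaner.
# Maps the number of unique coordinate records to the diff value
# suggested in the Details section.
suggest_diff <- function(n_unique) {
  stopifnot(is.numeric(n_unique), all(is.finite(n_unique)), all(n_unique > 0))
  breaks <- c(0, 5000, 1e5, 1e6, Inf)
  # The last value (above 1 M records) is an assumption that extends
  # the documented pattern by one more power of ten.
  values <- c(0.1, 0.01, 0.001, 0.0001)
  values[findInterval(n_unique, breaks)]
}

# Example: count unique coordinate pairs in a data.frame, then look up diff.
dat <- data.frame(decimalLongitude = c(10.5, 10.5, 20.25),
                  decimalLatitude = c(50.1, 50.1, 40.75))
n_unique <- nrow(unique(dat[, c("decimalLongitude", "decimalLatitude")]))
print(suggest_diff(n_unique))
print(suggest_diff(c(4999, 5000, 250000)))
```

The returned number is meant as a starting point for the diff argument; the recommendation in the text is empirical, so the value may still need tuning for a given dataset.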
Value
Depending on the ‘value’ argument, either a data.frame with summary
statistics and flags for each dataset (“dataset”), a data.frame containing
the records considered correct by the test (“clean”), or a logical vector
(“flags”), with TRUE = test passed and FALSE = test failed/potentially
problematic. Default = “clean”.
Note
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
See Also
Other Datasets:
cd_round()
Examples
# Dataset with uniformly distributed coordinate decimals
clean <- data.frame(species = letters[1:10],
                    decimalLongitude = runif(100, -180, 180),
                    decimalLatitude = runif(100, -90, 90),
                    dataset = "FR")

cd_ddmm(x = clean, value = "flagged")

# Problematic dataset: all coordinate decimals lie below 0.59
lon <- sample(0:180, size = 100, replace = TRUE) + runif(100, 0, 0.59)
lat <- sample(0:90, size = 100, replace = TRUE) + runif(100, 0, 0.59)

prob <- data.frame(species = letters[1:10],
                   decimalLongitude = lon,
                   decimalLatitude = lat,
                   dataset = "FR")

cd_ddmm(x = prob, value = "flagged")
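To see why the two example datasets differ, it can help to look at the coordinate decimals directly. The sketch below uses base R only and does not reproduce the binomial test used by cd_ddmm; `share_low_decimals` is a hypothetical helper, and the data are simulated in the same way as in the examples above, with a larger sample and an arbitrary seed.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical helper: share of coordinates whose decimal part is below 0.6.
share_low_decimals <- function(coord) {
  decimals <- abs(coord) %% 1
  mean(decimals < 0.6)
}

# Uniformly distributed longitudes, as in the clean example
clean_lon <- runif(1000, -180, 180)

# Longitudes with decimals restricted to [0, 0.59), as in the problematic example
prob_lon <- sample(0:180, size = 1000, replace = TRUE) + runif(1000, 0, 0.59)

share_clean <- share_low_decimals(clean_lon)
share_prob <- share_low_decimals(prob_lon)
print(c(clean = share_clean, problematic = share_prob))
```

With uniformly distributed decimals, roughly 60% of values are expected to fall below 0.6, whereas in the simulated problematic data every value does.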