dupeSummary {BeeBDC}    R Documentation
Identifies duplicate occurrence records
Description
This function uses user-specified inputs and columns to identify duplicate occurrence records. Duplicates are identified iteratively; at the end of the function they are tallied, duplicate pairs are clustered, and the records are sorted. The function is designed to work with Darwin Core data containing a database_id column, but it can be modified to work with other columns.
Usage
dupeSummary(
data = NULL,
path = NULL,
duplicatedBy = NULL,
completeness_cols = NULL,
idColumns = NULL,
collectionCols = NULL,
collectInfoColumns = NULL,
CustomComparisonsRAW = NULL,
CustomComparisons = NULL,
sourceOrder = NULL,
prefixOrder = NULL,
dontFilterThese = c(".gridSummary", ".lonFlag", ".latFlag", ".uncer_terms",
".uncertaintyThreshold", ".unLicensed"),
characterThreshold = 2,
numberThreshold = 3,
numberOnlyThreshold = 5,
catalogSwitch = TRUE
)
Arguments
data
A data frame or tibble. Occurrence records as input.
path
A character path to the location where the duplicateRun_ file will be saved.
duplicatedBy
A character vector. Options are c("ID", "collectionInfo", "both"). "ID" runs through a series of ID-only columns defined by idColumns. "collectionInfo" runs through a series of columns defined by collectInfoColumns, which are checked in combination with collectionCols. "both" runs both of the above.
completeness_cols
A character vector. A set of columns that are used to order and select duplicates by. For each occurrence, this function will calculate the sum of complete (non-NA) values across these columns; within a duplicate cluster, this completeness score is used to order records and choose which duplicate is kept. See the completeness sketch after this argument list.
idColumns
A character vector. The columns to be checked individually for internal duplicates. Intended for use with ID columns only.
collectionCols
A character vector. The columns to be checked in combination with each of the collectInfoColumns.
collectInfoColumns
A character vector. The columns to be checked in combination with all of the collectionCols columns.
CustomComparisonsRAW
A list of character vectors. Custom comparisons, given as a list of columns to iteratively compare for duplicates. These differ from the CustomComparisons in that they ignore the minimum number and character thresholds for IDs.
CustomComparisons
A list of character vectors. Custom comparisons, given as a list of columns to iteratively compare for duplicates. These comparisons are made after character and number thresholds are accounted for in ID columns.
sourceOrder
A character vector. The order in which you want to KEEP duplicated records based on the dataSource column (i.e., the order in which to prioritise data sources). NOTE: these dataSources are simplified to the string prior to the first "_". Hence, "GBIF_Anthophyla" becomes "GBIF".
prefixOrder
A character vector. Like sourceOrder, except based on the database_id prefix rather than the dataSource. Additionally, this is only examined if prefixOrder != NULL. Default = NULL.
dontFilterThese
A character vector. This should contain the flag columns to be ignored in the creation or updating of the .summary column. Passed to summaryFun().
characterThreshold
Numeric. The complexity threshold for ID letter length. This is the minimum number of characters that need to be present, in ADDITION TO the numberThreshold, for an ID number to be tested for duplicates. Ignored by CustomComparisonsRAW. The columns that are checked are occurrenceID, recordId, id, catalogNumber, and otherCatalogNumbers. Default = 2.
numberThreshold
Numeric. The complexity threshold for ID number length. This is the minimum number of numeric characters that need to be present, in ADDITION TO the characterThreshold, for an ID number to be tested for duplicates. Ignored by CustomComparisonsRAW. The columns that are checked are occurrenceID, recordId, id, catalogNumber, and otherCatalogNumbers. Default = 3.
numberOnlyThreshold
Numeric. As numberThreshold, except the characterThreshold is ignored. Default = 5. See the threshold sketch after this argument list.
catalogSwitch
Logical. If TRUE and the catalogNumber is empty, the function will copy otherCatalogNumbers into catalogNumber and vice versa. Hence, the function will attempt to match more catalog numbers, as both of these columns can be problematic. Default = TRUE. See the catalogSwitch sketch after this argument list.
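As an aside, the completeness score described under completeness_cols can be pictured with a minimal sketch; this only illustrates the idea of counting non-NA values and is not the package's internal code (the data and column names follow the Examples below).

# Number of non-NA values per record across the completeness columns
completeness_cols <- c("decimalLatitude", "decimalLongitude",
                       "scientificName", "eventDate")
completenessScore <- rowSums(!is.na(BeeBDC::beesFlagged[, completeness_cols]))
# Within a duplicate cluster, more complete records are preferred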
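The ID-complexity thresholds (characterThreshold, numberThreshold, and numberOnlyThreshold) can be illustrated in the same spirit; this sketch applies the rule as described above and is not the function's internal implementation.

# Count letters and digits in a hypothetical catalog-number string
id <- "MCZ-ENT00012345"
nLetters <- nchar(gsub("[^A-Za-z]", "", id))
nDigits <- nchar(gsub("[^0-9]", "", id))
# The ID is complex enough to be tested for duplicates if it has at least
# characterThreshold letters AND numberThreshold digits, or at least
# numberOnlyThreshold digits on its own (defaults 2, 3, and 5)
(nLetters >= 2 & nDigits >= 3) | nDigits >= 5  # TRUE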
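Similarly, the effect of catalogSwitch = TRUE (filling an empty catalogNumber from otherCatalogNumbers and vice versa) can be sketched on a toy two-row example; this shows the described effect, not the internal code.

# Two toy records, each missing one of the catalog-number fields
occ <- data.frame(catalogNumber = c(NA, "CAT-001"),
                  otherCatalogNumbers = c("ALT-999", NA))
occ$catalogNumber <- ifelse(is.na(occ$catalogNumber),
                            occ$otherCatalogNumbers, occ$catalogNumber)
occ$otherCatalogNumbers <- ifelse(is.na(occ$otherCatalogNumbers),
                                  occ$catalogNumber, occ$otherCatalogNumbers)
occ  # both columns filled, so more catalog numbers can be matched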
Value
Returns data with an additional column called .duplicates, where FALSE occurrences are duplicates and TRUE occurrences are either kept duplicates or unique records. Also exports a .csv to the user-specified location with information about duplicate matching. This file is used by other functions, including manualOutlierFindeR() and chordDiagramR().
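For a brief usage sketch (assuming beesFlagged_out is the object returned in the Examples below), the flag column can be used to retain only the kept or unique records:

kept_records <- dplyr::filter(beesFlagged_out, .duplicates == TRUE)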
See Also
chordDiagramR() for creating a chord diagram to visualise linkages between dataSources, and dupePlotR() to visualise the numbers and proportions of duplicates in each dataSource.
Examples
beesFlagged_out <- dupeSummary(
data = BeeBDC::beesFlagged,
# Should start with paste0(DataPath, "/Output/Report/"), instead of tempdir():
path = paste0(tempdir(), "/"),
# options are "ID","collectionInfo", or "both"
duplicatedBy = "collectionInfo", # I'm only running ID for the first lot because we might
# recover other info later
# The columns to generate completeness info from
completeness_cols = c("decimalLatitude", "decimalLongitude",
"scientificName", "eventDate"),
# idColumns = c("gbifID", "occurrenceID", "recordId","id"),
# The columns to ADDITIONALLY consider when finding duplicates in collectionInfo
collectionCols = c("decimalLatitude", "decimalLongitude", "scientificName", "eventDate",
"recordedBy"),
# The columns to combine, one-by-one with the collectionCols
collectInfoColumns = c("catalogNumber", "otherCatalogNumbers"),
# Custom comparisons - as a list of columns to compare
# RAW custom comparisons do not use the character and number thresholds
CustomComparisonsRAW = dplyr::lst(c("catalogNumber", "institutionCode", "scientificName")),
# Other custom comparisons use the character and number thresholds
CustomComparisons = dplyr::lst(c("gbifID", "scientificName"),
c("occurrenceID", "scientificName"),
c("recordId", "scientificName"),
c("id", "scientificName")),
# The order in which you want to KEEP duplicated records based on data source
# try unique(check_time$dataSource)
sourceOrder = c("CAES", "Gai", "Ecd","BMont", "BMin", "EPEL", "ASP", "KP", "EcoS", "EaCO",
"FSCA", "Bal", "SMC", "Lic", "Arm",
"USGS", "ALA", "GBIF","SCAN","iDigBio"),
# !!!!!! BELS > GeoLocate
# Set the complexity threshold for id letter and number length
# minimum number of characters when WITH the numberThreshold
characterThreshold = 2,
# minimum number of numbers when WITH the characterThreshold
numberThreshold = 3,
# Minimum number of numbers WITHOUT any characters
numberOnlyThreshold = 5)
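
# Two optional follow-up checks, assuming the call above succeeded: tally the
# flag column and confirm the duplicateRun_ report was written to the path.
# FALSE rows are flagged duplicates; TRUE rows are kept or unique records
table(beesFlagged_out$.duplicates)
# The duplicate-matching report exported by the function
list.files(tempdir(), pattern = "duplicateRun_")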