dct_validate {dwctaxon}    R Documentation

Validate a taxonomic database

Description

Runs a series of automated checks on a taxonomic database in Darwin Core (DwC) format.

Usage

dct_validate(
  tax_dat,
  check_taxon_id = dct_options()$check_taxon_id,
  check_tax_status = dct_options()$check_tax_status,
  check_mapping_accepted = dct_options()$check_mapping_accepted,
  check_mapping_parent = dct_options()$check_mapping_parent,
  check_mapping_original = dct_options()$check_mapping_original,
  check_mapping_accepted_status = dct_options()$check_mapping_accepted_status,
  check_sci_name = dct_options()$check_sci_name,
  check_status_diff = dct_options()$check_status_diff,
  check_col_names = dct_options()$check_col_names,
  valid_tax_status = dct_options()$valid_tax_status,
  extra_cols = dct_options()$extra_cols,
  on_success = dct_options()$on_success,
  on_fail = dct_options()$on_fail,
  skip_missing_cols = dct_options()$skip_missing_cols,
  quiet = dct_options()$quiet
)

Arguments

tax_dat

Dataframe; taxonomic database in DwC format.

check_taxon_id

Logical vector of length 1; should all instances of taxonID be required to be non-missing and unique? Default TRUE.

check_tax_status

Logical vector of length 1; should all taxonomic names be required to have a valid value for taxonomic status (by default, "accepted", "synonym", or "variant")? Default TRUE.

check_mapping_accepted

Logical vector of length 1; should all values of acceptedNameUsageID be required to map to the taxonID of an existing name? Default TRUE.

check_mapping_parent

Logical vector of length 1; should all values of parentNameUsageID be required to map to the taxonID of an existing name? Default TRUE.

check_mapping_original

Logical vector of length 1; should all values of originalNameUsageID be required to map to the taxonID of an existing name? Default TRUE.

check_mapping_accepted_status

Logical vector of length 1; should rules about mapping of variants and synonyms be enforced (see Details)? Default FALSE.

check_sci_name

Logical vector of length 1; should all instances of scientificName be required to be non-missing and unique? Default TRUE.

check_status_diff

Logical vector of length 1; should each scientific name be allowed to have only one taxonomic status? Default FALSE.

check_col_names

Logical vector of length 1; should all column names be required to be a valid DwC term? Default TRUE.

valid_tax_status

Character vector of length 1; valid values for taxonomicStatus, separated by commas. Default "accepted, synonym, variant, NA". "NA" indicates that missing (NA) values are valid. Case-sensitive.
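
A minimal sketch of passing a custom valid_tax_status, assuming the default settings of dct_options() are otherwise in effect. The status "provisionally accepted" and the variable names are chosen for illustration only. The data is expected to pass because the extra status is listed explicitly, but the result is not asserted here.

```r
library(dwctaxon)

dat <- dct_filmies

# Relabel the first accepted name with a status outside the default set
i <- which(dat$taxonomicStatus == "accepted")[1]
dat$taxonomicStatus[i] <- "provisionally accepted"

# A single string with comma-separated values; "NA" keeps missing values valid
custom_status <- "accepted, provisionally accepted, synonym, variant, NA"

# on_fail = "summary" avoids stopping with an error if a check fails
res <- dct_validate(
  dat,
  valid_tax_status = custom_status,
  on_success = "logical",
  on_fail = "summary",
  quiet = TRUE
)
```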

extra_cols

Character vector; names of columns that should be allowed beyond those defined by the DwC taxon standard. Default NULL. Providing column name(s) that are valid DwC taxon column(s) has no effect.
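
As a rough illustration of extra_cols, the sketch below adds a non-DwC column to the bundled dct_filmies data and validates it with and without allowing that column. The column name myNotes is hypothetical, and the default settings of dct_options() are assumed.

```r
library(dwctaxon)

# Add a column that is not a DwC taxon term ("myNotes" is a made-up name)
dat <- dct_filmies
dat$myNotes <- "checked by hand"

# With check_col_names = TRUE (the default), the extra column should be rejected
res_default <- try(dct_validate(dat), silent = TRUE)

# Allowing the column via extra_cols should let the data through
res_allowed <- dct_validate(dat, extra_cols = "myNotes", on_success = "logical")
```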

on_success

Character vector of length 1, either "logical" or "data". Describes what to do if the check passes. Default "data".

on_fail

Character vector of length 1, either "error" or "summary". Describes what to do if the check fails. Default "error".

skip_missing_cols

Logical vector of length 1; should checks be silently skipped if any of the columns they inspect are missing? Default FALSE.

quiet

Logical vector of length 1; should warnings be silenced? Default FALSE.

Details

For check_mapping_accepted_status and check_status_diff, "accepted", "synonym", and "variant" are detected by case-sensitive string matching of taxonomicStatus, so "provisionally accepted" is counted as "accepted", "ambiguous synonym" is counted as "synonym", etc.
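
The string matching can be pictured with base R alone. The sketch below is not the package's internal code; it assumes plain substring matching via grepl(), and the status values are made up for illustration.

```r
# Hypothetical taxonomicStatus values (not taken from dct_filmies)
status <- c(
  "accepted",
  "provisionally accepted",
  "ambiguous synonym",
  "Accepted",
  "variant"
)

# Case-sensitive substring matching: fixed = TRUE treats the pattern literally
is_accepted <- grepl("accepted", status, fixed = TRUE)
is_synonym <- grepl("synonym", status, fixed = TRUE)

# "provisionally accepted" matches "accepted"; "Accepted" does not,
# because the match is case-sensitive
data.frame(status, is_accepted, is_synonym)
```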

For check_mapping_accepted_status, the following rules are enforced:

Default settings of all arguments can be modified with dct_options() (see Examples).

Most columns are expected to be vectors of class character, but this is not checked for all columns. Columns (DwC terms) with names including 'ID', for example 'taxonID', may be character, numeric, or integer.

Value

Depends on the result of the check and on values of on_fail and on_success:
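
The sketch below exercises the combinations of on_success and on_fail listed under Arguments, assuming the default settings of dct_options(). The variable names are illustrative, and the shape of the "summary" result is only assumed to be a data frame describing the failed checks.

```r
library(dwctaxon)

# Passing data: on_success chooses between a logical and the input data
ok_logical <- dct_validate(dct_filmies, on_success = "logical")
ok_data <- dct_validate(dct_filmies, on_success = "data")

# Failing data: duplicate one scientific name
bad_dat <- dct_filmies
bad_dat$scientificName[1] <- bad_dat$scientificName[2]

# on_fail = "error" (the default) stops with an error, caught here with try()
res_error <- try(dct_validate(bad_dat, on_fail = "error"), silent = TRUE)

# on_fail = "summary" returns a summary instead of stopping;
# quiet = TRUE silences the accompanying warning
res_summary <- dct_validate(bad_dat, on_fail = "summary", quiet = TRUE)
```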

Examples

# The example dataset dct_filmies is already correctly formatted and passes
# validation
dct_validate(dct_filmies)

# So make some bad data on purpose with a duplicated scientific name
bad_dat <- dct_filmies
bad_dat$scientificName[1] <- bad_dat$scientificName[2]

# The incorrectly formatted data won't pass
try(
  dct_validate(bad_dat)
)

# It will pass if we allow duplicated scientific names though
dct_validate(bad_dat, check_sci_name = FALSE)

# Individual checks can also be turned on or off with dct_options()

# First save the current settings before making any changes
old_settings <- dct_options()

# Let's allow duplicated scientific names by default
dct_options(check_sci_name = FALSE)

# The data passes validation as before, but we don't have to specify
# `check_sci_name = FALSE` in the function call
dct_validate(bad_dat)

# Reset options to those before this example was run
do.call(dct_options, old_settings)

[Package dwctaxon version 2.0.3 Index]