check_taxonomy {fossilbrush}    R Documentation

check_taxonomy

Description

Wrapper function implementing a multi-step cleaning routine for hierarchically structured taxonomic data.

The first part of the routine calls format_check to perform a few presumptive checks on all columns, scanning for non-letter characters and checking the number of words in each string. If clean_name is TRUE, clean_name is also called to ensure correct formatting, as this improves downstream checking.

The second part of the routine calls spell_check to flag spelling discrepancies between names within a given taxonomic group. If chosen, the function can automatically impose the more frequent spelling.

The third part of the routine calls discrete_ranks to flag name re-use at different taxonomic levels. Some of these cases may arise when a name has been unfortunately (although permissibly) used to refer to groups at different taxonomic levels, or where a higher classification has been inserted as a placeholder for a missing lower classification.

The fourth part of the routine calls find_duplicates to flag variable higher classifications for a given taxon, including cases where a higher classification is missing for one instance of a taxon but present for the others. If chosen, resolve_duplicates is called to impose a consistent classification. Where a name has been re-used at the same rank for genuinely different taxa (not permissible, unlike name re-use at different ranks), suffixes are added as capital letters, e.g. TaxonA, TaxonB.

If any of the automatic cleaning routines are employed (clean_name, clean_spell or resolve_duplicates), the function will return a cleaned version of the dataset. If the suffixes added by resolve_duplicates are not desirable, the function behaviour can be altered so that any suffixes are dropped before returning (see the append argument).
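As a rough illustration of what the third and fourth parts of the routine look for, the base R sketch below flags name re-use across ranks and conflicting higher classifications in a toy table. The data and variable names are hypothetical, and this is not the implementation used by discrete_ranks or find_duplicates.

```r
# Hypothetical toy data, columns in order of decreasing taxonomic rank
tax <- data.frame(
  family = c("Productidae", "Productidae", "Spiriferidae", "Lingulidae"),
  genus  = c("Productus",   "Productus",   "Productus",    "Lingulidae"),
  stringsAsFactors = FALSE
)

# Name re-use at different ranks: names that occur in both columns
cross_rank <- intersect(tax$family, tax$genus)

# Variable higher classification: genera assigned to more than one family
n_fam <- tapply(tax$family, tax$genus, function(f) length(unique(f)))
conflicting <- names(n_fam)[n_fam > 1]

print(cross_rank)
print(conflicting)
```

Here "Lingulidae" is a family name standing in for a missing genus, and "Productus" appears under two different families.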

Usage

check_taxonomy(
  x,
  ranks = c("phylum", "class", "order", "family", "genus"),
  species = FALSE,
  species_sep = NULL,
  routine = c("format_check", "spell_check", "discrete_ranks", "find_duplicates"),
  report = TRUE,
  verbose = TRUE,
  clean_name = FALSE,
  clean_spell = FALSE,
  thresh = NULL,
  resolve_duplicates = FALSE,
  append = TRUE,
  term_set = NULL,
  collapse_set = NULL,
  jw = 0.1,
  str = 1,
  str2 = NULL,
  alternative = "jaccard",
  q = 1,
  pref_set = NULL,
  suff_set = NULL,
  exclude_set = NULL,
  jump = 3,
  plot = FALSE
)

Arguments

x

A dataframe with hierarchically organised taxonomic information. If x only comprises the taxonomic information, ranks does not need to be specified, but the columns must be in order of decreasing taxonomic rank

ranks

The column names of the taxonomic data fields in x. These must be provided in order of decreasing taxonomic rank

species

A logical indicating if x contains a species column. As the data must be supplied in hierarchical order, this column will naturally be the last column in x and species-specific spell checks will be performed on this column. NOTE that for the function to work, the species name must be the full species name rather than just the specific epithet, e.g., 'Tyto_alba' rather than just 'alba'.

species_sep

A character vector of length one specifying the separator between the genus name and specific epithet in the species column
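A minimal base R sketch of preparing a species column in the form described above, assuming the genus names and specific epithets are held in separate vectors. The vectors and the choice of "_" as separator are illustrative only.

```r
# Hypothetical genus names and specific epithets
genus   <- c("Tyto", "Lingula")
epithet <- c("alba", "anatina")

# The separator that would then be supplied as species_sep
sep <- "_"

# Full species names, e.g. "Tyto_alba" rather than just "alba"
species <- paste(genus, epithet, sep = sep)

# Splitting on the same separator recovers genus and epithet
parts <- strsplit(species, sep, fixed = TRUE)

print(species)
```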

  • Flagging routine arguments

routine

A character vector determining the flagging and cleaning routines to employ. Valid values are format_check (check for non-letter characters and the number of words in names), spell_check (flag potential spelling errors), discrete_ranks (check that taxonomic names are unique to their rank) and find_duplicates (flag conflicting higher classifications of a given taxon)

report

A logical of length one determining if the flagging outputs of each cleaning routine should be returned to the user for inspection. This is different to verbose, which controls whether flagging should additionally be reported to the user on the console

verbose

A logical determining if function progress and flagged errors should be reported to the console
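The sketch below shows how the flagging arguments might be combined to run only a subset of routines without console output. It uses only the argument names and default values listed under Usage and the brachios dataset from the Examples. The subsample size and the object names are arbitrary choices, and the call is skipped if fossilbrush is not installed.

```r
# Routine names as listed in the Usage section
valid_routines <- c("format_check", "spell_check", "discrete_ranks", "find_duplicates")

# Run the formatting and spelling checks only
routines <- c("format_check", "spell_check")
stopifnot(all(routines %in% valid_routines))

if (requireNamespace("fossilbrush", quietly = TRUE)) {
  library(fossilbrush)
  data("brachios")
  set.seed(1)
  b <- brachios[sample(nrow(brachios), 200), ]
  flags <- check_taxonomy(
    b,
    ranks = c("phylum", "class", "order", "family", "genus"),
    routine = routines,
    report = TRUE,    # return the flagging outputs
    verbose = FALSE   # suppress console reporting
  )
  print(names(flags))
}
```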

  • Cleaning routine arguments

clean_name

If TRUE, the function will return cleaned versions of the columns in x using the routines in clean_name. These routines can be altered using the term_set and collapse_set arguments.

clean_spell

If TRUE, the function will return a cleaned version of the supplied taxonomic dataframe in which the names in each pair of flagged synonyms are automatically updated to the more frequent spelling. Pairs are resolved using thresh, the threshold for the similarity method given by method2. Automatic resolution is not recommended, so this argument is FALSE by default and thresh is left as NULL

thresh

The threshold for the similarity method given by method2, below which flagged pairs of names will be considered synonyms and resolved automatically. See spell_check for details on method2

resolve_duplicates

If TRUE, the function will return a cleaned version of the supplied taxonomic dataframe, using resolve_duplicates to resolve conflicts in the way documented by that function. clean_spell and resolve_duplicates can both be TRUE to return a dataset cleaned by both methods

append

If TRUE, any suffixes used during cleaning will be retained in the cleaned version of the data. This is preferable as it ensures that all taxonomic names are rank-discrete and uniquely classified
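To make the idea of imposing the more frequent spelling concrete, here is a minimal base R sketch for a single flagged pair. The names and the pair are hypothetical, and this is only an illustration of the principle, not the code run when clean_spell is TRUE.

```r
# Hypothetical genus column containing one misspelling
genus <- c("Productus", "Productus", "Productus", "Produtcus", "Lingula")

# A pair of names assumed to have been flagged as potential synonyms
pair <- c("Productus", "Produtcus")

# Count occurrences of each spelling and keep the more frequent one
counts <- table(genus)[pair]
keep <- pair[which.max(counts)]
drop <- setdiff(pair, keep)

# Replace the rarer spelling with the more frequent one
genus[genus %in% drop] <- keep

print(genus)
```

In this sketch which.max takes the first name of the pair if the counts are tied, which is one reason such pairs are usually better inspected by hand.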

  • Routine specific arguments

term_set

A character vector of terms (to be used at all ranks) or a list of rank-specific terms, which will be supplied element-wise to clean_name. If a list, this should be given in descending rank order

collapse_set

A character vector of character strings (to be used at all ranks) or a list of rank-specific strings, which will be supplied element-wise as the collapse argument of clean_name. If a list, this should be given in descending rank order

jw

Passed to spell_check

str

Passed to spell_check

str2

Passed to spell_check

alternative

Passed to spell_check

q

Passed to spell_check

pref_set

A character vector of prefixes (which will be used at all ranks) or a list of rank-specific prefixes, which will be supplied element-wise as the pref argument of spell_check. If a list, this should be given in descending rank order.

suff_set

A character vector of suffixes (which will be used at all ranks) or a list of rank-specific suffixes, which will be supplied element-wise as the suff argument of spell_check. If a list, this should be given in descending rank order.

exclude_set

A character vector of terms to exclude (which will be used at all ranks) or a list of rank-specific exclusion terms, which will be supplied element-wise as the exclude argument of spell_check. If a list, this should be given in descending rank order.

jump

Passed to resolve_duplicates

plot

Passed to resolve_duplicates
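The *_set arguments accept either a single character vector (used at all ranks) or a list with one element per rank. The base R sketch below shows one way the two forms could be brought into a common per-rank shape. expand_set is a hypothetical helper written for illustration and is not part of fossilbrush.

```r
ranks <- c("phylum", "class", "order", "family", "genus")

# Hypothetical helper: return a named list with one element per rank
expand_set <- function(set, ranks) {
  # NULL: nothing supplied at any rank
  if (is.null(set)) {
    return(setNames(vector("list", length(ranks)), ranks))
  }
  # A plain character vector is recycled to every rank
  if (!is.list(set)) {
    set <- rep(list(set), length(ranks))
  }
  # A list must have one element per rank, in descending rank order
  stopifnot(length(set) == length(ranks))
  setNames(set, ranks)
}

# Same suffixes at all ranks
suff_all <- expand_set(c("ina", "ella"), ranks)

# Suffixes at genus level only
suff_rank <- expand_set(list(NULL, NULL, NULL, NULL, c("ina", "ella", "etta")), ranks)

print(names(suff_rank))
```

Naming the elements by rank makes it easier to spot a list supplied in the wrong order before it is passed as pref_set, suff_set or exclude_set.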

Details

Value

A list with elements corresponding to the outputs of the chosen flagging routines (four by default: $formatting, $synonyms, $ranks, $duplicates), plus a cleaned version of the data ($data) if any of clean_name, clean_spell or resolve_duplicates are TRUE. See format_check, spell_check, discrete_ranks and find_duplicates for details of the structure of the flagging outputs

See Also

format_check, clean_name, spell_check, discrete_ranks, find_duplicates, resolve_duplicates
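The returned list can be inspected as sketched below. The element names are those given under Value. The subsample size and object names are arbitrary, and the block is skipped if fossilbrush is not installed.

```r
if (requireNamespace("fossilbrush", quietly = TRUE)) {
  library(fossilbrush)
  data("brachios")
  set.seed(1)
  b <- brachios[sample(nrow(brachios), 200), ]
  b_ranks <- c("phylum", "class", "order", "family", "genus")

  # Request the cleaned data in addition to the flagging outputs
  res <- check_taxonomy(b, ranks = b_ranks, clean_name = TRUE, verbose = FALSE)

  # Flagging outputs, plus $data because clean_name is TRUE
  print(names(res))

  # Look at the cleaned data only if it was returned
  if (!is.null(res$data)) {
    print(head(res$data))
  }
}
```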

Examples

# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(nrow(brachios), 1000), ]
# define the taxonomic ranks used in the dataset (re-used elsewhere)
b_ranks <- c("phylum", "class", "order", "family", "genus")
# define a list of suffixes to be used at each taxonomic level when scanning for synonyms
# (one element per rank, in descending rank order)
b_suff <- list(NULL, NULL, NULL, NULL, c("ina", "ella", "etta"))
# scan for errors; the result is a list of flagging outputs, so it is stored under a new name
b_flags <- check_taxonomy(brachios, suff_set = b_suff, ranks = b_ranks)

[Package fossilbrush version 1.0.5 Index]