purge {insect}

R Documentation

Identify and remove erroneous reference sequences.

Description

This function evaluates a DNA reference database (a "DNAbin" object) and removes any sequences whose taxonomic metadata appear to be inconsistent with those of their most closely related sequences.

Usage

purge(x, db, level = "order", confidence = 0.8, cores = 1, quiet = FALSE, ...)

Arguments

`x`	a DNAbin list object whose names include taxonomic identification numbers (see `searchGB` for details).
`db`	a valid taxonomy database (a data.frame object; see `taxonomy`) containing the taxonomic identification numbers included in the "names" attribute of the primary input object `x`.
`level`	character string giving the taxonomic level at which heterogeneity within a cluster will flag a sequence as potentially erroneous. This should be a recognized rank within the taxonomy database.
`confidence`	numeric, the minimum confidence value for a sequence to be purged. For example, if `confidence = 0.8` (the default value), a sequence will be purged only if its taxonomy differs from that of at least four other independent sequences in its cluster.
`cores`	integer giving the number of processors for multithreading. Defaults to 1. This argument may alternatively be a 'cluster' object, in which case it is the user's responsibility to close the socket connection at the conclusion of the operation, for example by running `parallel::stopCluster(cores)`. The string 'autodetect' is also accepted, in which case the maximum number of cores to use is one less than the total number of cores available. Note that in this case there may be a tradeoff in terms of speed depending on the number and size of sequences to be processed, due to the extra time required to initialize the cluster.
`quiet`	logical indicating whether progress messages printed to the console should be suppressed.
`...`	further arguments to pass to `otu` (not including `nstart`).
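
As a hedged illustration of the `cores` argument, the sketch below creates a socket cluster with the base `parallel` package, passes it to `purge`, and closes it afterwards. The worker count of 2 and the use of `tryCatch(..., finally = )` are choices made for this example only; the call to `purge` is skipped if the insect package is not installed.

```r
library(parallel)

# Create a socket cluster; 2 workers is an arbitrary choice for this sketch.
cl <- makeCluster(2)

# 'finally' ensures the cluster is closed even if purge() signals an error,
# since closing a user-supplied cluster is the caller's responsibility.
result <- tryCatch({
  if (requireNamespace("insect", quietly = TRUE)) {
    data(whales, package = "insect")
    data(whale_taxonomy, package = "insect")
    insect::purge(whales, db = whale_taxonomy, level = "species", cores = cl)
  } else {
    NULL  # insect not available; nothing to purge
  }
}, finally = stopCluster(cl))
```

For small datasets such as this one, `cores = 1` may well be faster because no cluster needs to be initialized.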

Details

This function first clusters the sequence dataset into operational taxonomic units (OTUs) based on a given genetic similarity threshold, using the `otu` function from the kmer package. Each cluster is then checked for taxonomic homogeneity at the rank given by `level`, and any sequences that appear out of place are removed. A sequence is removed only if at least two other independent studies contradict the taxonomic metadata attributed to it.
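
The per-cluster homogeneity check described above can be sketched in base R as follows. This is a simplified illustration and not the package's implementation: `flag_outliers` and its `min_contradicting` argument are hypothetical names, the cluster memberships and taxon labels are made-up toy data, and the sketch counts contradicting sequences without testing whether they come from independent studies.

```r
# Hypothetical helper (not part of insect): flag sequences whose taxon at the
# chosen rank disagrees with a clear majority taxon in their cluster.
flag_outliers <- function(cluster, taxon, min_contradicting = 2) {
  stopifnot(length(cluster) == length(taxon))
  flagged <- logical(length(taxon))
  for (cl in unique(cluster)) {
    idx <- which(cluster == cl)
    counts <- table(taxon[idx])
    if (length(counts) < 2) next                  # homogeneous cluster
    if (sum(counts == max(counts)) > 1) next      # tie, no clear majority
    if (max(counts) < min_contradicting) next     # too few contradicting sequences
    majority <- names(counts)[which.max(counts)]
    flagged[idx] <- taxon[idx] != majority
  }
  flagged
}

# Toy data: cluster memberships and the order assigned to each sequence.
cluster <- c(1, 1, 1, 1, 2, 2, 3)
taxon <- c("Cetacea", "Cetacea", "Cetacea", "Primates",
           "Cetacea", "Carnivora", "Cetacea")
flagged <- flag_outliers(cluster, taxon)
```

Here only the fourth sequence is flagged: it is the sole "Primates" record in a cluster where three other sequences agree on "Cetacea". The second cluster is left alone because its two members disagree without a majority.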

Value

a "DNAbin" object.

Author(s)

Shaun Wilkinson

Examples

  library(insect)
  data(whales)
  data(whale_taxonomy)
  ## threshold and method are passed on to the otu function via '...'
  whales <- purge(whales, db = whale_taxonomy, level = "species",
                  threshold = 0.97, method = "farthest")

[Package insect version 1.4.2 Index]