purge {insect} | R Documentation |
Identify and remove erroneous reference sequences.
Description
This function evaluates a DNA reference database (a "DNAbin" object) and removes any sequences whose taxonomic metadata appear to be inconsistent with those of their most closely related sequences.
Usage
purge(x, db, level = "order", confidence = 0.8, cores = 1, quiet = FALSE, ...)
Arguments
x |
a DNAbin list object whose names include taxonomic identification numbers
(see |
db |
a valid taxonomy database containing the taxonomic identification numbers
included in the "names" attribute of the primary input object (a data.frame object;
see |
level |
character string giving the taxonomic level at which heterogeneity within a cluster will flag a sequence as potentially erroneous. This should be a recognized rank within the taxonomy database. |
confidence |
numeric, the minimum confidence value for a sequence to be purged.
For example, if |
cores |
integer giving the number of processors for multithreading. Defaults to 1.
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
quiet |
logical; if TRUE, progress messages are not printed to the console. |
... |
further arguments to pass to |
Details
This function first clusters the sequence dataset into operational
taxonomic units (OTUs) based on a given genetic similarity threshold
using the otu function from the kmer package.
Each cluster is then checked for taxonomic homogeneity at a given rank,
and any sequences that appear out of place are removed.
A sequence is removed only if at least two other independent
studies contradict the taxonomic metadata attributed to it.
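The homogeneity check described above can be sketched in base R. This is an illustration only, not the package's implementation: the toy data frame, the helper name flag_outliers, the min_studies parameter, and the reading of confidence as the fraction of other cluster members that disagree are all assumptions made for the example.

```r
# Toy data (hypothetical): cluster membership, order-level label and source
# study for seven reference sequences.
seqs <- data.frame(
  id      = paste0("seq", 1:7),
  cluster = c(1, 1, 1, 1, 2, 2, 2),
  order   = c("Cetacea", "Cetacea", "Cetacea", "Carnivora",
              "Primates", "Primates", "Rodentia"),
  study   = c("A", "B", "C", "D", "E", "E", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag a sequence when at least `min_studies` other
# studies in its cluster report a different taxon AND the disagreeing share
# of the other cluster members reaches `confidence` (assumed interpretation).
flag_outliers <- function(d, confidence = 0.8, min_studies = 2) {
  flagged <- logical(nrow(d))
  for (i in seq_len(nrow(d))) {
    others <- d[d$cluster == d$cluster[i] & d$id != d$id[i], , drop = FALSE]
    if (nrow(others) == 0) next
    disagree <- others$order != d$order[i]
    # Count distinct studies, other than the sequence's own, that disagree.
    contradicting <- others$study[disagree & others$study != d$study[i]]
    n_studies <- length(unique(contradicting))
    flagged[i] <- n_studies >= min_studies && mean(disagree) >= confidence
  }
  flagged
}

flagged <- flag_outliers(seqs)
purged  <- seqs[!flagged, ]
```

In this toy set only seq4 is flagged: three separate studies place its cluster mates in a different order. seq7 also disagrees with its cluster, but both contradicting sequences come from a single study, so it is kept.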
Value
a "DNAbin" object.
Author(s)
Shaun Wilkinson
Examples
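library(insect)
# Load the example whale reference sequences and their taxonomy database
data(whales)
data(whale_taxonomy)
# threshold and method are not arguments of purge itself;
# they are passed on through ...
whales <- purge(whales, db = whale_taxonomy, level = "species",
                threshold = 0.97, method = "farthest")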
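
The cores argument may also be a 'cluster' object. The following is a sketch of one way to do that, assuming the cluster is created with makeCluster from the parallel package and that the insect package and its example data are installed; the worker count of 2 and the object names cl and result are arbitrary choices for the illustration.

```r
library(parallel)

# Start a small socket cluster; two workers is an arbitrary choice.
cl <- makeCluster(2)

result <- tryCatch({
  if (requireNamespace("insect", quietly = TRUE)) {
    data(whales, package = "insect")
    data(whale_taxonomy, package = "insect")
    # Pass the cluster object in place of an integer core count.
    insect::purge(whales, db = whale_taxonomy, level = "species",
                  threshold = 0.97, method = "farthest", cores = cl)
  } else {
    NULL  # insect not installed: nothing to run
  }
}, finally = stopCluster(cl))  # the caller made the cluster, so the caller closes it
```

Wrapping the call in tryCatch with a finally clause closes the socket connections even if the purge step stops with an error.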