classify {insect} | R Documentation |
Tree-based sequence classification.
Description
"classify"
assigns taxon IDs to DNA sequences using an
informatic sequence classification tree.
Usage
classify(
x,
tree,
threshold = 0.8,
decay = FALSE,
ping = 0.98,
mincount = 5,
offset = 0,
ranks = c("kingdom", "phylum", "class", "order", "family", "genus", "species"),
species = "ping100",
tabulize = FALSE,
metadata = FALSE,
cores = 1
)
Arguments
x |
a sequence or set of sequences. Can be a "DNAbin" or "AAbin" object or a named vector of upper-case DNA character strings. |
tree |
an object of class "insect". |
threshold |
numeric between 0 and 1 giving the minimum Akaike weight for the recursive classification procedure to continue toward the leaves of the tree. Defaults to 0.8. |
decay |
logical indicating whether the decision to terminate the classification process should be made based on decaying Akaike weights (at each node, the Akaike weight of the selected model is multiplied by the Akaike weight of the selected model at the parent node) or whether each Akaike weight should be calculated independently of that of the parent node. Defaults to FALSE (the latter). |
ping |
logical or numeric (between 0 and 1) indicating whether
a nearest neighbor search should
be carried out, and if so,
what the minimum similarity to the nearest neighbor
should be for the recursive classification algorithm to be skipped.
If TRUE and the query sequence is identical to
at least one of the training sequences used to learn the tree,
the common ancestor of the matching training sequences is returned
with a score of NA.
If a value between 0 and 1 is provided, the common ancestor of the
training sequences with similarity greater than or equal to 'ping'
is returned, again with a score of NA.
If |
mincount |
integer, the minimum number of training sequences belonging to a selected child node for the classification to progress. Defaults to 5. |
offset |
log-odds score offset parameter governing whether the minimum score is met at each node. Defaults to 0. Values above 0 increase precision (fewer type I errors), values below 0 increase recall (fewer type II errors). |
ranks |
character vector giving the taxonomic ranks to be
included in the output table. Each must be a valid rank from the
taxonomy database attributed to the classification tree
( |
species |
character string indicating whether to include all
species-level classifications in the output (species = 'all'),
only those generated by exact matching (species = 'ping100'; the default setting),
or only those generated by exact matching or near-neighbor searching
(species = 'ping'). If |
tabulize |
logical indicating whether sequence counts should be attached to the output table. If TRUE, the output table will have one row for each unique sequence, and columns will include counts for each sample (where sample names precede sequence identifiers in the input object; see details below). |
metadata |
logical indicating whether to include additional columns containing the paths, individual node scores and reasons for termination. Defaults to FALSE. Included for advanced use and debugging. |
cores |
integer giving the number of processors for multithreading (defaults to 1).
This argument may alternatively be a 'cluster' object,
in which case it is the user's responsibility to close the socket
connection at the conclusion of the operation,
for example by running |
Details
This function requires a pre-computed classification tree
of class "insect", which is a dendrogram object with additional attributes
(see learn
for details).
Query sequences obtained from the same primer set used to construct
the tree are classified to produce taxonomic
IDs with an associated degree of confidence.
The classification algorithm works as follows:
starting from the root node of the tree,
the log-likelihood of the query sequence
(the log-probability of the sequence given a particular model)
is computed for each of the models occupying the two child nodes using the
forward algorithm (see Durbin et al. (1998)).
The competing likelihood values are then compared by computing
their Akaike weights (Johnson and Omland, 2004).
If one model is overwhelmingly more likely to have produced
the sequence than the other,
that child node is chosen and the classification is updated
to reflect the taxonomic ID stored at the node.
This classification procedure is repeated, continuing down the
tree until either an inconclusive result is returned by a
model comparison test (i.e. the Akaike weight is lower than
the value of threshold, 0.8 by default),
or a terminal leaf node is reached,
at which point a species-level classification is generally returned.
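The model comparison at a single node can be sketched in base R as follows. This is only an illustration, not the package's internal code: the helper akaike_weights() is hypothetical, the log-likelihood values are made up, and the sketch assumes that the competing models have the same number of parameters, so that AIC differences reduce to differences in log-likelihood.

```r
# Hypothetical helper (not part of the insect package): Akaike weights for
# competing models, assuming every model has the same number of parameters,
# so that AIC differences equal -2 times the log-likelihood differences.
akaike_weights <- function(loglik) {
  delta <- -2 * (loglik - max(loglik))  # AIC difference from the best model
  w <- exp(-0.5 * delta)                # relative likelihood of each model
  w / sum(w)                            # normalize so the weights sum to 1
}

# Made-up log-likelihoods of one query sequence under the two child models
loglik <- c(left = -120.3, right = -125.9)
w <- akaike_weights(loglik)

# Continue toward the leaves only if the best model clears the threshold
threshold <- 0.8
best <- names(w)[which.max(w)]
if (max(w) >= threshold) {
  message("descend to the ", best, " child node")
} else {
  message("stop here: the model comparison is inconclusive")
}
```

Subtracting the maximum log-likelihood before exponentiating keeps the calculation numerically stable when the log-likelihoods are large and negative.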
The function outputs a table with one row for each input sequence.
Output table fields include "name" (the unique sequence identifier),
"taxID" (the taxonomic identification number from the taxonomy database),
"taxon" (the name of the taxon),
"rank" (the rank of the taxon, e.g. species, genus, family, etc.),
and "score" (the Akaike weight from the model selection procedure).
Note that if decay = TRUE, the Akaike weight ‘decays’
as it moves down the tree, by computing the cumulative product of
all preceding Akaike weight values.
This reduces the chance of type I taxon ID errors (overclassifications and misclassifications).
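The effect of the decay argument on the stopping decision can be illustrated with a short base R sketch. The node weights below are made up and the helper depth_reached() is hypothetical. The sketch only mimics the comparison against the threshold and is not the package's implementation.

```r
# Made-up Akaike weights of the selected child at successive nodes, root first
node_weights <- c(0.99, 0.95, 0.90, 0.85)
threshold <- 0.8

# decay = FALSE: each weight is compared with the threshold on its own
independent_ok <- node_weights >= threshold

# decay = TRUE: each weight is first multiplied by the weights of its ancestors
decayed <- cumprod(node_weights)
decayed_ok <- decayed >= threshold

# Hypothetical helper: number of nodes accepted before the first failure
depth_reached <- function(ok) {
  if (all(ok)) length(ok) else which(!ok)[1] - 1
}

depth_reached(independent_ok)  # all four nodes pass independently
depth_reached(decayed_ok)      # the cumulative product drops below 0.8 at node 4
```

With these values the independent comparison accepts every node, whereas the decayed weights (0.99, 0.9405, 0.84645, 0.7194825) stop the descent one node earlier.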
The output table also includes the higher taxonomic ranks specified in the
ranks argument, and if metadata = TRUE, additional columns are included called
"path" (the path of the sequence through the classification tree),
"scores" (the scores at each node through the tree, UTF-8-encoded),
and "reason" outlining why the recursive classification procedure was
terminated:
0 reached leaf node
1 failed to meet minimum score threshold at inner node
2 failed to meet minimum score of training sequences at inner node
3 sequence length shorter than minimum length of training sequences at inner node
4 sequence length exceeded maximum length of training sequences at inner node
5 nearest neighbor in training set does not belong to selected node (obsolete)
6 node is supported by too few sequences
7 reserved
8 sequence could not be translated (amino acids only)
9 translated sequence contains stop codon(s) (amino acids only)
Additional columns detailing the nearest neighbor search include "NNtaxID", "NNtaxon", "NNrank", and "NNdistance".
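When inspecting output produced with metadata = TRUE, the termination codes listed above can be translated into readable labels with a named lookup vector. The sketch below is base R only. The reason vector is made up, and it is an assumption that the "reason" column can be coerced to these integer codes.

```r
# Lookup table built from the termination codes listed above
reason_labels <- c(
  "0" = "reached leaf node",
  "1" = "failed to meet minimum score threshold at inner node",
  "2" = "failed to meet minimum score of training sequences at inner node",
  "3" = "sequence length shorter than minimum length of training sequences at inner node",
  "4" = "sequence length exceeded maximum length of training sequences at inner node",
  "5" = "nearest neighbor in training set does not belong to selected node (obsolete)",
  "6" = "node is supported by too few sequences",
  "7" = "reserved",
  "8" = "sequence could not be translated (amino acids only)",
  "9" = "translated sequence contains stop codon(s) (amino acids only)"
)

# Made-up codes standing in for the "reason" column of an output table
reason <- c(0L, 1L, 6L, 1L)

# Index by character so that code 0 maps to the first label
labels <- unname(reason_labels[as.character(reason)])

# Tally how often each termination reason occurred
table(labels)
```

Indexing by as.character(reason) rather than by the integer itself avoids an off-by-one error, because R vectors are 1-indexed while the codes start at 0.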
Value
a data.frame.
Author(s)
Shaun Wilkinson
References
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Johnson JB, Omland KS (2004) Model selection in ecology and evolution. Trends in Ecology and Evolution. 19, 101-108.
See Also
Examples
data(whales)
data(whale_taxonomy)
## use all sequences except first one to train the classifier
set.seed(999)
tree <- learn(whales[-1], db = whale_taxonomy, maxiter = 5, cores = 2)
## find predicted lineage for first sequence
classify(whales[1], tree)
## compare with actual lineage
taxID <- as.integer(gsub(".+\\|", "", names(whales)[1]))
get_lineage(taxID, whale_taxonomy)