taxMachine {microclass} | R Documentation |
Classifying 16S sequences
Description
Optimized classification of 16S sequence data.
Usage
taxMachine(
sequence,
model.in.memory = TRUE,
model.on.disk = FALSE,
verbose = TRUE,
chunk.size = 10000
)
Arguments
sequence |
Character vector with DNA sequences. |
model.in.memory |
Logical indicating if model should be cached in memory (default=TRUE). |
model.on.disk |
Logical or text, for reading/saving models, see Details below (default=FALSE). |
verbose |
Logical, if |
chunk.size |
The number of sequences to classify in each iteration of the loop (default=10000). |
Details
This function provides optimized taxonomy classifications from 16S sequence data.
All sequences are classified to the genus level based on a Multinomial model (see multinomTrain) trained on the designed consensus taxonomy data set contax.trim found in the R-package microcontax. The word length K=8 has been used in the model.
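To make the notion of word length concrete, the following base R sketch counts overlapping words (K-mers) of length K=8 in a single sequence. The helper count_kmers is hypothetical and written only for illustration; it is not part of microclass, which presumably uses its own optimized counting code.

```r
# Count overlapping words (K-mers) of length K in one DNA sequence.
# count_kmers is a hypothetical helper for illustration, not a microclass function.
count_kmers <- function(dna, K = 8) {
  dna <- toupper(dna)
  n <- nchar(dna)
  if (n < K) return(table(character(0)))   # sequence too short to hold any word
  starts <- seq_len(n - K + 1)             # start position of every word
  words <- substring(dna, starts, starts + K - 1)
  table(words)                             # frequency of each distinct word
}

kmers <- count_kmers("ACGTACGTACGT", K = 8)
print(kmers)
```

A sequence of length n contains n - K + 1 overlapping words, so the 12-base toy sequence above yields 5 words in total.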
To avoid saving fitted models in the package, a model is trained the first time you run taxMachine in an R session.
This takes only a few seconds, and the result is cached for later use if model.in.memory is TRUE.
If a path to an existing file with a trained model is supplied in model.on.disk, this Multinomial model is read from the file and used. If a path to a new file is supplied, the trained Multinomial model will be saved to that file.
The default (model.on.disk=FALSE) means no files are read/saved, while model.on.disk=TRUE will attempt to load/save models from the microclass/extdata directory.
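The train-once-then-reuse behaviour described above can be sketched in base R as follows. This is not the microclass source: get_model and train_model are hypothetical names, the RDS file format is an assumption, and the model.on.disk=TRUE case (the extdata directory) is left out.

```r
# Hypothetical sketch of in-memory and on-disk model caching; not the microclass source.
.cache <- new.env()

train_model <- function() {
  # Stand-in for the real training step (assumption).
  list(K = 8, id = sample.int(1e6, 1))
}

get_model <- function(model.in.memory = TRUE, model.on.disk = FALSE) {
  # 1. Reuse a model cached earlier in this session.
  if (model.in.memory && exists("model", envir = .cache, inherits = FALSE)) {
    return(get("model", envir = .cache))
  }
  # 2. Read from an existing file, otherwise train a new model.
  if (is.character(model.on.disk) && file.exists(model.on.disk)) {
    model <- readRDS(model.on.disk)
  } else {
    model <- train_model()
    # 3. A path to a new file means the trained model is saved there.
    if (is.character(model.on.disk)) saveRDS(model, model.on.disk)
  }
  if (model.in.memory) assign("model", model, envir = .cache)
  model
}

path <- tempfile(fileext = ".rds")
m1 <- get_model(model.in.memory = FALSE, model.on.disk = path)  # trains and saves
m2 <- get_model(model.in.memory = FALSE, model.on.disk = path)  # reads the saved file
```

With taxMachine itself, the corresponding call would pass the file path in the model.on.disk argument.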
Both verbose and chunk.size are used to monitor the progress, which is useful when classifying huge data sets, since this will take some time.
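How chunk.size and verbose might interact can be illustrated with a generic chunked loop. The function process_in_chunks is hypothetical, and the call to toupper is only a placeholder for the per-chunk classification step; the actual loop inside taxMachine may differ.

```r
# Generic chunked loop with progress messages; an illustration, not the taxMachine source.
process_in_chunks <- function(sequence, chunk.size = 10000, verbose = TRUE) {
  n <- length(sequence)
  if (n == 0) return(character(0))
  starts <- seq(1, n, by = chunk.size)      # first index of each chunk
  results <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + chunk.size - 1, n)
    results[[i]] <- toupper(sequence[idx])  # placeholder for classifying one chunk
    if (verbose) message("Processed ", max(idx), " of ", n, " sequences")
  }
  unlist(results)
}

out <- process_in_chunks(rep("acgt", 25), chunk.size = 10, verbose = TRUE)
```

A smaller chunk.size gives more frequent progress messages at the cost of more loop iterations.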
Value
A data.frame with one row for each sequence. The columns are Genus, D.score, R.score and P.recognize.
Genus is the predicted genus for each sequence. Note that every sequence gets a prediction, but some predictions are more reliable than others.
The D.score is a measure of how the predicted genus wins over all other genera in the race for being the chosen one. A large D.score means the winner stands out clearly, and we can be confident it is the correct genus. A D.score close to 0 means we have an uncertain classification. Only D.scores below 1.0 should be of any concern, see Liland et al (2016) for details.
The R.score is a measure of the model's ability to recognize the sequence. The more negative the R.score gets, the more unusual the sequence is compared to the training set (the contax.trim data set). The P.recognize is a rough probability of seeing an R.score this small, or smaller, given the training data. Thus, a very small P.recognize means the sequence is not really recognized, and the classification is worthless. A very negative R.score indicates either a sequence that is not 16S at all, many sequencing errors that have destroyed the read, or a completely new taxon never seen before. See Liland et al (2016) for details.
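As a minimal sketch of how these scores might be used, the code below flags rows of a result table. The data frame is a mock with invented values, not real taxMachine output. The D.score cutoff of 1.0 comes from the text above, while the P.recognize cutoff of 0.01 is an assumption chosen only for illustration.

```r
# Mock result with the columns described above; all values are invented for illustration.
tax.tab <- data.frame(
  Genus = c("Escherichia", "Bacillus", "Vibrio", "Listeria"),
  D.score = c(5.2, 0.4, 3.1, 2.0),
  R.score = c(-1.0, -0.8, -9.5, -1.2),
  P.recognize = c(0.60, 0.75, 0.0001, 0.40),
  stringsAsFactors = FALSE
)

d.cut <- 1.0   # from the text: only D.scores below 1.0 are of concern
p.cut <- 0.01  # assumed threshold for a "very small" P.recognize

# Unrecognized sequences are flagged first, since their classification is worthless.
tax.tab$Status <- ifelse(tax.tab$P.recognize < p.cut, "unrecognized",
                  ifelse(tax.tab$D.score < d.cut, "uncertain", "ok"))
print(tax.tab)
```

The same two comparisons can be applied directly to the data.frame returned by taxMachine.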
Author(s)
Lars Snipen and Kristian Hovde Liland
References
Liland, KH, Vinje, H, Snipen, L (2016). microclass - An R-package for 16S taxonomy classification. BMC Bioinformatics, xx:yy.
See Also
Examples
## Not run:
data(small.16S)
tax.tab <- taxMachine(small.16S$Sequence)
## End(Not run)