R: Classifying 16S sequences

taxMachine {microclass}

R Documentation

Classifying 16S sequences

Description

Optimized classification of 16S sequence data.

Usage

taxMachine(
  sequence,
  model.in.memory = TRUE,
  model.on.disk = FALSE,
  verbose = TRUE,
  chunk.size = 10000
)

Arguments

`sequence`	Character vector with DNA sequences.
`model.in.memory`	Logical indicating if model should be cached in memory (default=TRUE).
`model.on.disk`	Logical or text, for reading/saving models, see Details below (default=FALSE).
`verbose`	Logical, if `TRUE` progress is reported during computations (default=TRUE).
`chunk.size`	The number of sequences to classify in each iteration of the loop (default=10000).

Details

This function provides optimized taxonomy classifications from 16S sequence data.

All sequences are classified to the genus level based on a Multinomial model (see multinomTrain) trained on the designed consensus taxonomy data set contax.trim found in the R-package microcontax. The word length K=8 has been used in the model.

To avoid saving fitted models in the package, a model is trained the first time you run taxMachine in an R session. This takes only a few seconds, and the result is cached for later use if model.in.memory is TRUE.

If a path to an existing file with a trained model is supplied in model.on.disk, this Multinomial model is read from the file and used. If a path to a new file is supplied, the trained Multinomial model will be saved to that file. The default (model.on.disk=FALSE) means no files are read or saved, while model.on.disk=TRUE will attempt to load/save models from the microclass/extdata directory.
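As a hedged illustration of the file-based option described above, the sketch below passes a file path in model.on.disk. The file name and its location in a temporary directory are assumptions made for this example only, and the on-disk format is whatever taxMachine chooses. The call is guarded so that nothing runs unless the microclass package is installed.

```r
# Hypothetical file path for the trained model (name and location are
# assumptions for this sketch, not a package convention).
model.file <- file.path(tempdir(), "taxMachine_model.RData")

if (requireNamespace("microclass", quietly = TRUE)) {
  library(microclass)
  data(small.16S)

  # First call: the file does not exist yet, so according to the text above
  # the trained model is saved to model.file.
  tax.tab <- taxMachine(small.16S$Sequence, model.on.disk = model.file)

  # Later calls (for example in a new R session): the same path now points to
  # an existing file, so the model is read from disk instead.
  tax.tab2 <- taxMachine(small.16S$Sequence, model.on.disk = model.file)
}
```

Keeping the path in a single variable makes it easy to reuse the same model file across scripts.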

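Both verbose and chunk.size serve to monitor progress: sequences are classified in chunks of chunk.size, and with verbose=TRUE progress is reported along the way. This is useful when classifying huge data sets, where the computations take some time.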
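To make the role of chunk.size concrete, here is a minimal sketch of chunk-wise processing in base R. It only illustrates the general idea of splitting a character vector into chunks and reporting progress; it is not the internal code of taxMachine, and the toy sequences and the small chunk size are assumptions chosen for readability.

```r
# Toy input: 25 short made-up sequences (real 16S reads are much longer).
sequences <- rep("ACGTACGT", 25)
chunk.size <- 10

# Assign each sequence to a chunk number: 1,1,...,2,2,...,3,...
chunk.id <- ceiling(seq_along(sequences) / chunk.size)
chunks <- split(sequences, chunk.id)

# Loop over the chunks and report progress, as a verbose run might.
for (i in seq_along(chunks)) {
  message("Chunk ", i, " of ", length(chunks), ": ",
          length(chunks[[i]]), " sequences")
  # ...classification of chunks[[i]] would happen here...
}
```

A smaller chunk.size gives more frequent progress reports, while a larger one means fewer iterations of the loop.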

Value

A data.frame with one row for each sequence. The columns are Genus, D.score, R.score and P.recognize.

Genus is the predicted genus for each sequence. Note that every sequence gets a prediction, but the predictions vary in reliability; use the scores described below to judge them.

The D.score is a measure of how the predicted genus wins over all other genera in the race for being the chosen one. A large D.score means the winner stands out clearly, and we can be confident it is the correct genus. A D.score close to 0 means we have an uncertain classification. Only D.scores below 1.0 should be of any concern; see Liland et al (2016) for details.

The R.score is a measure of the model's ability to recognize the sequence. The more negative the R.score gets, the more unusual the sequence is compared to the training set (the contax.trim data set). The P.recognize is a rough probability of seeing an R.score this small, or smaller, given the training data. Thus, a very small P.recognize means the sequence is not really recognized, and the classification is worthless. A very negative R.score indicates either that the sequence is not 16S at all, that many sequencing errors have destroyed the read, or that it comes from a completely new taxon never seen before. See Liland et al (2016) for details.
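The scores can be used to flag questionable classifications. The sketch below works on a small made-up data.frame with the same four columns as the taxMachine output, so it runs without the package. The D.score limit of 1.0 comes from the text above, while the P.recognize cutoff of 0.01 and the Status column are assumptions for illustration, not recommendations from the package authors.

```r
# Hypothetical stand-in for the output of taxMachine(); all values are made up.
tax.tab <- data.frame(
  Genus       = c("Bacillus", "Escherichia", "Listeria", "Vibrio"),
  D.score     = c(5.2, 0.4, 3.1, 2.7),
  R.score     = c(-0.5, -1.1, -9.8, -0.9),
  P.recognize = c(0.60, 0.35, 0.0001, 0.45),
  stringsAsFactors = FALSE
)

d.limit <- 1.0    # from the text: only D.scores below 1.0 are of concern
p.limit <- 0.01   # assumed cutoff for "very small" P.recognize

uncertain    <- tax.tab$D.score < d.limit       # winner does not stand out
unrecognized <- tax.tab$P.recognize < p.limit   # sequence looks unlike training data

# Label each row; "unrecognized" overrides "uncertain" when both apply.
tax.tab$Status <- "ok"
tax.tab$Status[uncertain]    <- "uncertain"
tax.tab$Status[unrecognized] <- "unrecognized"

# Keep only the classifications that pass both checks.
reliable <- tax.tab[tax.tab$Status == "ok", ]
print(table(tax.tab$Status))
```

With a real result, replace the made-up data.frame by the output of taxMachine and adjust p.limit to the needs of the study.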

Author(s)

Lars Snipen and Kristian Hovde Liland

References

Liland, KH, Vinje, H, Snipen, L (2016). microclass - An R-package for 16S taxonomy classification. BMC Bioinformatics, xx:yy.

Examples

## Not run: 
data(small.16S)
tax.tab <- taxMachine(small.16S$Sequence)

## End(Not run)

[Package microclass version 1.2 Index]