multinomTrain {microclass} | R Documentation |
Training multinomial model
Description
Training the multinomial K-mer method on sequence data.
Usage
multinomTrain(sequence, taxon, K = 8, col.names = FALSE, n.pseudo = 100)
Arguments
sequence |
Character vector of 16S sequences. |
taxon |
Character vector of taxon labels for each sequence. |
K |
Word length (integer). |
col.names |
Logical indicating if column names should be added to the trained model matrix. |
n.pseudo |
Number of pseudo-counts to use (a positive number, need not be an integer). The special case -1 returns only word counts, not log-probabilities. |
Details
The training step of the multinomial method (Vinje et al., 2015) consists of counting K-mers
in all sequences and computing the multinomial probability of each K-mer within each unique taxon.
n.pseudo pseudo-counts are added, divided equally over all K-mers, before the probabilities are estimated. The optimal choice of n.pseudo will depend on K and the training data set. The default value n.pseudo=100 has proven good for K=8 and the contax.trim data set (see the microcontax R-package).
Adding the actual K-mers as column names (col.names=TRUE) will slow down the computations.
The relative taxon sizes are also computed, and may be used as an empirical prior in the classification step (see "prior" below).
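The computation described above can be sketched in base R on toy data. This is an illustration only, not the package's implementation: the toy sequences, the helper countKmers and the choice K = 2 are assumptions made for the example, and "divided equally over all K-mers" is read here as adding n.pseudo / 4^K to every word count.

```r
# Toy data (hypothetical): three short sequences from two taxa.
sequence <- c("ACGTACGT", "ACGTTTGA", "GGGCCCAA")
taxon    <- c("taxA", "taxA", "taxB")
K        <- 2
n.pseudo <- 1

# All 4^K possible words over the DNA alphabet, in a fixed order.
bases <- c("A", "C", "G", "T")
words <- apply(expand.grid(rep(list(bases), K), stringsAsFactors = FALSE),
               1, paste, collapse = "")

# Count overlapping K-mers in one sequence (illustrative helper,
# not a function from microclass).
countKmers <- function(s, K, words) {
  starts <- seq_len(nchar(s) - K + 1)
  kmers  <- substring(s, starts, starts + K - 1)
  as.vector(table(factor(kmers, levels = words)))
}

# Sum the counts over all sequences of each taxon: one row per unique taxon.
taxa   <- sort(unique(taxon))
counts <- t(sapply(taxa, function(tx) {
  rowSums(sapply(sequence[taxon == tx], countKmers, K = K, words = words))
}))
colnames(counts) <- words

# Spread n.pseudo pseudo-counts equally over all words, then normalise each row.
pseudo <- counts + n.pseudo / length(words)
Fitted <- pseudo / rowSums(pseudo)

# Relative taxon sizes, usable as an empirical prior.
prior <- as.numeric(table(factor(taxon, levels = taxa)) / length(taxon))
```

With any positive n.pseudo every entry of the sketched Fitted matrix is strictly positive and every row sums to 1, which matches the properties stated for the trained model.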
Value
A list with two elements. The first element is Method, which is the text "multinom" in this case. The second element is Fitted, which is a matrix of probabilities with one row for each unique taxon and one column for each possible word of length K. The sum of each row is 1.0. No probabilities are 0 if n.pseudo > 0.0.
The matrix Fitted has an attribute "prior" (retrievable with attr(Fitted, "prior")) that contains the relative taxon sizes.
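A minimal sketch of inspecting the returned list, assuming the microclass package is installed. The toy sequences and the small values of K and n.pseudo are hypothetical and chosen only to keep the example fast; real use would pass full-length 16S sequences.

```r
# Toy input (hypothetical), one taxon label per sequence.
sequence <- c("ACGTACGTAC", "ACGTTTGACA", "GGGCCCAATT")
taxon    <- c("taxA", "taxA", "taxB")

# Only run the training call if microclass is available.
if (requireNamespace("microclass", quietly = TRUE)) {
  model <- microclass::multinomTrain(sequence, taxon, K = 3, n.pseudo = 1)
  print(model$Method)                 # name of the method
  print(dim(model$Fitted))            # taxa by possible words of length K
  print(attr(model$Fitted, "prior"))  # relative taxon sizes
}
```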
Author(s)
Kristian Hovde Liland and Lars Snipen.
References
Vinje, H, Liland, KH, Almøy, T, Snipen, L. (2015). Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics, 16:205.
See Also
Examples
# See examples for multinomClassify