Viterbi {aphid} | R Documentation |
The Viterbi algorithm.
Description
The Viterbi
function finds the optimal path of a sequence through a HMM
or PHMM and returns its full (log) probability or log-odds score.
Usage
Viterbi(x, y, ...)
## S3 method for class 'PHMM'
Viterbi(
x,
y,
qe = NULL,
logspace = "autodetect",
type = "global",
odds = TRUE,
offset = 0,
windowspace = "all",
DI = FALSE,
ID = FALSE,
cpp = TRUE,
...
)
## S3 method for class 'HMM'
Viterbi(x, y, logspace = "autodetect", cpp = TRUE, ...)
## Default S3 method:
Viterbi(
x,
y,
type = "global",
d = 8,
e = 2,
residues = NULL,
S = NULL,
windowspace = "all",
offset = 0,
cpp = TRUE,
...
)
Arguments
x |
an object of class |
y |
a vector of mode "character" or "raw" (a "DNAbin" or "AAbin"
object) representing a single sequence hypothetically emitted by
the model in |
... |
additional arguments to be passed between methods. |
qe |
an optional named vector of background residue frequencies (only
applicable if x is a PHMM). If |
logspace |
logical indicating whether the emission and transition
probabilities of x are logged. If |
type |
character string indicating whether insert and delete states at the beginning and end of the path should count towards the final score ('global'; default), or not ('semiglobal'), or whether the highest scoring sub-path should be returned ('local'). |
odds |
logical, indicates whether the returned scores should be odds ratios (TRUE) or full logged probabilities (FALSE). |
offset |
column score offset to specify level of greediness. Defaults to -0.1 bits for PHMM x PHMM alignments (as recommended by Soding (2005)), and 0 otherwise. |
windowspace |
a two-element integer vector providing the search space for dynamic programming (see Wilbur & Lipman 1983 for details). The first element should be negative, and represent the lowermost diagonal of the dynammic programming array, and the second element should be positive, representing the leftmost diagonal. Alternatively, if the the character string "all" is passed (the default setting) the entire dynamic programming array will be computed. |
DI |
logical indicating whether delete-insert transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE. |
ID |
logical indicating whether insert-delete transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE. |
cpp |
logical, indicates whether the dynamic programming matrix should be filled using compiled C++ functions (default; many times faster). The FALSE option is primarily retained for bug fixing and experimentation. |
d |
gap opening penalty (in bits) for sequence vs. sequence alignment. Defaults to 8. |
e |
gap extension penalty (in bits) for sequence vs. sequence alignment. Defaults to 2. |
residues |
either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. |
S |
an optional scoring matrix with rows and columns named according to the residue alphabet. Only applicable when both x and y are sequences (Needleman-Wunch or Smith-Waterman alignments). Note that for Smith-Waterman local alignments, scores for mismatches should generally take negative values to avoid spurious alignments. If NULL default settings are used. Default scoring matrices are 'NUC.4.4' for For DNAbin objects, and 'MATCH' (matches are scored 1 and mismatches are scored -1) for AAbin objects and character sequences. |
Details
This function is a wrapper for a compiled C++ function that recursively fills a dynamic programming matrix with probabilities, and calculates the (logged) probability and optimal path of a sequence through a HMM or PHMM.
If x is a PHMM and y is a sequence, the path is represented as an integer vector containing zeros, ones and twos, where a zero represents a downward transition, a one represents a diagonal transition downwards and left, and a two represents a left transition in the dynamic programming matrix (see Durbin et al (1998) chapter 2.3). This translates to 0 = delete state, 1 = match state and 2 = insert state.
If x and y are both sequences, the function implements the Needleman-Wunch or Smith Waterman algorithm depending on the type of alignment specified. In this case, a zero in the path refers to x aligning to a gap in y, a one refers to a match, and a two refers to y aligning to a gap in x.
If x is a standard hidden Markov model (HMM) and y is a sequence, each integer in the path represents a state in the model. Note that the path elements can take values between 0 and one less than number of states, as in the C/C++ indexing style rather than R's.
For a thorough explanation of the backward, forward and Viterbi algorithms, see Durbin et al (1998) chapters 3.2 (HMMs) and 5.4 (PHMMs).
Value
an object of class "DPA"
, which is a list including the
score, the dynammic programming array, and the optimal path (an integer
vector, see details section).
Author(s)
Shaun Wilkinson
References
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951-960.
Wilbur WJ, Lipman DJ (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA, 80, 726-730.
See Also
Examples
## Viterbi algorithm for standard HMMs:
## The dishonest casino example from Durbin et al (1998) chapter 3.2
states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)
### Define the transition probability matrix
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)
### Define the emission probability matrix
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)
### Build and plot the HMM object
x <- structure(list(A = A, E = E), class = "HMM")
plot(x, main = "Dishonest casino HMM")
### Find optimal path of sequence
data(casino)
casino.DPA <- Viterbi(x, casino)
casino.DPA$score # full (log) prob of sequence given model = -538.8109
### Show optinal path path as indices
casino.DPA$path
### Show optimal path as character strings
rownames(x$E)[casino.DPA$path + 1]
##
## Needleman-Wunch pairwise sequence alignment:
## Pairwise protein alignment example from Durbin et al (1998) chapter 2.3
x <- c("H", "E", "A", "G", "A", "W", "G", "H", "E", "E")
y <- c("P", "A", "W", "H", "E", "A", "E")
Viterbi(x, y, d = 8, e = 2, type = "global")
###
## Viterbi algorithm for profile HMMs:
## Small globin alignment data from Durbin et al (1998) Figure 5.3
data(globins)
### Derive a profile hidden Markov model from the alignment
globins.PHMM <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
plot(globins.PHMM, main = "Profile hidden Markov model for globins")
### Simulate a random sequence from the model
suppressWarnings(RNGversion("3.5.0"))
set.seed(999)
simulation <- generate(globins.PHMM, size = 20)
simulation ## "F" "S" "A" "N" "N" "D" "W" "E"
### Calculate the odds of the optimal path of the sequence given the model
x <- Viterbi(globins.PHMM, simulation, odds = FALSE)
x # -23.07173
### Show dynammic programming array
x$array
### Show the optimal path as an integer vector
x$path
### Show the optimal path as either delete states, matches or insert states
c("D", "M", "I")[x$path + 1]
### Correctly predicted the actual path:
names(simulation)