R: The Viterbi algorithm.

Viterbi {aphid}

R Documentation

The Viterbi algorithm.

Description

The Viterbi function finds the optimal path of a sequence through a HMM or PHMM and returns its full (log) probability or log-odds score.

Usage

Viterbi(x, y, ...)

## S3 method for class 'PHMM'
Viterbi(
  x,
  y,
  qe = NULL,
  logspace = "autodetect",
  type = "global",
  odds = TRUE,
  offset = 0,
  windowspace = "all",
  DI = FALSE,
  ID = FALSE,
  cpp = TRUE,
  ...
)

## S3 method for class 'HMM'
Viterbi(x, y, logspace = "autodetect", cpp = TRUE, ...)

## Default S3 method:
Viterbi(
  x,
  y,
  type = "global",
  d = 8,
  e = 2,
  residues = NULL,
  S = NULL,
  windowspace = "all",
  offset = 0,
  cpp = TRUE,
  ...
)

Arguments

`x`	an object of class `HMM` or `PHMM`. Optionally, both x and y can be sequences (character vectors or DNAbin/AAbin objects), in which case the operation becomes either the Needleman-Wunsch (global alignment) or Smith-Waterman (local alignment) algorithm.
`y`	a vector of mode "character" or "raw" (a "DNAbin" or "AAbin" object) representing a single sequence hypothetically emitted by the model in `x`. Optionally, both x and y can be profile hidden Markov models (object class "PHMM"), in which case the sum of log-odds algorithm of Soding (2005) is used.
`...`	additional arguments to be passed between methods.
`qe`	an optional named vector of background residue frequencies (only applicable if x is a PHMM). If `qe = NULL` the function looks for a qe vector as an attribute of the PHMM. If none is found, equal background residue frequencies are assumed.
`logspace`	logical indicating whether the emission and transition probabilities of x are logged. If `logspace = "autodetect"` (default setting), the function will automatically detect whether the probabilities are logged, returning an error if inconsistencies are found. Note that autodetection increases the computational overhead; specifying `TRUE` or `FALSE` can therefore reduce the running time.
`type`	character string indicating whether insert and delete states at the beginning and end of the path should count towards the final score ('global'; default), or not ('semiglobal'), or whether the highest scoring sub-path should be returned ('local').
`odds`	logical, indicates whether the returned scores should be log-odds scores (TRUE) or full logged probabilities (FALSE).
`offset`	column score offset to specify level of greediness. Defaults to -0.1 bits for PHMM x PHMM alignments (as recommended by Soding (2005)), and 0 otherwise.
`windowspace`	a two-element integer vector providing the search space for dynamic programming (see Wilbur & Lipman 1983 for details). The first element should be negative, and represent the lowermost diagonal of the dynamic programming array, and the second element should be positive, representing the leftmost diagonal. Alternatively, if the character string "all" is passed (the default setting) the entire dynamic programming array will be computed.
`DI`	logical indicating whether delete-insert transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
`ID`	logical indicating whether insert-delete transitions should be allowed in the profile hidden Markov model (if applicable). Defaults to FALSE.
`cpp`	logical, indicates whether the dynamic programming matrix should be filled using compiled C++ functions (default; many times faster). The FALSE option is primarily retained for bug fixing and experimentation.
`d`	gap opening penalty (in bits) for sequence vs. sequence alignment. Defaults to 8.
`e`	gap extension penalty (in bits) for sequence vs. sequence alignment. Defaults to 2.
`residues`	either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors.
`S`	an optional scoring matrix with rows and columns named according to the residue alphabet. Only applicable when both x and y are sequences (Needleman-Wunsch or Smith-Waterman alignments). Note that for Smith-Waterman local alignments, scores for mismatches should generally take negative values to avoid spurious alignments. If NULL, default settings are used. Default scoring matrices are 'NUC.4.4' for DNAbin objects, and 'MATCH' (matches are scored 1 and mismatches are scored -1) for AAbin objects and character sequences.
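
As an illustration of the layout expected for `S`, the following base R sketch builds a match/mismatch matrix with rows and columns named by the residue alphabet. The helper name `match_matrix` and the four-letter alphabet are assumptions made for this example and are not part of aphid.

```r
# Hypothetical helper (not part of aphid): a square scoring matrix whose
# rows and columns are named by the residue alphabet.
match_matrix <- function(residues, match = 1, mismatch = -1) {
  n <- length(residues)
  S <- matrix(mismatch, nrow = n, ncol = n,
              dimnames = list(residues, residues))
  diag(S) <- match  # identical residues get the match score
  S
}

S <- match_matrix(c("A", "C", "G", "T"))  # assumed alphabet, for illustration only
S["A", "A"]  # match score
S["A", "C"]  # mismatch score (negative, as advised for local alignments)
```

A matrix of this shape could then be supplied through the `S` argument when both x and y are sequences, for example `Viterbi(x, y, S = S, type = "local")`.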

Details

This function is a wrapper for a compiled C++ function that recursively fills a dynamic programming matrix with probabilities, and calculates the (logged) probability and optimal path of a sequence through a HMM or PHMM.

If x is a PHMM and y is a sequence, the path is represented as an integer vector containing zeros, ones and twos, where a zero represents a downward transition, a one represents a diagonal transition downwards and left, and a two represents a left transition in the dynamic programming matrix (see Durbin et al (1998) chapter 2.3). This translates to 0 = delete state, 1 = match state and 2 = insert state.

If x and y are both sequences, the function implements the Needleman-Wunsch or Smith-Waterman algorithm depending on the type of alignment specified. In this case, a zero in the path refers to x aligning to a gap in y, a one refers to a match, and a two refers to y aligning to a gap in x.
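
The path encoding for sequence versus sequence alignment can be made concrete with a short base R sketch that expands a 0/1/2 path into two gapped sequences. The helper `path_to_alignment` is hypothetical (it is not an aphid function), it assumes a global alignment in which the path starts at the first residue of both sequences, and the path below is written by hand for illustration rather than taken from `Viterbi` output.

```r
# Hypothetical helper (not part of aphid): expand a 0/1/2 path into a
# two-row character matrix of gapped sequences.
# Assumes a global alignment, so the path consumes x and y from their starts.
path_to_alignment <- function(x, y, path) {
  xi <- 0L
  yi <- 0L
  ax <- character(length(path))
  ay <- character(length(path))
  for (k in seq_along(path)) {
    if (path[k] == 0L) {          # x residue aligned to a gap in y
      xi <- xi + 1L
      ax[k] <- x[xi]
      ay[k] <- "-"
    } else if (path[k] == 1L) {   # x residue aligned to a y residue
      xi <- xi + 1L
      yi <- yi + 1L
      ax[k] <- x[xi]
      ay[k] <- y[yi]
    } else {                      # y residue aligned to a gap in x
      yi <- yi + 1L
      ax[k] <- "-"
      ay[k] <- y[yi]
    }
  }
  rbind(x = ax, y = ay)
}

x <- c("H", "E", "A", "G", "A", "W", "G", "H", "E", "E")
y <- c("P", "A", "W", "H", "E", "A", "E")
path <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 2, 1)  # hand-written example path
aln <- path_to_alignment(x, y, path)
apply(aln, 1, paste, collapse = "")
```

For local alignments the path covers only a sub-sequence of each input, so a helper of this kind would also need the start positions of the aligned region.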

If x is a standard hidden Markov model (HMM) and y is a sequence, each integer in the path represents a state in the model. Note that the path elements can take values between 0 and one less than the number of states, as in the C/C++ indexing style rather than R's.
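
To show where the zero-based state indices come from, the following is a minimal base R sketch of the log-space Viterbi recursion for a standard HMM. It is not aphid's implementation: the function name `viterbi_sketch` is hypothetical, transitions back to the Begin/End state are ignored, and the matrices are assumed to follow the layout used in the Examples (the first row and column of `A` is the Begin state, and `E` has one row per emitting state with residues as column names).

```r
# Hypothetical sketch (not aphid's implementation): log-space Viterbi for an HMM.
# A: (K + 1) x (K + 1) transition matrix; row/column 1 is the Begin state.
# E: K x M emission matrix with residues as column names.
# y: character vector of observed residues.
viterbi_sketch <- function(A, E, y) {
  logA <- log(A)
  logE <- log(E)
  K <- nrow(E)
  n <- length(y)
  V <- matrix(-Inf, nrow = K, ncol = n)         # best log probability ending in state k at position i
  P <- matrix(NA_integer_, nrow = K, ncol = n)  # back-pointers to the previous state
  V[, 1] <- logA[1, -1] + logE[, y[1]]          # leave the Begin state, emit the first residue
  for (i in seq_len(n)[-1]) {
    for (k in seq_len(K)) {
      cand <- V[, i - 1] + logA[-1, k + 1]      # arrive in state k from each emitting state
      P[k, i] <- which.max(cand)
      V[k, i] <- max(cand) + logE[k, y[i]]
    }
  }
  path <- integer(n)
  path[n] <- which.max(V[, n])
  for (i in rev(seq_len(n))[-1]) path[i] <- P[path[i + 1], i + 1]  # trace back
  list(score = max(V[, n]), path = path - 1L)   # shift to zero-based state indices
}

states <- c("Begin", "Fair", "Loaded")
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3,
            dimnames = list(states, states))
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE,
            dimnames = list(states[-1], paste(1:6)))
rolls <- c("1", "6", "6", "6", "6", "6", "6", "6", "3")  # made-up rolls for illustration
res <- viterbi_sketch(A, E, rolls)
rownames(E)[res$path + 1]  # add 1 to convert back to R's one-based indexing
```

The final line mirrors the conversion used in the Examples: adding 1 to each path element recovers the row of the emission matrix for that state.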

For a thorough explanation of the backward, forward and Viterbi algorithms, see Durbin et al (1998) chapters 3.2 (HMMs) and 5.4 (PHMMs).

Value

an object of class "DPA", which is a list including the score, the dynamic programming array, and the optimal path (an integer vector; see the Details section).

Author(s)

Shaun Wilkinson

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951-960.

Wilbur WJ, Lipman DJ (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci USA, 80, 726-730.

Examples

  ## Viterbi algorithm for standard HMMs:
  ## The dishonest casino example from Durbin et al (1998) chapter 3.2
  states <- c("Begin", "Fair", "Loaded")
  residues <- paste(1:6)
  ### Define the transition probability matrix
  A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
  dimnames(A) <- list(from = states, to = states)
  ### Define the emission probability matrix
  E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
  dimnames(E) <- list(states = states[-1], residues = residues)
  ### Build and plot the HMM object
  x <- structure(list(A = A, E = E), class = "HMM")
  plot(x, main = "Dishonest casino HMM")
  ### Find optimal path of sequence
  data(casino)
  casino.DPA <- Viterbi(x, casino)
  casino.DPA$score # full (log) prob of sequence given model = -538.8109
  ### Show optimal path as indices
  casino.DPA$path
  ### Show optimal path as character strings
  rownames(x$E)[casino.DPA$path + 1]
  ##
  ## Needleman-Wunsch pairwise sequence alignment:
  ## Pairwise protein alignment example from Durbin et al (1998) chapter 2.3
  x <- c("H", "E", "A", "G", "A", "W", "G", "H", "E", "E")
  y <- c("P", "A", "W", "H", "E", "A", "E")
  Viterbi(x, y,  d = 8, e = 2, type = "global")
  ###
  ## Viterbi algorithm for profile HMMs:
  ## Small globin alignment data from Durbin et al (1998) Figure 5.3
  data(globins)
  ### Derive a profile hidden Markov model from the alignment
  globins.PHMM <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
  plot(globins.PHMM, main = "Profile hidden Markov model for globins")
  ### Simulate a random sequence from the model
  suppressWarnings(RNGversion("3.5.0"))
  set.seed(999)
  simulation <- generate(globins.PHMM, size = 20)
  simulation ## "F" "S" "A" "N" "N" "D" "W" "E"
  ### Calculate the full (log) probability of the optimal path of the sequence given the model
  x <- Viterbi(globins.PHMM, simulation, odds = FALSE)
  x # -23.07173
  ### Show dynamic programming array
  x$array
  ### Show the optimal path as an integer vector
  x$path
  ### Show the optimal path as either delete states, matches or insert states
  c("D", "M", "I")[x$path + 1]
  ### Correctly predicted the actual path:
  names(simulation)

[Package aphid version 1.3.5 Index]