aphid {aphid} | R Documentation |
The aphid package for analysis with profile hidden Markov models.
Description
aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Functions are included for multiple and pairwise sequence alignment, model construction and parameter optimization, calculation of conditional probabilities (using the forward, backward and Viterbi algorithms), tree-based sequence weighting, sequence simulation, and file import/export compatible with the HMMER software package. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.
Details
The aphid package is based on the algorithms outlined in the book 'Biological sequence analysis: probabilistic models of proteins and nucleic acids' by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison. This book is highly recommended for those wishing to develop a better understanding of HMMs and PHMMs, regardless of prior experience. Many of the examples in the function help pages are taken directly from the book, so that readers can learn to use the package as they work through the chapters.
There are also excellent resources available for those wishing to use profile hidden Markov models outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM.
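The HMMER round trip might look like the following minimal sketch. It derives a small profile HMM from the globins alignment bundled with the package, writes it to a temporary file and reads it back. The arguments residues = "AMINO" and seqweights = NULL are assumptions that should be checked against the derivePHMM help page, and the temporary file path is arbitrary.

```r
library(aphid)

# Small globin alignment shipped with the package (Durbin et al 1998, Figure 5.3).
data(globins)

# Derive a profile HMM from the alignment.
# The argument values here are assumptions; see ?derivePHMM.
x <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)

# Write the model to a HMMER v3 text file, then parse it back into R.
f <- tempfile(fileext = ".hmm")
writePHMM(x, file = f)
y <- readPHMM(f)

class(y)
```

Values read back from the file may differ slightly from the in-memory model, since the text format stores rounded numbers.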
The aphid package is designed to work in conjunction with the "DNAbin" and "AAbin" object types produced by the ape package (Paradis et al 2004, Paradis 2012). The ape package is an essential piece of software for those using R for biological sequence analysis, and provides a binary coding format for nucleotides and amino acids that maximizes memory and speed efficiency. While aphid also works with standard character vectors and matrices, it may not recognize the DNA and amino acid ambiguity codes in that form and therefore is not guaranteed to treat them appropriately.
To maximize speed, the low-level dynamic programming functions such as Viterbi, forward and backward are written in C++ with the help of the Rcpp package (Eddelbuettel & Francois 2011). Note that R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation.
Classes
The aphid package creates two primary object classes, "HMM" (hidden Markov models) and "PHMM" (profile hidden Markov models), with the functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (denoted E and A), vectors of non-position-specific background emission and transition probabilities (denoted qe and qa) and other model metadata. Objects of class "DPA" (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are primarily created for succinct console printing.
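To make the list structure concrete, the sketch below assembles an "HMM"-like object by hand for the dishonest casino example, using base R only. It assumes that a list holding just the A and E matrices, with a silent "Begin" state in the first row and column of A, is enough to stand in for a model; objects returned by deriveHMM carry further elements (such as qe and qa) that are not reproduced here.

```r
# State and residue labels; "Begin" is a silent (non-emitting) start state.
states <- c("Begin", "Fair", "Loaded")
residues <- as.character(1:6)

# Transition matrix A: rows are "from" states, columns are "to" states.
A <- matrix(c(0, 0.99, 0.01,
              0, 0.95, 0.05,
              0, 0.10, 0.90),
            nrow = 3, byrow = TRUE,
            dimnames = list(from = states, to = states))

# Emission matrix E: one row per emitting state, one column per residue.
E <- matrix(c(rep(1/6, 6),          # fair die
              rep(1/10, 5), 1/2),   # loaded die favours a six
            nrow = 2, byrow = TRUE,
            dimnames = list(states = states[-1], residues = residues))

# Hand-built object: a plain list tagged with the "HMM" class (an assumption,
# not the full structure produced by deriveHMM).
x <- structure(list(A = A, E = E), class = "HMM")

# Each row of A and E should be a probability distribution.
rowSums(x$A)
rowSums(x$E)
```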
Functions
Brief descriptions of the primary aphid functions are provided below, with links to their help pages.
File import and export
- readPHMM parses a HMMER text file into R and creates an object of class "PHMM"
- writePHMM writes a "PHMM" object to a text file in HMMER v3 format
Visualization
- plot.HMM plots a "HMM" object as a cyclic directed graph
- plot.PHMM plots a "PHMM" object as a directed graph with sequential modules consisting of match, insert and delete states
Model building and training
- deriveHMM builds a "HMM" object from a list of training sequences
- derivePHMM builds a "PHMM" object from a multiple sequence alignment or a list of non-aligned sequences
- map optimizes profile hidden Markov model construction using the maximum a posteriori algorithm
- train optimizes the parameters of a "HMM" or "PHMM" object using a list of training sequences
Sequence alignment and weighting
Conditional probabilities
- Viterbi finds the optimal path of a sequence through a HMM or PHMM, and returns its log odds or probability given the model
- forward finds the full probability of a sequence given a HMM or PHMM using the forward algorithm
- backward finds the full probability of a sequence given a HMM or PHMM using the backward algorithm
- posterior finds the position-specific posterior probability of a sequence given a HMM or PHMM
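The forward and backward algorithms arrive at the same full probability from opposite ends of the sequence. The base R sketch below illustrates this on a toy two-state model; it is not the package's implementation. The parameter values, the rolls vector and the helper names forward_prob and backward_prob are all hypothetical. It works with raw probabilities and no end state, which is adequate for short sequences only; longer sequences would need log space to avoid underflow.

```r
# Toy two-state model in the spirit of the dishonest casino (illustrative values).
states <- c("Fair", "Loaded")
start <- c(Fair = 0.99, Loaded = 0.01)               # Begin -> state
trans <- matrix(c(0.95, 0.05,
                  0.10, 0.90),
                nrow = 2, byrow = TRUE,
                dimnames = list(from = states, to = states))
emit <- rbind(Fair = rep(1/6, 6),
              Loaded = c(rep(1/10, 5), 1/2))         # columns are die faces 1..6
rolls <- c(6, 6, 1, 6, 3, 6)                         # hypothetical observations

# Forward: f[k] holds P(x_1..x_i, state_i = k), updated left to right.
forward_prob <- function(obs) {
  f <- start * emit[, obs[1]]
  for (o in obs[-1]) f <- as.vector(f %*% trans) * emit[, o]
  sum(f)
}

# Backward: b[k] holds P(x_{i+1}..x_L | state_i = k), updated right to left.
backward_prob <- function(obs) {
  b <- c(1, 1)
  for (o in rev(obs[-1])) b <- as.vector(trans %*% (emit[, o] * b))
  sum(start * emit[, obs[1]] * b)
}

# Both recursions give the same full probability of the sequence.
all.equal(forward_prob(rolls), backward_prob(rolls))
```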
Sequence simulation
- generate.HMM simulates a random sequence from an HMM
- generate.PHMM simulates a random sequence from a PHMM
Datasets
- substitution a collection of DNA and amino acid substitution matrices from NCBI including the PAM, BLOSUM, GONNET, DAYHOFF and NUC matrices
- casino data from the dishonest casino example of Durbin et al (1998) chapter 3.2
- globins small globin alignment data from Durbin et al (1998) Figure 5.3
Author(s)
Shaun Wilkinson
References
Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.
Eddelbuettel D, Francois R (2011) Rcpp: seamless R and C++ integration. Journal of Statistical Software 40, 1-18.
Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39, W29-W37. http://hmmer.org/.
HMMER: biosequence analysis using profile hidden Markov models. http://www.hmmer.org.
NCBI index of substitution matrices. ftp://ftp.ncbi.nih.gov/blast/matrices/.
Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.
Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.