aphid {aphid}

R Documentation

The aphid package for analysis with profile hidden Markov models.

Description

aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Functions are included for multiple and pairwise sequence alignment, model construction and parameter optimization, calculation of conditional probabilities (using the forward, backward and Viterbi algorithms), tree-based sequence weighting, sequence simulation, and file import/export compatible with the HMMER software package. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.

Details

The aphid package is based on the algorithms outlined in the book 'Biological sequence analysis: probabilistic models of proteins and nucleic acids' by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison. This book is highly recommended for those wishing to develop a better understanding of HMMs and PHMMs, regardless of prior experience. Many of the examples in the function help pages are taken directly from the book, so that readers can learn to use the package as they work through the chapters.

There are also excellent resources available for those wishing to use profile hidden Markov models outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM.

The aphid package is designed to work in conjunction with the "DNAbin" and "AAbin" object types produced by the ape package (Paradis et al 2004, 2012). This is an essential piece of software for those using R for biological sequence analysis, and provides a binary coding format for nucleotides and amino acids that maximizes memory and speed efficiency. While aphid also works with standard character vectors and matrices, it may not recognize the DNA and amino acid ambiguity codes in these objects and is therefore not guaranteed to treat them appropriately.
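
As a rough illustration of the "DNAbin" workflow described above, the following sketch builds a small profile HMM from the first 60 columns of the woodmouse alignment shipped with ape. It assumes both ape and aphid are installed; the column range and the choice of seqweights = NULL (no sequence weighting) are arbitrary choices made to keep the example fast, not recommendations.

```r
library(ape)
library(aphid)

# woodmouse is a "DNAbin" alignment distributed with ape
data(woodmouse)

# Keep only the first 60 alignment columns so the example runs quickly
# (an arbitrary cut made for illustration)
x <- woodmouse[, 1:60]
class(x)

# Derive a profile HMM directly from the binary-coded alignment,
# with sequence weighting switched off
woodmouse_phmm <- derivePHMM(x, seqweights = NULL)
class(woodmouse_phmm)
```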

To maximize speed, the low-level dynamic programming functions such as Viterbi, forward and backward are written in C++ with the help of the Rcpp package (Eddelbuettel & Francois 2011). Note that R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation.
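
The sketch below shows one way to compare the compiled and pure-R code paths. It assumes that Viterbi accepts a logical cpp argument for switching between them, as described on its help page; consult ?Viterbi to confirm the argument for the installed version.

```r
library(aphid)

# Dishonest casino data: die rolls named by their hidden state
data(casino)
casino_hmm <- deriveHMM(list(casino))

# Default call, which uses the compiled C++ routine
fast <- Viterbi(casino_hmm, casino)

# Same computation using the R implementation
# (cpp = FALSE is assumed here to select the R version)
slow <- Viterbi(casino_hmm, casino, cpp = FALSE)

# The two scores are expected to agree up to floating point error
all.equal(fast$score, slow$score)
```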

Classes

The aphid package creates two primary object classes, "HMM" (hidden Markov models) and "PHMM" (profile hidden Markov models) with the functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (denoted E and A), vectors of non-position-specific background emission and transition probabilities (denoted qe and qa) and other model metadata. Objects of class "DPA" (dynamic programming array) are also generated by the Viterbi and forward/backward functions. These are primarily created for succinct console printing.
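
A minimal sketch for inspecting these classes, assuming aphid is installed. The model is derived from the casino dataset purely for illustration; the exact set of list elements may differ between package versions, so names() is used to list them rather than assuming a fixed layout.

```r
library(aphid)

data(casino)

# An "HMM" object is a list holding the A and E matrices plus metadata
casino_hmm <- deriveHMM(list(casino))
class(casino_hmm)
names(casino_hmm)
dim(casino_hmm$A)  # transition matrix
dim(casino_hmm$E)  # emission matrix

# The dynamic programming functions return "DPA" objects,
# which print compactly at the console
casino_dpa <- Viterbi(casino_hmm, casino)
class(casino_dpa)
casino_dpa
```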

Functions

Brief descriptions of the primary aphid functions are provided below, with links to their help pages.

File import and export

readPHMM parses a HMMER text file into R and creates an object of class "PHMM"
writePHMM writes a "PHMM" object to a text file in HMMER v3 format
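
The round trip through a HMMER-format file can be sketched as follows, assuming aphid is installed. The temporary file name is arbitrary, and the model is built from the globins dataset only to have something to write.

```r
library(aphid)

data(globins)
globins_phmm <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)

# Write the model to a temporary text file in HMMER v3 format
hmm_file <- tempfile(fileext = ".hmm")
writePHMM(globins_phmm, file = hmm_file)

# Parse the file back into R as a "PHMM" object
globins_copy <- readPHMM(hmm_file)
class(globins_copy)

# Remove the temporary file
unlink(hmm_file)
```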

Visualization

plot.HMM plots a "HMM" object as a cyclic directed graph
plot.PHMM plots a "PHMM" object as a directed graph with sequential modules consisting of match, insert and delete states
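
The following sketch plots both model types, assuming aphid is installed. The casino HMM is assembled by hand from the transition and emission probabilities of Durbin et al (1998) chapter 3.2, and the null PDF device is used only so that the example runs without opening a window; remove the pdf() and dev.off() calls to draw on screen.

```r
library(aphid)

# Dishonest casino HMM, entered by hand
states <- c("Begin", "Fair", "Loaded")
residues <- paste(1:6)

# Transition probabilities (rows = from, columns = to)
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)

# Emission probabilities for the fair and loaded dice
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)

casino_hmm <- structure(list(A = A, E = E), class = "HMM")

# Profile HMM derived from the small globin alignment
data(globins)
globins_phmm <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)

pdf(file = NULL)  # discard graphics output (illustration only)
plot(casino_hmm, main = "Dishonest casino HMM")
plot(globins_phmm, main = "Profile hidden Markov model for globins")
invisible(dev.off())
```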

Model building and training

deriveHMM builds a "HMM" object from a list of training sequences
derivePHMM builds a "PHMM" object from a multiple sequence alignment or a list of non-aligned sequences
map optimizes profile hidden Markov model construction using the maximum a posteriori algorithm
train optimizes the parameters of a "HMM" or "PHMM" object using a list of training sequences
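
The two model-building functions listed above can be sketched side by side as follows, assuming aphid is installed. Depending on the settings used, the stored parameters may be on a log scale, so the matrix is printed here without interpreting its values.

```r
library(aphid)

# Standard HMM from a list of training sequences. Each sequence is a
# vector of residues whose names give the hidden state at each position.
data(casino)
casino_hmm <- deriveHMM(list(casino))
round(casino_hmm$A, 3)  # transition parameters (possibly log scale)

# Profile HMM from a multiple sequence alignment (a character matrix
# with one row per sequence), without sequence weighting
data(globins)
globins_phmm <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
globins_phmm
```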

Sequence alignment and weighting

align performs a multiple sequence alignment
weight assigns weights to sequences
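
A brief sketch of both functions, assuming aphid is installed. The two short protein sequences are those of the pairwise alignment example in Durbin et al (1998) chapter 2.3; the default scoring scheme is used, and type = "global" is passed on the assumption that align forwards it to the underlying alignment routine as described on its help page.

```r
library(aphid)

# Sequence weights for the rows of the small globin alignment
data(globins)
w <- weight(globins)
w

# Pairwise global alignment of two short protein sequences
x <- c("H", "E", "A", "G", "A", "W", "G", "H", "E", "E")
y <- c("P", "A", "W", "H", "E", "A", "E")
sequences <- list(x = x, y = y)
aln <- align(sequences, type = "global")
aln
```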

Conditional probabilities

Viterbi finds the optimal path of a sequence through an HMM or PHMM, and returns its log odds or probability given the model
forward finds the full probability of a sequence given an HMM or PHMM using the forward algorithm
backward finds the full probability of a sequence given an HMM or PHMM using the backward algorithm
posterior finds the position-specific posterior probability of a sequence given an HMM or PHMM
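
The four functions above can be compared on the dishonest casino data with a sketch like the one below, assuming aphid is installed. The model is derived from the same data it scores, which is done only for convenience. Because the forward and backward algorithms both sum over all paths they should return the same score, and that score should be no smaller than the score of the single best path found by Viterbi.

```r
library(aphid)

data(casino)
casino_hmm <- deriveHMM(list(casino))

vit <- Viterbi(casino_hmm, casino)    # best single path
fwd <- forward(casino_hmm, casino)    # sum over all paths
bck <- backward(casino_hmm, casino)   # same sum, computed in reverse

c(viterbi = vit$score, forward = fwd$score, backward = bck$score)

# Posterior state probabilities: one row per state, one column per roll
post <- posterior(casino_hmm, casino)
dim(post)
```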

Sequence simulation

generate.HMM simulates a random sequence from an HMM
generate.PHMM simulates a random sequence from a PHMM
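
Both methods are reached through the generate generic, as in this sketch (assuming aphid is installed). The seed and the maximum length of 20 are arbitrary values chosen for illustration.

```r
library(aphid)

set.seed(1)  # arbitrary seed so the simulation can be repeated

# Simulate up to 20 die rolls from a casino HMM; the names of the
# returned vector record the hidden state behind each roll
data(casino)
casino_hmm <- deriveHMM(list(casino))
rolls <- generate(casino_hmm, size = 20)
rolls

# Simulate an amino acid sequence of at most 20 residues from a globin PHMM
data(globins)
globins_phmm <- derivePHMM(globins, residues = "AMINO", seqweights = NULL)
sim <- generate(globins_phmm, size = 20)
sim
```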

Datasets

substitution a collection of DNA and amino acid substitution matrices from NCBI including the PAM, BLOSUM, GONNET, DAYHOFF and NUC matrices
casino data from the dishonest casino example of Durbin et al (1998) chapter 3.2
globins small globin alignment data from Durbin et al (1998) Figure 5.3
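
A quick way to load and inspect the bundled datasets, assuming aphid is installed:

```r
library(aphid)

data(casino)
data(globins)
data(substitution)

# casino: die rolls named by the die that produced them
length(casino)
table(names(casino))

# globins: a small alignment stored as a character matrix
dim(globins)

# substitution: a list of scoring matrices; show the first few names
head(names(substitution))
```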

Author(s)

Shaun Wilkinson

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Eddelbuettel D, Francois R (2011) Rcpp: seamless R and C++ integration. Journal of Statistical Software 40, 1-18.

Finn RD, Clements J, Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research 39, W29-W37. http://hmmer.org/.

HMMER: biosequence analysis using profile hidden Markov models. http://www.hmmer.org.

NCBI index of substitution matrices. ftp://ftp.ncbi.nih.gov/blast/matrices/.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.

[Package aphid version 1.3.5 Index]