aphid {aphid}R Documentation

The aphid package for analysis with profile hidden Markov models.

Description

aphid is an R package for the development and application of hidden Markov models and profile HMMs for biological sequence analysis. Functions are included for multiple and pairwise sequence alignment, model construction and parameter optimization, calculation of conditional probabilities (using the forward, backward and Viterbi algorithms), tree-based sequence weighting, sequence simulation, and file import/export compatible with the HMMER software package. The package has a wide variety of uses including database searching, gene-finding and annotation, phylogenetic analysis and sequence classification.

Details

The aphid package is based on the algorithms outlined in the book 'Biological sequence analysis: probabilistic models of proteins and nucleic acids' by Richard Durbin, Sean Eddy, Anders Krogh and Graeme Mitchison. This book is highly recommended for those wishing to develop a better understanding of HMMs and PHMMs, regardless of prior experience. Many of the examples in the function help pages are taken directly from the book, so that readers can learn to use the package as they work through the chapters.

There are also excellent rescources available for those wishing to use profile hidden Markov models outside of the R environment. The aphid package maintains compatibility with the HMMER software suite through the file input and output functions readPHMM and writePHMM. Those interested are further encouraged to check out the SAM software package, which also features a comprehensive suite of functions and tutorials.

The aphid package is designed to work in conjunction with the "DNAbin" and "AAbin" object types produced by the ape package (Paradis et al 2004, 2012). This is an essential piece of software for those using R for biological sequence analysis, and provides a binary coding format for nucleotides and amino acids that maximizes memory and speed efficiency. While aphid also works with standard character vectors and matrices, it may not recognize the DNA and amino acid amibguity codes and therefore is not guaranteed to treat them appropriately.

To maximize speed, the low-level dynamic programming functions such as Viterbi, forward and backward are written in C++ with the help of the Rcpp package (Eddelbuettel & Francois 2011). Note that R versions of these functions are also maintained for the purposes of debugging, experimentation and code interpretation.

Classes

The aphid package creates two primary object classes, "HMM" (hidden Markov models) and "PHMM" (profile hidden Markov models) with the functions deriveHMM and derivePHMM, respectively. These objects are lists consisting of emission and transition probability matrices (denoted E and A), vectors of non-position-specific background emission and transition probabilies (denoted qe and qa) and other model metadata. Objects of class "DPA" (dynammic programming array) are also generated by the Viterbi and forward/backward functions. These are primarily created for succinct console printing.

Functions

A breif description of the primary aphid functions are provided with links to their help pages below.

File import and export

Visualization

Model building and training

Sequence alignment and weighting

Conditional probabilities

Sequence simulation

Datasets

Author(s)

Shaun Wilkinson

References

Durbin R, Eddy SR, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, United Kingdom.

Eddelbuettel D, Francois R (2011) Rcpp: seamless R and C++ integration. Journal of Statistical Software 40, 1-18.

Finn RD, Clements J & Eddy SR (2011) HMMER web server: interactive sequence similarity searching. Nucleic Acids Research. 39, W29-W37. http://hmmer.org/.

HMMER: biosequence analysis using profile hidden Markov models. http://www.hmmer.org.

NCBI index of substitution matrices. ftp://ftp.ncbi.nih.gov/blast/matrices/.

Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.


[Package aphid version 1.3.3 Index]