R: Test of Purine-Pyrimidine Parity Based on Purine Count

ag.test {spgs}

R Documentation

Test of Purine-Pyrimidine Parity Based on Purine Count

Description

Performs a test proposed by Hart and Martínez (2011) for the equivalence of the relative frequencies of purines (A+G) and pyrimidines (C+T) in DNA sequences. It does this by checking whether or not the mononucleotide frequencies of a DNA sequence satisfy the relationship A+G=C+T.

Usage

ag.test(x, type=c("interval", "simplex"))

Arguments

x

either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.

type

Specifies which of two tests to perform. Both use the same test statistic but assume different forms of the Dirichlet distribution under the null. "simplex" assumes a Dirichlet(1,1,1,1) distribution on the 3-simplex, while "interval" assumes a Dirichlet(1,1) (uniform) distribution on the unit interval. The default is "interval".

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the nucleotides A, C, G and T.
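When only a raw sequence is at hand, the frequency-vector form of the first argument can be produced with base R. The following is a minimal sketch: `nucleotide_freqs` and `toy` are hypothetical names introduced here, the sequence is made up, and the A, C, G, T ordering is an assumption read off the argument description.

```r
# Hypothetical helper (not part of spgs): relative frequencies of a, c, g, t
# in a character vector holding one nucleotide per element.
nucleotide_freqs <- function(dna) {
  bases <- c("a", "c", "g", "t")
  dna <- tolower(dna)
  dna <- dna[dna %in% bases]                    # ignore any other symbols
  counts <- table(factor(dna, levels = bases))  # keeps zero counts, fixed order
  setNames(as.numeric(counts) / length(dna), bases)
}

toy <- c("a", "g", "g", "c", "t", "a", "a", "t")  # made-up sequence
freqs <- nucleotide_freqs(toy)
print(freqs)
# Assumed usage once spgs is loaded: ag.test(unname(freqs))
```

Passing the frequencies rather than the sequence can be convenient when the counts have already been tabulated elsewhere.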

Let A, C, G and T denote the relative frequencies of the nucleotide bases appearing in a DNA sequence. This function carries out a statistical hypothesis test of whether the relative frequencies satisfy the relation A+G=C+T, that is, whether purines {A, G} occur as often as pyrimidines {C, T} in the sequence. The relationship can be rewritten as A-T=C-G, which shows that the property being tested generalises Chargaff's second parity rule for mononucleotides, which states that A=T and C=G. The test is set up as follows:

H_0: A+G \neq C+T
H_1: A+G = C+T
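The rewriting of A+G=C+T as A-T=C-G mentioned above, and the sense in which the tested property generalises Chargaff's second parity rule, can be spelled out in two lines. This is only a worked restatement of the algebra, not additional theory from the reference.

```latex
\begin{align*}
A + G = C + T
  &\iff A - T = C - G
  && \text{(subtract $G + T$ from both sides)} \\
A = T \text{ and } C = G
  &\implies A - T = 0 = C - G
  && \text{(Chargaff's second parity rule is a special case)}
\end{align*}
```

The converse does not hold. As an illustrative, made-up example, (A, C, G, T) = (0.3, 0.3, 0.2, 0.2) gives A+G = C+T = 0.5 and A-T = C-G = 0.1, yet A differs from T and C differs from G.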

If type is set to "simplex", the vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under the null hypothesis. Otherwise, if type is set to "interval", it is assumed under the null hypothesis that (A+G,C+T) ~ Dirichlet(1,1) or, in other words, A+G and C+T are uniformly distributed on the unit interval and satisfy A+G+C+T=1.

In both cases, the test statistic is \eta_V^* = |A+G-0.5|.
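To make the statistic concrete, the sketch below computes |A+G-0.5| for a made-up frequency vector and evaluates how likely a value at least that small is under each null distribution. It rests on two assumptions that the text above does not state: that values of the statistic near zero are what count as evidence for H_1, and that the relevant probability is therefore P(statistic <= observed). Under "interval", A+G is uniform on (0,1); under "simplex", the sum of two components of a Dirichlet(1,1,1,1) vector follows a Beta(2,2) distribution. The names `purine_stat`, `tail_interval` and `tail_simplex` are hypothetical, and this is not presented as the package's implementation.

```r
# Hypothetical helper: test statistic from frequencies given in A, C, G, T order.
purine_stat <- function(freqs) abs(freqs[1] + freqs[3] - 0.5)

freqs <- c(0.375, 0.125, 0.25, 0.25)   # made-up (A, C, G, T) frequencies
eta <- purine_stat(freqs)

# Assumed tail: probability that the statistic is at most eta under each null.
# "interval": A+G ~ Uniform(0, 1), so P(|U - 0.5| <= eta) = 2 * eta.
tail_interval <- 2 * eta
# "simplex": A+G ~ Beta(2, 2), the sum of two Dirichlet(1,1,1,1) components.
tail_simplex <- pbeta(0.5 + eta, 2, 2) - pbeta(0.5 - eta, 2, 2)

print(c(statistic = eta, interval = tail_interval, simplex = tail_simplex))
```

Because the Beta(2,2) density is concentrated around 0.5, the "simplex" null assigns more probability to small values of the statistic than the "interval" null does, so under these assumptions the same observed statistic yields a larger tail probability with "simplex".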

Value

A list with class "htest.ext" containing the following components:

`statistic`	the value of the test statistic.
`p.value`	the p-value of the test. Only included if `no.p.value` is `FALSE`.
`method`	a character string indicating what type of test was performed.
`data.name`	a character string giving the name of the data.
`estimate`	the probability vector used to derive the test statistic.
`stat.desc`	a brief description of the test statistic.
`null`	the null hypothesis (`H_0`) of the test.
`alternative`	the alternative hypothesis (`H_1`) of the test.
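The components listed above can be read off the returned list with `$`. The following sketch assumes the spgs package is installed and reuses the `pieris` data set from the Examples section; the `requireNamespace` guard and the variable name `res` are additions made here so that the snippet does nothing when the package is unavailable.

```r
res <- NULL
if (requireNamespace("spgs", quietly = TRUE)) {
  data(pieris, package = "spgs")     # example sequence shipped with spgs
  res <- spgs::ag.test(pieris)       # default type = "interval"
  print(res$statistic)               # value of the test statistic
  print(res$estimate)                # probability vector behind the statistic
  print(res$method)                  # description of the test performed
}
```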

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

Examples

# Demonstration on a real viral sequence
data(pieris)
ag.test(pieris)
ag.test(pieris, type="simplex")

# Simulate a synthetic DNA sequence that does not exhibit purine-pyrimidine parity
# (each row of the transition matrix sums to 1)
trans.mat <- matrix(c(.4, .1, .4, .1,
                      .2, .1, .6, .1,
                      .4, .1, .3, .2,
                      .1, .2, .4, .3),
                    ncol=4, byrow=TRUE)
# sim.seq avoids masking the base R function seq()
sim.seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
ag.test(sim.seq)

[Package spgs version 1.0-4 Index]