R: Test of Purine-Pyrimidine Parity Based on Euclidean distance

agct.test {spgs}

R Documentation

Test of Purine-Pyrimidine Parity Based on Euclidean distance

Description

Performs a test proposed by Hart and Martínez (2011) for the equivalence of the relative frequencies of purines (A+G) and pyrimidines (C+T) in DNA sequences. It does this by checking whether or not the mononucleotide frequencies of a DNA sequence satisfy the relationship A+G=C+T.

Usage

agct.test(x, alg=c("exact", "simulate", "lower", "Lower", "upper"), n)

Arguments

`x`	either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.
`alg`	the algorithm for computing the p-value. If set to "simulate", the p-value is obtained via Monte Carlo simulation. If set to "lower", an analytic lower bound on the p-value is computed. If set to "upper", an analytic upper bound on the p-value is computed. "lower" and "upper" are based on formulae in Hart and Martínez (2011). A tighter (though unpublished) lower bound on the p-value may be obtained by specifying "Lower". If `alg` is specified as "exact" (the default value), the p-value for the test is computed exactly.
`n`	The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the A, C, G and T nucleotides.

Let A, C, G and T denote the relative frequencies of the nucleotide bases appearing in a DNA sequence. This function carries out a statistical hypothesis test that the relative frequencies satisfy the relation A+G=C+T, that is, that purines \{A, G\} occur as often as pyrimidines \{C, T\} in a DNA sequence. The relationship can be rewritten as A-T=C-G, from which it is easy to see that the property being tested is a generalisation of Chargaff's second parity rule for mononucleotides, which states that A=T and C=G. The test is set up as follows:

H_0: A+G \neq C+T
H_1: A+G = C+T
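
The relation under test can be written in several equivalent forms. The following lines spell out the rearrangement mentioned above and use the fact that relative frequencies sum to one; they are a restatement for convenience, not an additional assumption of the test:

```latex
A + G = C + T \iff A - T = C - G \iff A + G = \tfrac{1}{2}
\quad \text{(since } A + C + G + T = 1\text{)},
\qquad
A = T \text{ and } C = G \implies A - T = 0 = C - G .
```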

The vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1) distribution on the 3-simplex under the null hypothesis.
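
A Dirichlet(1,1,1,1) distribution is the uniform distribution on the 3-simplex. As a minimal sketch of what sampling under the null hypothesis looks like, the base R code below draws such vectors by normalising independent exponential variates. The helper name `rdirichlet1` is hypothetical and is not part of spgs.

```r
# Sketch: draw n vectors from Dirichlet(1,1,1,1), i.e. uniformly on the 3-simplex.
# rdirichlet1 is a hypothetical helper for illustration, not an spgs function.
rdirichlet1 <- function(n, k = 4) {
  e <- matrix(rexp(n * k), nrow = n, ncol = k)  # independent Exp(1) draws
  e / rowSums(e)                                # scale each row to sum to 1
}

set.seed(1)
p <- rdirichlet1(100000)
colnames(p) <- c("A", "C", "G", "T")

# By symmetry, purines outweigh pyrimidines in roughly half of the draws.
purine.share <- mean(p[, "A"] + p[, "G"] > p[, "C"] + p[, "T"])
purine.share
```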

The test statistic \eta_V is the Euclidean distance from the relative frequency vector (A,C,G,T) to the closest point in the square set \theta_V=\{(x,y,1/2-x,1/2-y) : 0 <= x,y <= 1/2\}, which divides the 3-simplex into two equal parts. \eta_V lies in the range [0,\sqrt{3/8}].
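
The distance to \theta_V separates into two one-dimensional problems, one in the (A, G) coordinates and one in the (C, T) coordinates, so the closest point can be found by clipping the unconstrained minimisers to [0, 1/2]. The sketch below computes \eta_V this way in base R and estimates a Monte Carlo p-value. The names `eta_v` and `mc_pvalue` are hypothetical and this is not the spgs implementation. It also assumes that small values of \eta_V count as evidence against H_0, so that the p-value is the null probability of a distance no larger than the one observed.

```r
# Sketch only: eta_v and mc_pvalue are hypothetical helpers, not spgs functions.

# Euclidean distance from p = (A, C, G, T) to the nearest point
# (x, y, 1/2 - x, 1/2 - y) with 0 <= x, y <= 1/2.
eta_v <- function(p) {
  p <- unname(p)
  stopifnot(length(p) == 4, abs(sum(p) - 1) < 1e-8)
  clip <- function(z) min(max(z, 0), 0.5)
  x <- clip((p[1] - p[3] + 0.5) / 2)  # minimises (A - x)^2 + (G - (1/2 - x))^2
  y <- clip((p[2] - p[4] + 0.5) / 2)  # minimises (C - y)^2 + (T - (1/2 - y))^2
  nearest <- c(x, y, 0.5 - x, 0.5 - y)
  sqrt(sum((p - nearest)^2))
}

# Monte Carlo p-value under Dirichlet(1,1,1,1).
# ASSUMPTION: the p-value is P(eta_V <= observed value) under H_0.
mc_pvalue <- function(p, n = 10000) {
  obs <- eta_v(p)
  e <- matrix(rexp(4 * n), ncol = 4)       # Exp(1) draws, one row per replicate
  sims <- apply(e / rowSums(e), 1, eta_v)  # statistic for each simulated vector
  mean(sims <= obs)
}

eta_v(c(1, 0, 0, 0))          # a vertex of the simplex: the maximum, sqrt(3/8)
eta_v(c(0.3, 0.2, 0.2, 0.3))  # A + G = C + T, so the distance is 0
```

With a frequency vector such as c(0.30, 0.20, 0.21, 0.29), `mc_pvalue` returns the proportion of simulated vectors that lie at least as close to \theta_V as the observed one.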

Value

A list with class "htest.ext" containing the following components:

`statistic`	the value of the test statistic.
`p.value`	the p-value of the test.
`method`	a character string indicating what type of test was performed.
`data.name`	a character string giving the name of the data.
`estimate`	the probability vector used to derive the test statistic.
`stat.desc`	a brief description of the test statistic.
`null`	the null hypothesis (`H_0`) of the test.
`alternative`	the alternative hypothesis (`H_1`) of the test.
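
As a brief illustration of how the returned components might be accessed, the following sketch stores the result of the test on the pieris data set shipped with spgs and extracts a few of the fields listed above. It assumes the spgs package is installed.

```r
library(spgs)   # provides agct.test and the pieris data set

data(pieris)
res <- agct.test(pieris)

res$statistic   # value of the test statistic
res$p.value     # p-value of the test
res$estimate    # probability vector used to derive the test statistic
```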

Note

agct.test(x, alg="upper") is equivalent to ag.test(x, alg="simplex") except that the p-value computed using the formula for alg="upper" is exact for the test statistic \eta_V^* used in ag.test, whereas it is merely an upper bound on the p-value for \eta_V.

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

Examples

#Demonstration on real viral sequence
data(pieris)
agct.test(pieris)

#Simulate synthetic DNA sequence that does not exhibit Purine-Pyrimidine parity
trans.mat <- matrix(c(.4, .1, .4, .1, .2, .1, .6, .1, .4, .1, .3, .2, .1, .2, .4, .3), 
ncol=4, byrow=TRUE)
seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
agct.test(seq)

[Package spgs version 1.0-4]