agct.test {spgs} | R Documentation |
Test of Purine-Pyrimidine Parity Based on Euclidean Distance
Description
Performs a test proposed by Hart and Martínez (2011) for the equivalence of the
relative frequencies of purines (A+G) and pyrimidines (C+T) in DNA
sequences. It does this by checking whether or not the mononucleotide
frequencies of a DNA sequence satisfy the relationship A+G=C+T.
Usage
agct.test(x, alg=c("exact", "simulate", "lower", "Lower", "upper"), n)
Arguments
x |
either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package. |
alg |
the algorithm for computing the p-value. If set to "simulate", the p-value is obtained via Monte Carlo simulation. If set to "lower", an analytic lower bound on the p-value is computed. If set to "upper", an analytic upper bound on the p-value is computed. "lower" and "upper" are based on formulae in Hart and Martínez (2011). A tighter (though unpublished) lower bound on the p-value may be obtained by specifying "Lower". If alg is specified as "exact" (the default value), the p-value for the test is computed exactly. |
n |
The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended. |
Details
The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the nucleotides A, C, G and T.
Let A, C, G and T denote the relative frequencies of the nucleotide bases
appearing in a DNA sequence. This function carries out a statistical hypothesis
test that the relative frequencies satisfy the relation A+G=C+T, or that
purines {A, G} occur as often as pyrimidines {C, T} in a DNA sequence.
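As a minimal base-R sketch of the quantities involved (not the package's internal code), the relative frequencies and the two sides of the relation A+G=C+T can be computed from a character vector as follows. The helper name `rel_freq` and the toy sequence are assumptions made purely for illustration.

```r
# Hypothetical helper (not part of spgs): relative frequencies of a, c, g, t
# in a character vector holding one nucleotide per element.
rel_freq <- function(dna) {
  dna <- tolower(dna)
  counts <- table(factor(dna, levels = c("a", "c", "g", "t")))
  # Divide by the number of recognised nucleotides so the result sums to 1.
  freqs <- as.numeric(counts) / sum(counts)
  names(freqs) <- c("a", "c", "g", "t")
  freqs
}

# Toy sequence, chosen only for illustration.
toy <- c("a", "g", "c", "t", "a", "c", "g", "t")
f <- rel_freq(toy)

purines <- f[["a"]] + f[["g"]]      # A + G
pyrimidines <- f[["c"]] + f[["t"]]  # C + T
```

For this toy input every nucleotide appears twice, so both totals equal 1/2 and the relation holds exactly. Real sequences will generally only satisfy it approximately, which is why a formal test is needed.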
The relationship can be rewritten as A-T=C-G, from which it is easy to see
that the property being tested is a generalisation of Chargaff's second parity
rule for mononucleotides, which states that A=T and C=G. The test is
set up as follows:
H_0: A+G \neq C+T
H_1: A+G = C+T
The vector (A,C,G,T) is assumed to come from a Dirichlet(1,1,1,1)
distribution on the 3-simplex under the null hypothesis.
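To make the null model concrete, the sketch below draws vectors from a Dirichlet(1,1,1,1) distribution in base R using the standard construction of normalising independent Exponential(1) variables. The function name `rdirichlet_unif4` is hypothetical and is not part of spgs.

```r
# Hypothetical helper: draw n vectors from Dirichlet(1,1,1,1) by normalising
# independent Exponential(1) variables; each row is one (A, C, G, T) vector.
rdirichlet_unif4 <- function(n) {
  e <- matrix(rexp(4 * n), ncol = 4)
  p <- e / rowSums(e)  # divides every row by its own sum
  colnames(p) <- c("A", "C", "G", "T")
  p
}

set.seed(1)
draws <- rdirichlet_unif4(10000)

# Fraction of draws in which purines outnumber pyrimidines.
frac_purine_rich <- mean(draws[, "A"] + draws[, "G"] >
                         draws[, "C"] + draws[, "T"])
```

Because the distribution is symmetric in its four components, `frac_purine_rich` should be close to one half, while an exact equality A+G=C+T occurs with probability zero under this model.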
The test statistic \eta_V is the Euclidean distance from the
relative frequency vector (A,C,G,T) to the closest point in the square set
\theta_V=\{(x,y,1/2-x,1/2-y) : 0 <= x,y <= 1/2\}, which divides the 3-simplex into two equal parts.
\eta_V lies in the range [0,\sqrt{3/8}].
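The distance described above can be sketched in base R. This is an illustrative reconstruction from the definition of \theta_V, not the spgs implementation: the names `eta_v`, `obs` and `p_hat` are hypothetical, and treating small distances as evidence against H_0 (so that the simulated p-value is a lower-tail proportion) is an assumption of this sketch.

```r
# Hypothetical helper: Euclidean distance from p = (A, C, G, T) to the closest
# point (x, y, 1/2 - x, 1/2 - y) with 0 <= x, y <= 1/2.
eta_v <- function(p) {
  p <- as.numeric(p)
  clamp <- function(v) min(max(v, 0), 0.5)
  # The squared distance splits into an x-part (A and G) and a y-part (C and T),
  # so each coordinate is minimised separately and then clamped to [0, 1/2].
  x <- clamp((p[1] - p[3] + 0.5) / 2)
  y <- clamp((p[2] - p[4] + 0.5) / 2)
  nearest <- c(x, y, 0.5 - x, 0.5 - y)
  sqrt(sum((p - nearest)^2))
}

eta_v(c(0.25, 0.25, 0.25, 0.25))  # point already in the set: distance 0
eta_v(c(1, 0, 0, 0))              # vertex of the simplex: sqrt(3/8)

# Monte Carlo sketch of a p-value under the Dirichlet(1,1,1,1) null.
# Assumption: small distances count as evidence against H_0.
set.seed(1)
n_rep <- 10000
e <- matrix(rexp(4 * n_rep), ncol = 4)
null_draws <- e / rowSums(e)            # rows are Dirichlet(1,1,1,1) draws
null_stats <- apply(null_draws, 1, eta_v)

obs <- c(0.30, 0.20, 0.21, 0.29)        # made-up relative frequencies
p_hat <- mean(null_stats <= eta_v(obs))
```

The two deterministic calls recover the endpoints of the stated range, since a vertex of the simplex is as far from the square set as any point can be. The simulated proportion `p_hat` is only as precise as `n_rep` allows, which is the reason a large number of replications is advisable.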
Value
A list with class "htest.ext" containing the following components:
statistic |
the value of the test statistic. |
p.value |
the p-value of the test. |
method |
a character string indicating what type of test was performed. |
data.name |
a character string giving the name of the data. |
estimate |
the probability vector used to derive the test statistic. |
stat.desc |
a brief description of the test statistic. |
null |
the null hypothesis (H_0). |
alternative |
the alternative hypothesis (H_1). |
Note
agct.test(x, alg="upper") is equivalent to ag.test(x, alg="simplex"),
except that the p-value computed using the formula for
alg="upper" is exact for the test statistic \eta_V^* used in ag.test,
whereas it is merely an upper bound on the p-value for \eta_V.
Author(s)
Andrew Hart and Servet Martínez
References
Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.
See Also
chargaff0.test, chargaff1.test, chargaff2.test, ag.test, chargaff.gibbs.test
Examples
# Load the package that provides agct.test, pieris and simulateMarkovChain
library(spgs)
# Demonstration on a real viral sequence
data(pieris)
agct.test(pieris)
# Simulate a synthetic DNA sequence that does not exhibit purine-pyrimidine parity
trans.mat <- matrix(c(.4, .1, .4, .1,
                      .2, .1, .6, .1,
                      .4, .1, .3, .2,
                      .1, .2, .4, .3),
                    ncol=4, byrow=TRUE)
seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
agct.test(seq)