R: Matrix Test of CSPR for Mononucleotides

chargaff1.test {spgs}

R Documentation

Matrix Test of CSPR for Mononucleotides

Description

Performs a test of Chargaff's second parity rule (CSPR) for mononucleotides on a genomic sequence using a 4X4 stochastic matrix estimated from the sequence. The test was proposed by Hart and Martínez (2011).

Usage

chargaff1.test(x, alg=c("table", "simulate", "upper"), n, no.p.value=FALSE)

Arguments

`x`	either a vector containing the relative frequencies of each of the 4 nucleotides A, C, G, T, a character vector representing a DNA sequence in which each element contains a single nucleotide, or a DNA sequence stored using the SeqFastadna class from the seqinr package.
`alg`	the algorithm for computing the p-value. If set to “‘⁠simulate⁠’”, the p-value is obtained via Monte Carlo simulation. If set to “‘⁠upper⁠’”, an analytic upper bound on the p-value is computed. “‘⁠upper⁠’” are based on formulae in Hart and Martínez (2011). If ‘⁠type⁠’ is specified as “‘⁠table⁠’” (the default value),the p-value for the test is obtained from a linear interpolation of a look-up table. See the note below for further details.
`n`	The number of replications to use for Monte Carlo simulation. If computationally feasible, a value >= 10000000 is recommended.
`no.p.value`	If ‘⁠TRUE⁠’, do not compute the p-value. The default is ‘⁠FALSE⁠’.

Details

The first argument may be a character vector representing a DNA sequence, a DNA sequence represented using the SeqFastadna class from the seqinr package, or a vector containing the relative frequencies of the A, C, G and T nucleic acids.

This function performs a test of Chargaff's second parity rule for mononucleotides based on a 4X4 stochastic matrix P estimated from the empirical dinucleotide distribution of a genomic sequence . The a,b) entry of P gives the empirical probability (relative frequency) that a nucleotide a is followed by a nucleotide b in the sequence. The test is set up as follows:

H_0: the sequence (or matrix P) does not comply with CSPR for mononucleotides
H_1: the sequence (or matrix P) complies with CSPR for mononucleotides

Value

A list with class "htest.ext" containing the following components:

`statistic`	the value of the test statistic.
`p.value`	the p-value of the test. Only included if no.p.value is FALSE.
`method`	a character string indicating what type of test was performed.
`data.name`	a character string giving the name of the data.
`f`	the 2-element vector used in calculating the test statistic.
`estimate`	the stochastic matrix `P` used to derive the test statistic.
`stat.desc`	a brief description of the test statistic.
`null`	the null hypothesis (`H_0`) of the test.
`alternative`	the alternative hypothesis (`H_1`) of the test.

Note

Currently, the look-up table that is employed when ‘⁠alg⁠’ is set to “‘⁠table⁠’” does not provide an accurate p-value when the statistic is smaller than 0.00806. Care should be taken when adjusting p-values for multiple testing.

The algebraically computed upper bound on the p-value obtained when ‘⁠alg⁠’ is set to “‘⁠upper⁠’” is not very tight and not suitable for real- world applications.

‘⁠no.p.value⁠’ suppresses computation of the p-value when it is set to ‘⁠TRUE⁠’. This may be useful wen using this function to help simulate the test statistic.

Author(s)

Andrew Hart and Servet Martínez

References

Hart, A.G. and Martínez, S. (2011) Statistical testing of Chargaff's second parity rule in bacterial genome sequences. Stoch. Models 27(2), 1–46.

Examples

#Demonstration on real bacterial sequence
data(nanoarchaeum)
chargaff1.test(nanoarchaeum)

#Simulate synthetic DNA sequence that does not satisfy Chargaff's second parity rule
trans.mat <- matrix(c(.4, .1, .4, .1, .2, .1, .6, .1, .4, .1, .3, .2, .1, .2, .4, .3), 
ncol=4, byrow=TRUE)
seq <- simulateMarkovChain(500000, trans.mat, states=c("a", "c", "g", "t"))
chargaff1.test(seq)

[Package spgs version 1.0-4 Index]