
Kuiper2sample {KSgeneral}

R Documentation

Computes the p-value for a two-sample Kuiper test, given arbitrary data samples on the real line or on the circle with possibly repeated observations (i.e. ties)

Description

Computes the p-value P(V_{m,n} \geq q), where V_{m,n} is the two-sample Kuiper test statistic and q = v is the observed value of the Kuiper statistic, computed from two data samples \{x_{1},..., x_{m}\} and \{y_{1},..., y_{n}\} that may come from a continuous, discrete or mixed distribution, i.e. they may contain repeated observations (ties).

Usage

Kuiper2sample(x, y, conservative = F, tail = T)

Arguments

`x`	a numeric vector of data sample values `\{x_{1}, ..., x_{m}\}`
`y`	a numeric vector of data sample values `\{y_{1}, ..., y_{n}\}`
`conservative`	logical variable indicating whether the test should be conducted conservatively, i.e. whether the p-value should be computed as if no ties were present. See ‘Details’ for the meaning.
`tail`	logical variable indicating whether a p-value, `P(V_{m,n} \ge q)` or one minus the p-value, `P(V_{m,n} < q)`, should be computed. By default, the p-value `P(V_{m,n} \ge q)` is computed. See ‘Details’ for the meaning.

Details

Consider a pair of random samples, either on the real line or on the circle, denoted by \bm{X}_m=(X_{1},..., X_{m}) and \bm{Y}_n=(Y_{1},..., Y_{n}), of sizes m and n, with empirical cdfs F_{m}(t) and G_{n}(t) respectively, coming from some unknown cdfs F(x) and G(x). It is assumed that F(x) and G(x) may each be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, against the alternative hypothesis H_1: F(x)\neq G(x) for at least one x. The two-sample Kuiper goodness-of-fit statistic used to test this hypothesis is defined as:

\varsigma_{m,n} = \sup [F_{m}(t) - G_n(t)] - \inf [F_{m}(t) - G_n(t)].
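The definition above can be illustrated with a minimal base R sketch. The helper name kuiper_stat is hypothetical and is not part of KSgeneral; the sketch assumes numeric samples without missing values and is not the implementation used by Kuiper2sample.

```r
# Hypothetical helper (not part of KSgeneral): two-sample Kuiper statistic.
kuiper_stat <- function(x, y) {
  # Both edfs are step functions that jump only at pooled data points,
  # so the sup and inf of their difference are attained on this grid.
  grid <- sort(unique(c(x, y)))
  d <- ecdf(x)(grid) - ecdf(y)(grid)
  # At the largest grid point both edfs equal 1, so d includes the value 0
  # that the difference takes outside the range of the data.
  max(d) - min(d)
}

kuiper_stat(c(1, 2, 3), c(2, 3, 4))
```

Evaluating the difference only at the distinct pooled values is sufficient here because both edfs are right-continuous step functions with jumps at those values only.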

For a particular realization of the pooled sample \bm{Z}_{m,n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}), let there be k distinct values, a_1<a_2<...<a_k, in the ordered, pooled sample (z_1\leq z_2\leq \ldots \leq z_{m+n}), where k\leq m+n, and where m_i is the number of times a_i, i=1,\ldots,k appears in the pooled sample. The p-value is then defined as the probability
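As a small illustration of this notation, the distinct values a_i and their multiplicities m_i can be obtained from the sorted pooled sample in base R. The data and the variable names below are made up for illustration only.

```r
# Illustrative data with ties (not taken from the package).
x <- c(1, 2, 2, 3)
y <- c(2, 3, 3, 5)

z <- sort(c(x, y))     # ordered pooled sample z_1 <= ... <= z_{m+n}
runs <- rle(z)         # run lengths of equal consecutive values
a <- runs$values       # distinct values a_1 < ... < a_k
m_i <- runs$lengths    # number of times each a_i appears
k <- length(a)         # number of distinct values, k <= m + n
```

The multiplicities always sum to m + n, and k = m + n exactly when there are no ties.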

p=P\left(V_{m,n}\geq q\right),

where V_{m,n} is the two-sample Kuiper test statistic defined as \varsigma_{m,n}, for two samples \bm{X}'_m and \bm{Y}'_n of sizes m and n, randomly drawn from the pooled sample without replacement, and q = v is the observed value of the statistic calculated from the user-provided data samples x and y. By default (tail = T), the p-value p is returned; otherwise 1-p is returned.

Note that V_{m,n} is defined on the space \Omega of all C = \frac{(m+n)!}{m!n!} possible pairs of edfs F_m(x,\omega) and G_n(x,\omega), \omega \in \Omega, that correspond to the pairs of samples \bm{X}'_m and \bm{Y}'_n randomly drawn from \bm{Z}_{m,n} as follows. First, m observations are drawn at random without replacement, forming the first sample \bm{X}'_m, with corresponding edf F_m(x,\omega). The remaining n observations are then assigned to the second sample \bm{Y}'_n, with corresponding edf G_n(x,\omega). The observations are then returned to \bm{Z}_{m,n} and re-sampling continues until all C possible pairs of edfs F_m(x,\omega) and G_n(x,\omega), \omega \in \Omega, have occurred. Some pairs of edfs may coincide if there are ties in the data, and each pair F_m(x,\omega) and G_n(x,\omega) occurs with probability 1/C.
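For very small samples, the conditional probability described above can be checked by brute force, enumerating all C splits of the pooled sample. The sketch below is only an illustration of the definition: the function names kuiper_stat and brute_force_pvalue and the tolerance tol are assumptions, and this is not the algorithm implemented in Kuiper2sample.

```r
# Hypothetical helper: two-sample Kuiper statistic on the pooled grid.
kuiper_stat <- function(x, y) {
  grid <- sort(unique(c(x, y)))
  d <- ecdf(x)(grid) - ecdf(y)(grid)
  max(d) - min(d)
}

# Hypothetical helper: exact p-value by enumerating all C = choose(m + n, m)
# ways of assigning m pooled observations to the first sample.
brute_force_pvalue <- function(x, y, tol = 1e-10) {
  m <- length(x)
  z <- c(x, y)
  q <- kuiper_stat(x, y)                 # observed statistic
  idx <- combn(length(z), m)             # one column per split
  v <- apply(idx, 2, function(i) kuiper_stat(z[i], z[-i]))
  # Each split has probability 1/C; tol guards against rounding noise
  # when comparing floating-point statistics.
  mean(v >= q - tol)
}

brute_force_pvalue(c(1, 2, 2), c(2, 3, 3))
```

The number of splits grows combinatorially with m and n, so this enumeration is feasible only for a handful of observations.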

conservative is a logical variable indicating whether the test should be conducted conservatively. By default (conservative = F), Kuiper2sample returns the p-value defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from a continuous distribution even though ties are present (for example, when repeated observations are caused by rounding errors), conservative = T should be used, since the conditional probability is no longer relevant. In this case, Kuiper2sample computes the p-value for the Kuiper test assuming that no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it would also be rejected using the true p-value.

Kuiper2sample calculates the exact p-value of the Kuiper test using an algorithm from Dimitrova, Jia, Kaishev (2024), which extends the algorithm provided by Nikiforov (1994) and generalizes the method due to Maag and Stephens (1968) and Hirakawa (1973). If tail = F, Kuiper2sample calculates the complementary p-value 1-p; for this purpose, an exact algorithm which generalizes the method due to Nikiforov (1994) is implemented. Alternatively, if tail = T, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes the p-value directly, with up to 4 digits of extra accuracy, but at up to 3 times higher computational cost. The algorithm is accurate and valid for arbitrary (possibly large) sample sizes, and has a total worst-case run-time of order O((mn)^{2}). When m and n have a large greatest common divisor (an extreme case is m = n), the total worst-case run-time is of order O((m)^{2}n).

Kuiper2sample is accurate and fast compared with an approach based on Monte Carlo simulation. Compared with implementations based on the asymptotic method, Kuiper2sample allows the data samples to come from a continuous, discrete or mixed distribution (i.e. ties may appear), and is more accurate when sample sizes are small.
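For comparison, a Monte Carlo approximation of the same conditional p-value can be sketched in base R by randomly re-splitting the pooled sample. The names kuiper_stat and mc_pvalue, the number of replications B and the tolerance tol are assumptions made for this illustration; the sketch is not the Monte Carlo function referred to above and is not part of KSgeneral.

```r
# Hypothetical helper: two-sample Kuiper statistic on the pooled grid.
kuiper_stat <- function(x, y) {
  grid <- sort(unique(c(x, y)))
  d <- ecdf(x)(grid) - ecdf(y)(grid)
  max(d) - min(d)
}

# Hypothetical helper: Monte Carlo estimate of P(V_{m,n} >= q) based on
# B random splits of the pooled sample into samples of sizes m and n.
mc_pvalue <- function(x, y, B = 2000, tol = 1e-10) {
  m <- length(x)
  z <- c(x, y)
  q <- kuiper_stat(x, y)                   # observed statistic
  v <- replicate(B, {
    i <- sample.int(length(z), m)          # draw m positions without replacement
    kuiper_stat(z[i], z[-i])
  })
  mean(v >= q - tol)                       # proportion of splits at least as extreme
}

set.seed(1)
mc_pvalue(c(1, 2, 2, 4, 5), c(2, 3, 3, 6, 7))
```

The estimate carries simulation error of order 1/sqrt(B), which is one reason an exact calculation is preferable when it is computationally feasible.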

Value

A list with class "htest" containing the following components:

`statistic`	the value of the test statistic `v`.
`p.value`	the p-value of the test.
`alternative`	a character string describing the alternative hypothesis.
`data.name`	a character string giving names of the data.

References

Maag, U. R., Stephens, M. A. (1968). The V_{NM} Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.

Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.

Nikiforov, A. M. (1994). Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.

Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.

Dimitrova, D. S., Jia, Y., Kaishev, V. K. (2024). The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests. Submitted.

Examples

## Compute the p-value for two samples of discrete circular data
data1 <- c(rep(pi/2,30),rep(pi,30),rep(3*pi/2,30),rep(2*pi,30))
data2 <- c(rep(pi/2,50),rep(pi,40),rep(3*pi/2,10),rep(2*pi,50))
Kuiper2sample(data1, data2)

## The calculated p-value does not change with the choice of origin on the circle
## (data3 equals data1; data4 is data2 rotated by pi/2)
data3 <- c(rep(pi/2,30),rep(pi,30),rep(3*pi/2,30),rep(2*pi,30))
data4 <- c(rep(pi/2,50),rep(pi,50),rep(3*pi/2,40),rep(2*pi,10))
Kuiper2sample(data3, data4)

[Package KSgeneral version 2.0.2]