Kuiper2sample_Rcpp {KSgeneral}    R Documentation

R function calling the C++ routines that compute the p-value for an (unweighted) two-sample Kuiper test, given arbitrary data samples on the real line or on the circle with possibly repeated observations (i.e. ties)

Description

Function that directly calls the C++ routines computing the exact p-value P(V_{m,n} \ge q) for the two-sample Kuiper test at a fixed q\in [0,2], given the sample sizes m, n and the vector M containing the number of times each distinct observation is repeated in the pooled sample.

Usage

Kuiper2sample_Rcpp(m, n, M, q)

Arguments

m

the size of the first tested sample.

n

the size of the second tested sample.

M

an integer-valued vector with k cells, where k denotes the number of distinct values in the ordered pooled sample of the tested pair of samples (i.e. a_1<a_2<\ldots<a_k). M[i] is the number of times that a_i is repeated in the pooled sample. A valid M must contain strictly positive integers whose sum equals m+n.

q

numeric value between 0 and 2, at which the p-value P(V_{m,n}\ge q) is computed.
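To illustrate how M relates to the data, the following minimal sketch builds M from two observed samples using base R only. The helper name make_M and the sample values x and y are hypothetical and are not part of KSgeneral.

```r
# Hypothetical helper (not part of KSgeneral): count how often each
# distinct value a_1 < a_2 < ... < a_k occurs in the pooled sample.
make_M <- function(x, y) {
  pooled <- c(x, y)
  # table() lists the distinct values in increasing order,
  # which matches the ordering a_1 < ... < a_k
  M <- as.integer(table(pooled))
  # Validity conditions stated for M: strictly positive, summing to m + n
  stopifnot(all(M > 0), sum(M) == length(x) + length(y))
  M
}

x <- c(1, 2, 2, 3)   # illustrative first sample, m = 4
y <- c(2, 3, 4)      # illustrative second sample, n = 3
M <- make_M(x, y)    # counts of the distinct values 1, 2, 3, 4
```

The resulting M, together with m = length(x) and n = length(y), has the shape expected by the arguments described above.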

Details

Consider a pair of random samples, either on the real line or on the circle, denoted by \bm{X}_m=(X_{1},..., X_{m}) and \bm{Y}_n=(Y_{1},..., Y_{n}), of sizes m and n, with empirical cdfs F_{m}(t) and G_{n}(t) respectively, drawn from some unknown cdfs F(x) and G(x). F(x) and G(x) may be continuous, discrete or mixed, so repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, against the alternative hypothesis H_1: F(x)\neq G(x) for at least one x. The two-sample Kuiper goodness-of-fit statistic that is used to test this hypothesis is defined as:

\varsigma_{m,n} = \sup [F_{m}(t) - G_n(t)] - \inf [F_{m}(t) - G_n(t)].
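The statistic above can be sketched in base R as follows. This is an illustrative helper under the stated definition, not the package's C++ routine; the name kuiper_stat and the sample values are hypothetical.

```r
# Hypothetical helper (not part of KSgeneral): two-sample Kuiper statistic.
kuiper_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))                  # distinct pooled values
  d <- stats::ecdf(x)(z) - stats::ecdf(y)(z)  # F_m(t) - G_n(t) at each jump point
  # The difference is a step function equal to 0 to the left of the
  # smallest value and at the largest value, so taking max and min
  # over z yields the sup and inf over all t.
  max(d) - min(d)
}

x <- c(1, 2, 2, 3)      # illustrative first sample
y <- c(2, 3, 4)         # illustrative second sample
v <- kuiper_stat(x, y)  # 5/12 for these values
```

Evaluating the difference only at the pooled distinct values is sufficient because both empirical cdfs are constant between consecutive observations.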

The integer-valued vector M specifies how many times each distinct value is repeated in the pooled sample. For a particular realization of the pooled sample \bm{Z}_{m,n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}), let there be k distinct values, a_1<a_2<...<a_k, in the ordered, pooled sample (z_1\leq z_2\leq \ldots \leq z_{m+n}), where k\leq m+n, and where m_i = M[i] is the number of times a_i, i=1,\ldots,k, appears in the pooled sample. The p-value is then defined as the probability

P\left(V_{m,n}\geq q\right),

where V_{m,n} is the two-sample Kuiper test statistic defined as \varsigma_{m,n}, for two samples \bm{X}'_m and \bm{Y}'_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. V_{m,n} is defined on the space \Omega (see further details in Kuiper2sample), and q\in [0,2].
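For very small samples, this definition can be checked by brute force. The sketch below enumerates every way of drawing m observations from the pooled sample without replacement and assumes each split is equally likely. It only illustrates the definition and is not the algorithm used by Kuiper2sample_Rcpp; the helper names and data are hypothetical.

```r
# Hypothetical helpers (not part of KSgeneral), suitable for tiny samples only.
kuiper_stat <- function(x, y) {
  z <- sort(unique(c(x, y)))
  d <- stats::ecdf(x)(z) - stats::ecdf(y)(z)
  max(d) - min(d)
}

kuiper_pvalue_bruteforce <- function(pooled, m, q, tol = 1e-12) {
  # Each column of 'splits' holds the positions assigned to the first sample
  splits <- utils::combn(length(pooled), m)
  v <- apply(splits, 2, function(idx) kuiper_stat(pooled[idx], pooled[-idx]))
  # Proportion of (assumed equally likely) splits with V >= q; the
  # tolerance guards against floating-point rounding in the comparison
  mean(v >= q - tol)
}

pooled <- c(1, 2, 2, 3)   # illustrative pooled sample with one tie, m = n = 2
p <- kuiper_pvalue_bruteforce(pooled, m = 2, q = 1)   # 2 of 6 splits give V = 1
```

The number of splits grows as choose(m + n, m), so this enumeration is practical only for a handful of observations.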

Kuiper2sample_Rcpp implements an algorithm from Dimitrova, Jia, Kaishev (2024) that extends the algorithm provided by Nikiforov (1994) and generalizes the method due to Maag and Stephens (1968) and Hirakawa (1973). A version of Nikiforov's recurrence proposed recently by Viehmann (2021) is further incorporated, which computes the p-value directly, with up to 4 extra digits of accuracy, but at up to 3 times higher computational cost than Kuiper2sample_c_Rcpp. It is accurate and valid for arbitrary (possibly large) sample sizes. This algorithm ensures a total worst-case run-time of order O((mn)^{2}). When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order O(m^{2}n).

Other known implementations of the two-sample Kuiper test mainly use the approximation method or Monte Carlo simulation (see also Kuiper2sample). The former is invalid for data with ties and often gives p-values with large errors when sample sizes are small; the latter is usually slow and inaccurate. Compared with other known algorithms, Kuiper2sample_Rcpp allows data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and is more accurate and generally applicable for large sample sizes.

Value

Numeric value corresponding to P(V_{m,n}\geq q), given sample sizes m, n and M. The function returns -1 if m or n is non-positive or if their least common multiple exceeds the limit 2147483647, -2 if M is not a permitted value, and -3 if the calculation is numerically unstable.
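Because failures are signalled through negative return values rather than R errors, callers may wish to check the result before using it. A minimal sketch follows; the wrapper name check_kuiper_result is hypothetical and not part of KSgeneral.

```r
# Hypothetical helper (not part of KSgeneral): turn the documented
# negative return codes into R errors.
check_kuiper_result <- function(p) {
  if (p == -1) stop("m or n is non-positive, or their least common multiple exceeds 2147483647")
  if (p == -2) stop("M is not a permitted value")
  if (p == -3) stop("the calculation was numerically unstable")
  p
}

# 0.25 stands in for the output of Kuiper2sample_Rcpp(m, n, M, q)
p <- check_kuiper_result(0.25)
```

A valid p-value passes through unchanged, so the wrapper can be applied directly to the function's output.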

References

Maag, U. R., Stephens, M. A. (1968). The V_{NM} Two-Sample Test. The Annals of Mathematical Statistics, 39(3), 923-935.

Hirakawa, K. (1973). The two-sample Kuiper test. TRU Mathematics, 9, 99-118.

Nikiforov, A. M. (1994). Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265-270.

Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.

Dimitrova, D. S., Jia, Y., Kaishev, V. K. (2024). The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests. Submitted.

Examples

## Computing the p-value of the unweighted two-sample Kuiper test
## Example, see Nikiforov (1994)

m <- 120
n <- 150
q <- 0.183333333
M <- c(80,70,40,80)
Kuiper2sample_Rcpp(m, n, M, q)

[Package KSgeneral version 2.0.2 Index]