KS2sample_Rcpp {KSgeneral}    R Documentation

R function calling the C++ routines that compute the p-value for a (weighted) two-sample Kolmogorov-Smirnov (KS) test, given an arbitrary positive weight function and arbitrary data samples with possibly repeated observations (i.e. ties)

Description

Function directly calling the C++ routines that compute the exact p-value P(D_{m,n} \ge q) for the (weighted) two-sample one- or two-sided Kolmogorov-Smirnov statistic at a fixed q \in [0,1], given the sample sizes m and n, the vector of weights w_vec and the vector M containing the number of times each distinct observation is repeated in the pooled sample.

Usage

KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

Arguments

m

the size of the first tested sample.

n

the size of the second tested sample.

kind

an integer value (1, 2 or 3) which specifies the alternative hypothesis. When kind = 1, the test is two-sided. When kind = 2 or 3, the test is one-sided. See ‘Details’ for the meaning of the possible values. Any other value is invalid.

M

an integer-valued vector with k cells, where k denotes the number of distinct values in the ordered pooled sample of the tested pair of samples (i.e. a_1<a_2<\ldots<a_k). M[i] is the number of times that a_i is repeated in the pooled sample. A valid M must have strictly positive integer values, and the sum of all its cells must equal m+n.

q

numeric value between 0 and 1, at which the p-value P(D_{m,n}\geq q) is computed.

w_vec

a vector with m+n-1 cells, giving weights to each observation in the pooled sample. A valid w_vec must have m+n-1 cells and strictly positive values. See ‘Details’ for the meaning of the value in each cell.

tol

the value of \epsilon for computing P(D_{m,n} > q - \epsilon), which is equivalent to P(D_{m,n} \geq q). A non-positive input (tol \leq 0) or a large input (tol > 1e-6) is replaced by tol = 1e-6. When m and n have a large least common multiple, a smaller value is highly recommended.
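The constraints listed above can be checked in plain R before the C++ routine is called. The following is a minimal sketch only: the helper name check_ks2_args is hypothetical (it is not part of KSgeneral), and it mirrors the conditions as documented on this page rather than the package's internal checks.

```r
# Hypothetical helper (not part of KSgeneral): checks the documented
# constraints on the arguments and applies the documented rule for tol.
check_ks2_args <- function(m, n, kind, M, q, w_vec, tol = 1e-6) {
  stopifnot(
    m > 0, n > 0,                      # positive sample sizes
    kind %in% 1:3,                     # 1 = two-sided, 2 or 3 = one-sided
    all(M > 0), all(M == round(M)),    # strictly positive integer counts
    sum(M) == m + n,                   # counts cover the pooled sample
    q >= 0, q <= 1,                    # q lies in [0, 1]
    length(w_vec) == m + n - 1,        # one weight per i = 1, ..., m+n-1
    all(w_vec > 0)                     # strictly positive weights
  )
  # Documented rule: tol <= 0 or tol > 1e-6 is replaced by 1e-6.
  if (tol <= 0 || tol > 1e-6) tol <- 1e-6
  list(m = m, n = n, kind = kind, M = as.integer(M),
       q = q, w_vec = w_vec, tol = tol)
}

# Values taken from the 'Examples' section, with a deliberately large tol.
args <- check_ks2_args(120, 150, 1, c(80, 70, 40, 80), 0.1,
                       rep(1, 120 + 150 - 1), tol = 0.5)
args$tol
```

The returned list has the same names as the arguments of KS2sample_Rcpp, so it could be passed on with do.call once the package is loaded.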

Details

Consider a pair of random samples \bm{X}_m=(X_{1},..., X_{m}) and \bm{Y}_n=(Y_{1},..., Y_{n}) of sizes m and n, with empirical cdfs F_{m}(t) and G_{n}(t) respectively, coming from some unknown cdfs F(x) and G(x). It is assumed that F(x) and G(x) can be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, either against the alternative hypothesis H_1: F(x)\neq G(x) for at least one x, which corresponds to the two-sided test, or against H_1: F(x)> G(x) or H_1: F(x)< G(x) for at least one x, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:

\Delta_{m,n} = \sup |F_{m}(t) - G_n(t)|W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)\neq G(x)

\Delta_{m,n}^{+} = \sup [F_{m}(t) - G_n(t)]W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)> G(x)

\Delta_{m,n}^{-} = \sup [G_n(t) - F_{m}(t)]W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)< G(x),

where E_{m+n}(t) is the empirical cdf of the pooled sample \bm{Z}_{m,n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}) and W( ) is a strictly positive weight function defined on [0,1].

w_vec[i] (0<i<m+n) is then equal to W(Z_i)=W(\frac{i}{m+n}), where Z_i is the i-th smallest observation in the pooled sample \bm{Z}_{m,n}. Different values of w_vec specify different weighted Kolmogorov-Smirnov tests. For example, when w_vec = rep(1,m+n-1), KS2sample_Rcpp calculates the p-value of the unweighted two-sample Kolmogorov-Smirnov test; when w_vec = ((1:(m+n-1))*((m+n-1):1))^(-1/2), it calculates the p-value for the weighted two-sample Kolmogorov-Smirnov test with Anderson-Darling weight W(t) = 1/[t(1-t)]^{1/2}.
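The two weight vectors mentioned above can be built and inspected in base R. This is a small sketch under stated assumptions: the names w_unweighted, w_ad and W are illustrative only. The last line checks a purely arithmetic fact, namely that the documented Anderson-Darling vector equals W(i/(m+n)) divided by the constant m+n.

```r
m <- 120
n <- 150
N <- m + n
i <- seq_len(N - 1)                   # i = 1, ..., m+n-1

# Unweighted test: every cell equals 1.
w_unweighted <- rep(1, N - 1)

# Anderson-Darling weights exactly as written in the text;
# ((N-1):1)[i] equals N - i, so this is (i * (N - i))^(-1/2).
w_ad <- ((1:(N - 1)) * ((N - 1):1))^(-1/2)

# Both vectors satisfy the documented validity conditions.
stopifnot(length(w_unweighted) == N - 1, all(w_unweighted > 0))
stopifnot(length(w_ad) == N - 1, all(w_ad > 0))

# Illustrative weight function W(t) = 1 / sqrt(t * (1 - t)).
W <- function(t) 1 / sqrt(t * (1 - t))

# Arithmetic relation between the documented vector and W(i / N).
all.equal(w_ad * N, W(i / N))
```

The Anderson-Darling weights are symmetric in i and N - i and are largest at both ends of the pooled sample, which is where the unweighted statistic is least sensitive.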

Possible values of kind are 1, 2 and 3, which specify the alternative hypothesis, i.e. specify the test statistic to be either \Delta_{m,n}, \Delta_{m,n}^{+} or \Delta_{m,n}^{-} respectively.
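To make the correspondence between kind and the three statistics concrete, the observed value of the statistic can be computed from two raw samples in base R. This is a sketch, not the package's code: the function name ks2_stat is hypothetical, and it assumes that the weight at a tied value a_j is w_vec[i] with i the pooled rank of the last occurrence of a_j, so that i/(m+n) = E_{m+n}(a_j). The largest pooled value is skipped because F_m - G_n = 0 there.

```r
# Hypothetical helper: observed (weighted) two-sample KS statistic.
# kind = 1: sup |F_m - G_n| W;  kind = 2: sup (F_m - G_n) W;
# kind = 3: sup (G_n - F_m) W.
ks2_stat <- function(x, y, w_vec, kind = 1) {
  m <- length(x)
  n <- length(y)
  stopifnot(kind %in% 1:3, length(w_vec) == m + n - 1)

  r <- rle(sort(c(x, y)))        # distinct pooled values and their counts
  rank_hi <- cumsum(r$lengths)   # pooled rank of the last copy of each value
  keep <- rank_hi < m + n        # drop the largest value (difference is 0)
  a <- r$values[keep]

  # Weighted difference of the empirical cdfs at each distinct value.
  d <- (ecdf(x)(a) - ecdf(y)(a)) * w_vec[rank_hi[keep]]

  # The difference is 0 below the smallest observation, hence max(0, ...).
  switch(kind, max(0, abs(d)), max(0, d), max(0, -d))
}

x <- c(1, 2, 3)
y <- c(2, 3, 4, 5)
w <- rep(1, length(x) + length(y) - 1)   # unweighted case
sapply(1:3, function(k) ks2_stat(x, y, w, kind = k))
```

For these two small samples the unweighted differences F_m - G_n at the values 1, 2, 3, 4 are 1/3, 5/12, 1/2 and 1/4, so the two-sided and the "+" statistics are both 0.5 and the "-" statistic is 0.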

The numeric array M specifies the number of repeated observations in the pooled sample. For a particular realization of the pooled sample \bm{Z}_{m,n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}), let there be k distinct values, a_1<a_2<...<a_k, in the ordered, pooled sample (z_1\leq z_2\leq \ldots \leq z_{m+n}), where k\leq m+n, and where m_i=M[i] is the number of times a_i, i=1,\ldots,k appears in the pooled sample. The p-value is then defined as the probability

P\left(D_{m,n}\geq q\right),

where D_{m,n} is the two-sample Kolmogorov-Smirnov test statistic defined according to the chosen weight and alternative, for two samples \bm{X}'_m and \bm{Y}'_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. D_{m,n} is defined on the space \Omega (see further details in KS2sample), and q\in [0,1].
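The definition above can be illustrated in base R: M is obtained from the run lengths of the sorted pooled sample, and the probability P(D_{m,n} \geq q) can be approximated by repeatedly drawing m observations from the pooled sample without replacement. This is a rough Monte Carlo sketch for the unweighted two-sided case only, not the exact algorithm implemented by KS2sample_Rcpp; the function name mc_pvalue and the number of replications B are assumptions made for illustration.

```r
# M from two raw samples: counts of each distinct pooled value, in order.
x <- c(1, 2, 3)
y <- c(2, 3, 4, 5)
M_small <- rle(sort(c(x, y)))$lengths
M_small                                   # counts for a_1 < ... < a_k

# Hypothetical helper: Monte Carlo approximation of P(D >= q) for the
# unweighted two-sided statistic, resampling without replacement.
mc_pvalue <- function(M, m, q, B = 2000) {
  k <- length(M)
  n <- sum(M) - m
  z <- rep(seq_len(k), times = M)         # pooled sample, coded as 1..k
  total <- cumsum(M)                      # cumulative pooled counts
  d <- replicate(B, {
    cx <- cumsum(tabulate(sample(z, m), nbins = k))  # cumulative counts in X'
    cy <- total - cx                                 # the rest forms Y'
    max(abs(cx / m - cy / n))
  })
  # Small slack so that values equal to q up to rounding are counted.
  mean(d >= q - 1e-9)
}

set.seed(1)
p_hat <- mc_pvalue(M = c(80, 70, 40, 80), m = 120, q = 0.1)
p_hat
```

The estimate has Monte Carlo error of order 1/sqrt(B), so it is only a sanity check against the exact value returned by the C++ routine.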

KS2sample_Rcpp implements an exact algorithm that extends the Fortran 77 subroutine due to Nikiforov (1994), allowing more flexible choices of the weight function and handling large sample sizes. A version of Nikiforov's recurrence recently proposed by Viehmann (2021) is also incorporated; it computes the p-value directly and with higher accuracy, giving up to 17 correct digits, but at up to 3 times the computational cost of KS2sample_c_Rcpp. Compared with other known algorithms, it allows the data samples to come from continuous, discrete or mixed distributions (i.e. ties may appear), and it is more efficient and more generally applicable for large sample sizes. The algorithm has a total worst-case run-time of order O(nm).

Value

Numeric value corresponding to P(D_{m,n}\geq q), given the sample sizes m and n, M and w_vec. If m or n is non-positive, or if the length of w_vec is not equal to m+n-1, the function returns -1. A non-permitted value of M or a non-permitted value inside w_vec returns -2. A numerically unstable calculation returns -3.
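Because errors are signalled through negative return values rather than R conditions, calling code may want to translate them. A minimal sketch follows; the wrapper name interpret_ks2 is hypothetical, and the messages simply restate the codes described above.

```r
# Hypothetical helper: turn the documented error codes into R errors and
# pass a genuine p-value through unchanged.
interpret_ks2 <- function(p) {
  if (p == -1) {
    stop("m or n is non-positive, or length(w_vec) differs from m+n-1")
  } else if (p == -2) {
    stop("non-permitted value in M or in w_vec")
  } else if (p == -3) {
    stop("numerically unstable calculation")
  }
  p
}

interpret_ks2(0.25)                          # a valid p-value is returned as is
res <- try(interpret_ks2(-2), silent = TRUE) # an error code becomes an R error
inherits(res, "try-error")
```

With the package loaded, the wrapper would typically be applied directly to the result, e.g. interpret_ks2(KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)).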

Source

Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).

References

Paul L. Canner (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function". Journal of the American Statistical Association, 70(349), 209-211.

Nikiforov, A. M. (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265–270.

Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv preprint arXiv:2102.08037.

Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". Submitted.

Examples

## Computing the unweighted two-sample Kolmogorov-Smirnov test
## See the example in Nikiforov (1994)

m <- 120
n <- 150
kind <- 1
q <- 0.1
M <- c(80,70,40,80)
w_vec <- rep(1,m+n-1)
tol <- 1e-6
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

kind <- 2
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

## Computing the weighted two-sample Kolmogorov-Smirnov test
## with Anderson-Darling weight
kind <- 3
w_vec <- ((1:(m+n-1))*((m+n-1):1))^(-1/2)
KS2sample_Rcpp(m, n, kind, M, q, w_vec, tol)

[Package KSgeneral version 2.0.2 Index]