KSgeneral-package {KSgeneral}

R Documentation

Computing P-Values of the One-Sample K-S Test and the Two-Sample K-S and Kuiper Tests for (Dis)Continuous Null Distribution

Description

This package computes p-values of the one-sample and two-sample Kolmogorov-Smirnov (KS) tests and the two-sample Kuiper test.

The one-sample two-sided Kolmogorov-Smirnov (KS) statistic is one of the most popular goodness-of-fit test statistics, used to measure how well the distribution of a random sample agrees with a prespecified theoretical distribution. Given a random sample \{X_{1},..., X_{n}\} of size n with empirical cdf F_{n}(x), the two-sided KS statistic is defined as D_{n} = \sup | F_{n}(x) - F(x) |, where F(x) is the cdf of the prespecified theoretical distribution under the null hypothesis H_{0} that \{ X_{1},..., X_{n} \} comes from F(x). The package KSgeneral implements a novel, accurate and efficient Fast Fourier Transform (FFT)-based method, referred to as the Exact-KS-FFT method, to compute the complementary cdf, P(D_{n} \ge q), at a fixed q\in [0, 1] for a given (hypothesized) purely discrete, mixed or continuous underlying cdf F(x), and an arbitrary, possibly very large, sample size n. A plot of the complementary cdf P(D_{n} \ge q), 0 \le q \le 1, can also be produced.
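The definition of D_{n} above can be illustrated with a minimal base-R sketch. It assumes that F(x) is continuous and that the sample has no ties, in which case the supremum is attained at the order statistics. The helper name ks_stat_one_sample is hypothetical and is not part of KSgeneral; the result is cross-checked against the statistic reported by stats::ks.test.

```r
# Two-sided one-sample KS statistic D_n = sup |F_n(x) - F(x)|.
# Sketch only: assumes a continuous cdf F and a sample without ties.
# ks_stat_one_sample is a hypothetical helper, not a KSgeneral function.
ks_stat_one_sample <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                    # F at the order statistics
  d_plus  <- max(seq_len(n) / n - u)        # largest value of F_n - F
  d_minus <- max(u - (seq_len(n) - 1) / n)  # largest value of F - F_n
  max(d_plus, d_minus)
}

x <- c(0.61, -1.20, 0.05, 1.83, -0.44, 0.27, -0.91, 1.10)
d_n   <- ks_stat_one_sample(x, pnorm)
d_ref <- unname(stats::ks.test(x, "pnorm")$statistic)
```

When F(x) has jumps, evaluating F only at the sample points is not sufficient, because left limits of F also matter. That is the situation the discrete and mixed functions of the package are designed for.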

In other words, the package computes the p-value P(D_{n} \ge q) for any fixed critical level q\in [0, 1]. If an observed (data) sample \{x_{1},..., x_{n}\} is supplied, KSgeneral computes the p-value P(D_{n} \ge d_{n}), where d_{n} is the value of the KS test statistic computed from \{x_{1},..., x_{n}\}. One can also compute the (complementary) cdf of the one-sided KS statistics D_{n}^{-} or D_{n}^{+} (cf. Dimitrova, Kaishev, Tan (2020)) by specifying A_{i} = 0 for all i or B_{i} = 1 for all i, respectively, in the function ks_c_cdf_Rcpp.

The two-sample Kolmogorov-Smirnov (KS) and Kuiper statistics are widely used to test the null hypothesis (H_0) that two data samples come from the same underlying distribution. Consider a pair of random samples \bm{X}_m=(X_{1},..., X_{m}) and \bm{Y}_n=(Y_{1},..., Y_{n}) of sizes m and n, with empirical cdfs F_{m}(t) and G_{n}(t) respectively, coming from unknown cdfs F(x) and G(x). It is assumed that F(x) and G(x) may each be continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. We want to test the null hypothesis H_0: F(x) = G(x) for all x, either against the alternative hypothesis H_1: F(x)\neq G(x) for at least one x, which corresponds to the two-sided test, or against H_1: F(x)> G(x) or H_1: F(x)< G(x) for at least one x, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics used to test these hypotheses are generally defined as:

\Delta_{m,n} = \sup |F_{m}(t) - G_n(t)|W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)\neq G(x)

\Delta_{m,n}^{+} = \sup [F_{m}(t) - G_n(t)]W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)> G(x)

\Delta_{m,n}^{-} = \sup [G_n(t) - F_{m}(t)]W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)< G(x)

where E_{m+n}(t) is the empirical cdf of the pooled sample \bm{Z}_{m+n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}) and W( ) is a strictly positive weight function defined on (0,1). KSgeneral implements an exact algorithm, an extension of the Fortran 77 subroutine due to Nikiforov (1994), to calculate the exact p-value P(D_{m,n} \ge q), where q\in [0,1] and D_{m,n} is the two-sample Kolmogorov-Smirnov goodness-of-fit test statistic defined on the space \Omega of all possible \frac{(m+n)!}{m!n!} pairs of samples, \bm{X}'_m and \bm{Y}'_n of sizes m and n, that are randomly drawn from the pooled sample \bm{Z}_{m+n} without replacement. If two data samples \{x_1,\ldots,x_m\} and \{y_1,\ldots,y_n\} are supplied, the package computes P(D_{m,n} \ge d), where d is the observed value of \Delta_{m,n} computed from these two observed samples. Samples may come from any continuous, discrete or mixed distribution, i.e. the test allows repeated observations to appear in the user-provided data samples \{x_1,\ldots,x_m\}, \{y_1,\ldots,y_n\} and their pooled sample \bm{Z}_{m+n}=\{x_1,\ldots,x_m,y_1,\ldots,y_n\}.
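As an illustration of the three statistics above, the following base-R sketch evaluates them for the unweighted case, that is, assuming W( ) is identically 1. The helper name ks_stats_two_sample and the data are hypothetical and not taken from KSgeneral. Because both empirical cdfs are step functions that only jump at pooled sample points, it is enough to evaluate the difference at the distinct pooled values; ties are handled by ecdf.

```r
# Unweighted two-sample KS statistics (assumption: W == 1).
# ks_stats_two_sample is a hypothetical helper, not a KSgeneral function.
ks_stats_two_sample <- function(x, y) {
  z <- sort(unique(c(x, y)))       # distinct points of the pooled sample
  d <- ecdf(x)(z) - ecdf(y)(z)     # F_m(t) - G_n(t) at every jump point
  c(two_sided = max(abs(d)),       # Delta_{m,n}
    plus      = max(d),            # Delta_{m,n}^{+}
    minus     = max(-d))           # Delta_{m,n}^{-}
}

# Illustrative data with repeated observations within and across samples
x <- c(1, 2, 2, 5, 7)
y <- c(2, 3, 3, 6, 8, 9)
s <- ks_stats_two_sample(x, y)
```

For these data the largest difference occurs at t = 2, where F_{m}(2) = 3/5 and G_{n}(2) = 1/6, so the two-sided and the "plus" statistics both equal 13/30, while the "minus" statistic is 0.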

The two-sample (unweighted) Kuiper goodness-of-fit statistic is defined as:

\varsigma_{m,n} = \sup [F_{m}(t) - G_n(t)] - \inf [F_{m}(t) - G_n(t)].

It is widely used when the data samples are periodic or circular (e.g. data measured in radians). KSgeneral calculates the exact p-value P(V_{m,n} \ge q), where q\in [0,2] and V_{m,n} is the two-sample Kuiper goodness-of-fit test statistic defined on the space \Omega, as described above. If two data samples \{x_1,\ldots,x_m\} and \{y_1,\ldots,y_n\} are supplied, the package computes P(V_{m,n} \ge v), where v is the observed value of \varsigma_{m,n} computed from these two observed samples. As for the KS test, the two-sample Kuiper test also allows repeated observations in the user-provided data samples \{x_1,\ldots,x_m\}, \{y_1,\ldots,y_n\} and their pooled sample \bm{Z}_{m+n}=\{x_1,\ldots,x_m,y_1,\ldots,y_n\}.
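The Kuiper statistic defined above can be evaluated in the same way as the KS statistics, as in the following base-R sketch. The helper name kuiper_stat_two_sample and the data are hypothetical and not part of KSgeneral.

```r
# Two-sample (unweighted) Kuiper statistic: sup[F_m - G_n] - inf[F_m - G_n].
# kuiper_stat_two_sample is a hypothetical helper, not a KSgeneral function.
kuiper_stat_two_sample <- function(x, y) {
  z <- sort(unique(c(x, y)))       # distinct points of the pooled sample
  d <- ecdf(x)(z) - ecdf(y)(z)     # F_m(t) - G_n(t) at every jump point
  # d is 0 at the largest pooled point, which also covers the value 0
  # taken by the difference to the left of the smallest pooled point.
  max(d) - min(d)
}

# Illustrative data with repeated observations
x <- c(1, 2, 2, 5, 7)
y <- c(0, 3, 3, 4, 8, 9)
v <- kuiper_stat_two_sample(x, y)
```

Here the difference F_{m}(t) - G_{n}(t) reaches its maximum 13/30 at t = 2 and its minimum -1/6 at t = 0, so v = 13/30 + 1/6 = 0.6.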

Details

One-sample KS test:

The Exact-KS-FFT method to compute p-values of the one-sample KS test in KSgeneral is based on expressing the p-value P(D_{n} \ge q) in terms of an appropriate rectangle probability with respect to the uniform order statistics, as noted by Gleser (1985) for P(D_{n} > q). The latter representation is used to express P(D_{n} \ge q) via a double-boundary non-crossing probability for a homogeneous Poisson process, with intensity n, which is then efficiently computed using FFT, ensuring total run-time of order O(n^{2}\log(n)) (see Dimitrova, Kaishev, Tan (2020) and also Moscovich and Nadler (2017) for the special case when F(x) is continuous).

The code for the one-sample KS test in KSgeneral is an R wrapper of the original C++ code due to Dimitrova, Kaishev, Tan (2020) and is based on the C++ code developed by Moscovich and Nadler (2017). The package includes the functions disc_ks_c_cdf, mixed_ks_c_cdf and cont_ks_c_cdf that compute the complementary cdf P(D_{n} \ge q), for a fixed q, 0 \le q \le 1, when F(x) is purely discrete, mixed or continuous, respectively. KSgeneral also includes the functions disc_ks_test, mixed_ks_test and cont_ks_test that compute the p-value P(D_{n} \ge d_{n}), where d_{n} is the value of the KS test statistic computed from a user-provided data sample \{x_{1}, ..., x_{n}\}, when F(x) is purely discrete, mixed or continuous, respectively.

The functions disc_ks_test and cont_ks_test are accurate and fast (run time O(n^{2}\log(n))) alternatives to the function ks.test from the package dgof and the function ks.test from the package stats, which compute the p-value P(D_{n} \ge d_{n}) assuming F(x) is purely discrete or continuous, respectively.

The package also includes the function ks_c_cdf_Rcpp which gives the flexibility to compute the complementary cdf (p-value) for the one-sided KS test statistics D_{n}^{-} or D_{n}^{+}. It also allows for faster computation time and possibly higher accuracy in computing P(D_{n} \ge q).

Two-sample KS test and Kuiper test:

The method used to compute p-values of the two-sample KS and Kuiper tests in KSgeneral is an extension of the algorithm due to Nikiforov (1994) and is based on expressing the p-value as the probability that a point sequence stays within a certain region in the two-dimensional integer-valued lattice. The algorithm for both tests uses a recursive formula to calculate the total number of point sequences within the region, which is then divided by the total number of elements in \Omega, i.e. \frac{(m+n)!}{m!n!}, to obtain the probability.
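The lattice-path idea can be sketched in base R for the simplest setting only: the unweighted two-sided KS statistic with no ties in the pooled sample. This is the classical path-counting recursion, not the generalized algorithm implemented in KSgeneral, and the helper name ks2_exact_pvalue is hypothetical. A path from (0,0) to (m,n) moves one step in i for each X and one step in j for each Y in the ordered pooled sample, and it stays inside the region while |i/m - j/n| < q.

```r
# Exact P(D_{m,n} >= q) by counting lattice paths (sketch only).
# Assumptions: unweighted two-sided statistic, no ties in the pooled sample.
# ks2_exact_pvalue is a hypothetical helper, not a KSgeneral function.
ks2_exact_pvalue <- function(q, m, n) {
  # (i, j) is inside the region when |i/m - j/n| < q; the comparison is
  # done on the integer scale with a small tolerance for rounding.
  inside <- function(i, j) abs(i * n - j * m) < q * m * n - 1e-9
  N <- matrix(0, m + 1, n + 1)    # N[i + 1, j + 1] = number of paths to (i, j)
  if (inside(0, 0)) N[1, 1] <- 1
  for (i in 0:m) {
    for (j in 0:n) {
      if ((i == 0 && j == 0) || !inside(i, j)) next
      from_x_step <- if (i > 0) N[i, j + 1] else 0   # arriving from (i - 1, j)
      from_y_step <- if (j > 0) N[i + 1, j] else 0   # arriving from (i, j - 1)
      N[i + 1, j + 1] <- from_x_step + from_y_step
    }
  }
  # Paths that never leave the region correspond to D_{m,n} < q
  1 - N[m + 1, n + 1] / choose(m + n, m)
}

# m = n = 2 and q = 1: only the two fully separated arrangements out of
# choose(4, 2) = 6 give D = 1, so the p-value is 1/3.
p <- ks2_exact_pvalue(1, 2, 2)
```

The sketch stores the full (m+1) by (n+1) table for readability and uses O(mn) operations. Ties and weight functions, which the package supports, would require a different definition of the region and are not handled here.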

For a particular realization of the pooled sample \bm{Z}_{m+n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}), the p-values calculated by the functions KS2sample and Kuiper2sample are the probabilities:

P(D_{m,n}\geq q), P(V_{m,n}\geq q),

where D_{m,n} and V_{m,n} are the two-sample Kolmogorov-Smirnov and Kuiper test statistics respectively, for two samples \bm{X}'_m and \bm{Y}'_n of sizes m and n, randomly drawn from the pooled sample without replacement, i.e. they are defined on the space \Omega and q\in [0,1] for the KS test, q \in [0,2] for the Kuiper test.

Both KS2sample and Kuiper2sample implement algorithms which generalize the method due to Nikiforov (1994), and calculate the exact p-values of the KS test and the Kuiper test respectively. Both of them allow tested data samples to come from continuous, discrete or mixed distributions (ties are also allowed).

KS2sample ensures a total worst-case run-time of order O(nm). Compared with other known algorithms, it not only allows more flexible choices of weights, leading to better power (see Dimitrova, Jia, Kaishev (2024)), but is also more efficient and more generally applicable for large sample sizes. Kuiper2sample is accurate and valid for large sample sizes. It ensures a total worst-case run-time of order O((mn)^{2}). When m and n have a large greatest common divisor (an extreme case is m = n), it ensures a total worst-case run-time of order O(m^{2}n).

Author(s)

Dimitrina S. Dimitrova <D.Dimitrova@city.ac.uk>, Yun Jia <yunjia2019@gmail.com>, Vladimir K. Kaishev <Vladimir.Kaishev.1@city.ac.uk>, Senren Tan <raymondtsrtsr@outlook.com>

Maintainer: Dimitrina S. Dimitrova <D.Dimitrova@city.ac.uk>

References

Dimitrina S. Dimitrova, Vladimir K. Kaishev, Senren Tan. (2020) "Computing the Kolmogorov-Smirnov Distribution When the Underlying CDF is Purely Discrete, Mixed or Continuous". Journal of Statistical Software, 95(10): 1-42. doi:10.18637/jss.v095.i10.

Gleser L.J. (1985). "Exact Power of Goodness-of-Fit Tests of Kolmogorov Type for Discontinuous Distributions". Journal of the American Statistical Association, 80(392), 954-958.

Moscovich A., Nadler B. (2017). "Fast Calculation of Boundary Crossing Probabilities for Poisson Processes". Statistics and Probability Letters, 123, 177-182.

Dimitrina S. Dimitrova, Yun Jia, Vladimir K. Kaishev (2024). "The R functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests". submitted

[Package KSgeneral version 2.0.2 Index]