KS2sample {KSgeneral}        R Documentation
Computes the p-value for a (weighted) two-sample Kolmogorov-Smirnov test, given an arbitrary positive weight function and arbitrary data samples with possibly repeated observations (i.e. ties)
Description
Computes the p-value P(D_{m,n} \ge q), where D_{m,n} is the one- or two-sided two-sample Kolmogorov-Smirnov test statistic with weight function weight, and q = d is the observed value of the KS statistic computed based on two data samples \{x_{1},..., x_{m}\} and \{y_{1},..., y_{n}\} that may come from continuous, discrete or mixed distributions, i.e. they may have repeated observations (ties).
Usage
KS2sample(x, y, alternative = c("two.sided", "less", "greater"),
conservative = F, weight = 0, tol = 1e-08, tail = T)
Arguments
x: a numeric vector of data sample values.
y: a numeric vector of data sample values.
alternative: indicates the alternative hypothesis and must be one of "two.sided" (default), "less", or "greater". One can specify just the initial letter of the string, but the argument name must be given in full, e.g. alternative = "g".
conservative: logical variable indicating whether ties should be considered. See 'Details' for the meaning.
weight: either a numeric value between 0 and 1 which specifies the form of the weight function from a class of pre-defined functions, or a user-defined strictly positive function of one variable. By default, no weight function is assumed. See 'Details' for the meaning of the possible values.
tol: the value of the numerical tolerance used in the computation of the p-value. By default, tol = 1e-08.
tail: logical variable indicating whether the p-value, P(D_{m,n} \ge q), or its complement, 1 - p, should be returned. By default tail = T and the p-value is returned. See 'Details' for the meaning.
Details
Given a pair of random samples \bm{X}_m=(X_{1},..., X_{m}) and \bm{Y}_n=(Y_{1},..., Y_{n}) of sizes m and n, with empirical cdfs F_{m}(t) and G_{n}(t) respectively, coming from some unknown cdfs F(x) and G(x), it is assumed that F(x) and G(x) could be either continuous, discrete or mixed, which means that repeated observations are allowed in the corresponding observed samples. The task is to test the null hypothesis H_0: F(x) = G(x) for all x, either against the alternative hypothesis H_1: F(x)\neq G(x) for at least one x, which corresponds to the two-sided test, or against H_1: F(x)> G(x) and H_1: F(x)< G(x) for at least one x, which correspond to the two one-sided tests. The (weighted) two-sample Kolmogorov-Smirnov goodness-of-fit statistics that are used to test these hypotheses are generally defined as:
\Delta_{m,n} = \sup_t |F_{m}(t) - G_n(t)| W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)\neq G(x)
\Delta_{m,n}^{+} = \sup_t [F_{m}(t) - G_n(t)] W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)> G(x)
\Delta_{m,n}^{-} = \sup_t [G_n(t) - F_{m}(t)] W(E_{m+n}(t)), \textnormal{ to test against the alternative } H_1: F(x)< G(x),
where E_{m+n}(t) is the empirical cdf of the pooled sample \bm{Z}_{m,n}=(X_{1},..., X_{m},Y_{1},..., Y_{n}), and W(\cdot) is a strictly positive weight function defined on [0,1].
Possible values of alternative are "two.sided", "greater" and "less", which specify the alternative hypothesis, i.e. specify the test statistic to be \Delta_{m,n}, \Delta_{m,n}^{+} or \Delta_{m,n}^{-}, respectively.
When weight is assigned a numeric value \nu between 0 and 1, the test statistic is specified as the weighted two-sample Kolmogorov-Smirnov test with generalized Anderson-Darling weight W(t)=1/[t(1-t)]^{\nu} (see Finner and Gontscharuk 2018). Then, for example, the two-sided two-sample Kolmogorov-Smirnov statistic has the following form:
\Delta_{m,n}=\sup\limits_{t} \frac{|F_m(t)-G_n(t)|}{[E_{m+n}(t)(1-E_{m+n}(t))]^{\nu}}
The latter specification defines a family of weighted Kolmogorov-Smirnov tests, covering the unweighted test (when weight = \nu = 0) and the widely-known weighted Kolmogorov-Smirnov test with Anderson-Darling weight (when weight = 0.5, see the definition of this statistic also in Canner 1975).
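As a brief illustration of this family, different members can be selected by varying the numeric value passed to weight (the data below are merely illustrative):

## nu = 0 gives the unweighted test; nu = 0.5 gives the Anderson-Darling weight
set.seed(1)
x <- rnorm(100); y <- rnorm(120)
KS2sample(x, y, weight = 0)      # unweighted Kolmogorov-Smirnov test
KS2sample(x, y, weight = 0.5)    # Anderson-Darling weight
KS2sample(x, y, weight = 0.25)   # intermediate member of the weight family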
If one wants to implement a weighted test with a user-specified weight function, for example, W(t)=1/[t(2-t)]^{1/2}, suggested by Büning (2001), which ensures higher power when both x and y come from distributions that are left-skewed and heavy-tailed, one can directly assign a univariate function with output value 1/sqrt(t*(2-t)) to weight. See 'Examples' for this demonstration.
For a particular realization of the pooled sample \bm{Z}_{m,n}, let there be k distinct values, a_1<a_2<...<a_k, in the ordered, pooled sample (z_1\leq z_2\leq \ldots \leq z_{m+n}), where k\leq m+n, and where m_i is the number of times a_i, i=1,\ldots,k, appears in the pooled sample. The p-value is then defined as the probability
p=P\left(D_{m,n}\geq q\right),
where D_{m,n} is the two-sample Kolmogorov-Smirnov test statistic defined according to the values of weight and alternative, for two samples \bm{X}'_m and \bm{Y}'_n of sizes m and n, randomly drawn from the pooled sample without replacement, and q = d, the observed value of the statistic calculated based on the user-provided data samples x and y. By default tail = T and the p-value is returned, otherwise 1 - p is returned.
Note that D_{m,n} is defined on the space \Omega of all possible pairs, C = \frac{(m+n)!}{m!n!}, of edfs F_m(x,\omega) and G_n(x,\omega), \omega \in \Omega, that correspond to the pairs of samples \bm{X}'_m and \bm{Y}'_n, randomly drawn from \bm{Z}_{m+n}, as follows. First, m observations are drawn at random without replacement, forming the first sample \bm{X}'_m, with corresponding edf F_m(x,\omega). The remaining n observations are then assigned to the second sample \bm{Y}'_n, with corresponding edf G_n(x,\omega). Observations are then replaced back in \bm{Z}_{m+n} and re-sampling is continued until the occurrence of all the C possible pairs of edfs F_m(x,\omega) and G_n(x,\omega), \omega \in \Omega. The pairs of edfs may be coincident if there are ties in the data, and each pair, F_m(x,\omega) and G_n(x,\omega), occurs with probability 1/C.
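The following minimal Monte Carlo sketch illustrates the conditional (permutation) definition above for the unweighted two-sided statistic; the helper ks_stat and the data values are purely illustrative and are not part of the package:

## draw pairs (X'_m, Y'_n) from the pooled sample without replacement and
## compare the relative frequency of exceedances with the exact p-value
set.seed(1)
x <- c(1, 1, 2, 3, 5)              # small samples with ties
y <- c(1, 2, 2, 4)
m <- length(x); n <- length(y)
z <- c(x, y)                       # pooled sample
ks_stat <- function(a, b) {        # unweighted two-sided statistic
  t <- sort(unique(c(a, b)))       # the supremum is attained at pooled data points
  max(abs(ecdf(a)(t) - ecdf(b)(t)))
}
d <- ks_stat(x, y)                 # observed value q = d
exceed <- replicate(10000, {
  idx <- sample(m + n, m)          # first sample X'_m
  ks_stat(z[idx], z[-idx]) >= d - 1e-12
})
mean(exceed)                       # Monte Carlo estimate of P(D_{m,n} >= d)
KS2sample(x, y)$p.value            # exact conditional p-value from KS2sample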
conservative is a logical variable indicating whether the test should be conducted conservatively. By default, conservative = F and KS2sample returns the p-value that is defined through the conditional probability above. However, when the user has a priori knowledge that both samples come from continuous distributions, even if ties are present (for example, when repeated observations are caused by rounding errors), the value conservative = T should be assigned, since the conditional probability is no longer relevant. In this case, KS2sample computes p-values for the Kolmogorov-Smirnov test assuming no ties are present, and returns a p-value which is an upper bound of the true p-value. Note that if the null hypothesis is rejected using the calculated upper bound for the p-value, it should also be rejected with the true p-value.
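A small sketch of this use case, under the assumption that the ties below arise only from rounding of observations coming from continuous distributions:

set.seed(2)
x <- round(rnorm(60), 1)              # rounding introduces ties
y <- round(rnorm(80), 1)
KS2sample(x, y)                       # conditional p-value (default, conservative = F)
KS2sample(x, y, conservative = TRUE)  # upper bound for the p-value, assuming continuity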
KS2sample calculates the exact p-value of the KS test using an algorithm which generalizes the method due to Nikiforov (1994). If tail = F, KS2sample calculates the complementary p-value, 1 - p. For this purpose, an exact algorithm which generalizes the method due to Nikiforov (1994) is implemented. Alternatively, if tail = T, a version of Nikiforov's recurrence proposed recently by Viehmann (2021) is implemented, which computes the p-value directly, with higher accuracy, giving up to 17 correct digits, but at up to 3 times higher computational cost. KS2sample ensures a total worst-case run-time of order O(nm). In comparison with other known algorithms, it not only allows the flexible choice of weights, which in some cases improves the statistical power (see Dimitrova, Jia, Kaishev 2024), but is also more efficient and generally applicable for large sample sizes.
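A short sketch contrasting the two computations; it is assumed here that, when tail = FALSE, the complementary probability 1 - p is reported in the p.value component of the returned object:

set.seed(3)
x <- rnorm(1000); y <- rnorm(1200)
p_res  <- KS2sample(x, y, tail = TRUE)    # p-value via Viehmann's recurrence
cp_res <- KS2sample(x, y, tail = FALSE)   # complementary probability 1 - p
p_res$p.value + cp_res$p.value            # should be (approximately) equal to 1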
Value
A list with class "htest" containing the following components:
statistic: the value of the test statistic.
p.value: the p-value of the test.
alternative: a character string describing the alternative hypothesis.
data.name: a character string giving the names of the data.
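For instance, the components listed above can be accessed as in any "htest" object (the data below are illustrative):

res <- KS2sample(rexp(50), rexp(60))
res$statistic     # value of the test statistic
res$p.value       # p-value of the test
res$alternative   # description of the alternative hypothesis
res$data.name     # names of the data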
Source
Based on the Fortran subroutine by Nikiforov (1994). See also Dimitrova, Jia, Kaishev (2024).
References
Büning H (2001). "Kolmogorov-Smirnov- and Cramér-von Mises Type Two-sample Tests With Various Weight Functions." Communications in Statistics - Simulation and Computation, 30(4), 847-865.
Finner H, Gontscharuk V (2018). "Two-sample Kolmogorov-Smirnov-type tests revisited: Old and new tests in terms of local levels." The Annals of Statistics, 46(6A), 3014-3037.
Canner PL (1975). "A Simulation Study of One- and Two-Sample Kolmogorov-Smirnov Statistics with a Particular Weight Function." Journal of the American Statistical Association, 70(349), 209-211.
Nikiforov AM (1994). "Algorithm AS 288: Exact Smirnov Two-Sample Tests for Arbitrary Distributions." Journal of the Royal Statistical Society. Series C (Applied Statistics), 43(1), 265-270.
Viehmann T (2021). "Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test." arXiv preprint arXiv:2102.08037.
Dimitrova DS, Jia Y, Kaishev VK (2024). "The R Functions KS2sample and Kuiper2sample: Efficient Exact Calculation of P-values of the Two-sample Kolmogorov-Smirnov and Kuiper Tests." Submitted.
Examples
##Computes p-value of two-sided unweighted test for continuous data
data1 <- rexp(750, 1)
data2 <- rexp(800, 1)
KS2sample(data1, data2)
##Computes the complementary p-value
KS2sample(data1, data2, tail = FALSE)
##Computes p-value of one-sided test with Anderson-Darling weight function
KS2sample(data1, data2, alternative = "greater", weight = 0.5)
##Computes p-value of two-sided test with Buning's weight function for discrete data
data3 <- rnbinom(100, size = 3, prob = 0.6)
data4 <- rpois(120, lambda = 2)
f <- function(t) 1 / sqrt( t * (2 - t) )
KS2sample(data3, data4, weight = f)