R: euMMD: Efficient Univariate Maximum Mean Discrepancy

eummd {eummd}

R Documentation

euMMD: Efficient Univariate Maximum Mean Discrepancy

Description

Computes the maximum mean discrepancy statistic with the Laplacian kernel. Suitable only for univariate data. Computing the statistic alone for n observations is O(n \log n), and computing the p-value for L permutations is O(n \log n + Ln).

Usage

eummd(
  x,
  y,
  beta = -0.1,
  pval = TRUE,
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

`x`	Univariate vector of observations in first sample.
`y`	Univariate vector of observations in second sample.
`beta`	kernel parameter. Must be positive; if not, computes median heuristic in quadratic time. Default value is `-0.1`, which will force median heuristic to be used.
`pval`	Boolean for whether to compute p-value or not.
`numperm`	Number of permutations. Default is `200`.
`seednum`	Seed number for generating permutations. Default is `0`, which means seed is set randomly. For values larger than `0`, results will be reproducible.
`alternative`	A character string specifying the alternative hypothesis, which must be either `"greater"` (default) or `"two.sided"`. In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, `"greater"` is recommended. The test will still work in many cases with `"two.sided"` specified, but this could lead to problems in certain cases.
`allowzeropval`	A boolean, specifying whether we will allow zero p-values or not. Default is `FALSE`; then a threshold of `0.5 / (numperm+1)` is used, and if the computed p-value is less than this threshold, it is then set to be this value. this avoids the possibility of zero p-values.

Details

If the total number of observations in both samples is n, first sort combined sample in O(n \log n) before remaining steps are linear in n.

If beta is not a positive value, median difference is computed as follows:

m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\, i=1, 2,\dots, n+m,\,\,\textnormal{ and } j=1, 2,\dots, i-1 \},

where || x_i - x_j ||_1 is the 1-norm, and so if the data are univariate then

|| x_i - x_j ||_1 = |x_{i} - x_{j}|.

and finally median heuristic is beta = 1/m. This can be computed in O(n \log n ) time using the algorithms of Johnson and Mizoguchi (1978) and Croux and Rousseuw (1992); see mediandiff for references.

The Laplacian kernel k is defined as

k(x,y) = \exp [ -\beta ||x - y||_1 ].

The random seed is set for std::mt19937 and std::shuffle in C++.

Value

A list with the following elements:

pval: The p-value of the test, if it is computed (pval=TRUE). Otherwise, it is set to NA.
stat: The statistic of the test, which is always computed.
beta: The kernel parameter used in the test. If beta was not initialised or negative, this will be the median heuristic value.

References

Bodenham, D. A., and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing 33.5 (2023): 110.

Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).

Johnson, D.B., and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal of Computing, 7, 147-153.

Examples

x <- c(7.1, 1.2, 4.3, 0.4)
y <- c(5.5, 2.6, 8.7)
#setting the kernel parameter to be 0.1; setting seed=1 for reproducibility
mmd_list <- eummd(x, y, beta=0.1, seednum=1)

#now using median heuristic (default)
mmd_list <- eummd(x, y, seednum=1)

#now not computing the p-value, only the statistic
mmd_list <- eummd(x, y, pval=FALSE, seednum=1)

#now using a larger number of permutations 
mmd_list <- eummd(x, y, numperm=1000, seednum=1)

[Package eummd version 0.1.9 Index]