eummd {eummd} | R Documentation |
euMMD: Efficient Univariate Maximum Mean Discrepancy
Description
Computes the maximum mean discrepancy statistic with the Laplacian kernel.
Suitable only for univariate data. Computing the statistic alone
for n
observations is O(n \log n)
, and computing the
p-value for L
permutations is O(n \log n + Ln)
.
Usage
eummd(
x,
y,
beta = -0.1,
pval = TRUE,
numperm = 200,
seednum = 0,
alternative = c("greater", "two.sided"),
allowzeropval = FALSE
)
Arguments
x |
Univariate vector of observations in first sample. |
y |
Univariate vector of observations in second sample. |
beta |
kernel parameter. Must be positive; if not, computes
median heuristic in quadratic time. Default value
is |
pval |
Boolean for whether to compute p-value or not. |
numperm |
Number of permutations. Default is |
seednum |
Seed number for generating permutations. Default is |
alternative |
A character string specifying the alternative hypothesis,
which must be either |
allowzeropval |
A boolean, specifying whether we will allow zero
p-values or not. Default is |
Details
If the total number of observations in both samples is n
,
first sort combined sample in O(n \log n)
before remaining
steps are linear in n
.
If beta
is not a positive value,
median difference is computed as follows:
m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\,
i=1, 2,\dots, n+m,\,\,\textnormal{ and } j=1, 2,\dots, i-1 \},
where || x_i - x_j ||_1
is the 1-norm, and
so if the data are univariate then
|| x_i - x_j ||_1 = |x_{i} - x_{j}|.
and finally median heuristic is beta = 1/m
.
This can be computed in O(n \log n )
time
using the algorithms of Johnson and Mizoguchi (1978)
and Croux and Rousseuw (1992); see mediandiff
for references.
The Laplacian kernel k
is defined as
k(x,y) = \exp [ -\beta ||x - y||_1 ].
The random seed is set for std::mt19937
and std::shuffle
in C++.
Value
A list with the following elements:
pval
The p-value of the test, if it is computed (
pval=TRUE
). Otherwise, it is set toNA
.stat
The statistic of the test, which is always computed.
beta
The kernel parameter used in the test. If
beta
was not initialised or negative, this will be the median heuristic value.
References
Bodenham, D. A., and Kawahara, Y. (2023) "euMMD: efficiently computing the MMD two-sample test statistic for univariate data." Statistics and Computing 33.5 (2023): 110.
Croux, C. and Rousseeuw, P. J. (1992), "Time-Efficient Algorithms for Two Highly Robust Estimators of Scale" In Computational Statistics: Volume 1: Proceedings of the 10th Symposium on Computational Statistics (pp. 411-428).
Johnson, D.B., and Mizoguchi, T. (1978), "Selecting the Kth Element in X + Y and X_1 + X_2 + ... + X_m", SIAM Journal of Computing, 7, 147-153.
See Also
mediandiff
Examples
x <- c(7.1, 1.2, 4.3, 0.4)
y <- c(5.5, 2.6, 8.7)
#setting the kernel parameter to be 0.1; setting seed=1 for reproducibility
mmd_list <- eummd(x, y, beta=0.1, seednum=1)
#now using median heuristic (default)
mmd_list <- eummd(x, y, seednum=1)
#now not computing the p-value, only the statistic
mmd_list <- eummd(x, y, pval=FALSE, seednum=1)
#now using a larger number of permutations
mmd_list <- eummd(x, y, numperm=1000, seednum=1)