mmd {eummd}R Documentation

Naive computation for Maximum Mean Discrepancy

Description

Computes maximum mean discrepancy statistics with Laplacian or Gaussian kernel. Suitable for multivariate data. Naive approach, quadratic in number of observations.

Usage

mmd(
  X,
  Y,
  beta = -0.1,
  pval = TRUE,
  kernel = c("Laplacian", "Gaussian"),
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

X

Matrix (or vector) of observations in first sample.

Y

Matrix (or vector) of observations in second sample.

beta

kernel parameter. Must be positive; if not, computes median heuristic in quadratic time. Default value is -0.1, which will force median heuristic to be used.

pval

Boolean for whether to compute p-value or not.

kernel

String, either "Laplacian" or "Gaussian". Default is "Laplacian".

numperm

Number of permutations. Default is 200.

seednum

Seed number for generating permutations. Default is 0, which means seed is set randomly. For values larger than 0, results will be reproducible.

alternative

A character string specifying the alternative hypothesis, which must be either "greater" (default) or "two.sided". In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, "greater" is recommended. The test will still work in many cases with "two.sided" specified, but this could lead to problems in certain cases.

allowzeropval

A boolean, specifying whether we will allow zero p-values or not. Default is FALSE; then a threshold of 0.5 / (numperm+1) is used, and if the computed p-value is less than this threshold, it is then set to be this value. this avoids the possibility of zero p-values.

Details

First checks number of columns (dimension) are equal. Suppose matrix X has n rows and d columns, and matrix Y has m rows; checks that Y has d columns (if not, then throws error). Then flattens matrices to vectors (or, if d=1, they are already vectors. Then calls C++ method. If the first sample has n d-dimensional samples and the second sample has m d-dimensional samples, then the algorithm computes the statistic in O((n+m)^2) time.

Median difference is as follows:

m = \textnormal{median} \{ || x_i - x_j ||_1; \,\, i>j, \,\, i=1, 2,\dots, n+m,\,\,\textnormal{ and } j=1, 2,\dots, i-1 \},

where || x_i - x_j ||_1 is the 1-norm, and so if the data are d-dimensional then

|| x_i - x_j ||_1 = \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|,

and finally median heuristic is beta = 1/m. This can be computed in O( (n+m)^2 ) time.

The Laplacian kernel k is defined as

k(x,y) = \exp( -\beta || x_i - x_j ||_1 ).

Random seed is set for std::mt19937 and std::shuffle in C++.

Value

A list with the following elements:

pval

The p-value of the test, if it is computed (pval=TRUE).

stat

The statistic of the test, which is always computed.

beta

The kernel parameter used in the test. If beta was not initialised or negative, this will be the median heuristic value.

References

Gretton, A., Borgwardt, K. M., Rasch M. J., Schölkopf, B. and Smola, A. (2012) "A kernel two-sample test." The Journal of Machine Learning Research 13, no. 1, 723-773.

Examples


X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)
mmdList <- mmd(X=X, Y=Y, beta=0.1, pval=FALSE)

#using median heuristic
mmdList <- mmd(X=X, Y=Y, pval=FALSE)

#using median heuristic and computing p-value
mmdList <- mmd(X=X, Y=Y)

#using median heuristic and computing p-value
#using 1000 permutations and seed 1 for reproducibility.
mmdList <- mmd(X=X, Y=Y, numperm=1000, seednum=1)



[Package eummd version 0.1.9 Index]