R: Naive computation for Energy Distance

energydist {eummd}

R Documentation

Naive computation for Energy Distance

Description

Computes the energy distance between two samples and, optionally, a permutation-based p-value. Suitable for multivariate data. This is the naive approach, which is quadratic in the number of observations.

Usage

energydist(
  X,
  Y,
  pval = TRUE,
  numperm = 200,
  seednum = 0,
  alternative = c("greater", "two.sided"),
  allowzeropval = FALSE
)

Arguments

`X`	Matrix (or vector) of observations in first sample.
`Y`	Matrix (or vector) of observations in second sample.
`pval`	Boolean specifying whether to compute the p-value. Default is `TRUE`.
`numperm`	Number of permutations. Default is `200`.
`seednum`	Seed number for generating permutations. Default is `0`, which means seed is set randomly. For values larger than `0`, results will be reproducible.
`alternative`	A character string specifying the alternative hypothesis, which must be either `"greater"` (default) or `"two.sided"`. In Gretton et al., the MMD test statistic is specified so that if it is significantly larger than zero, then the null hypothesis that the two samples come from the same distribution should be rejected. For this reason, `"greater"` is recommended. The test will still work in many cases with `"two.sided"` specified, but this could lead to problems in certain cases.
`allowzeropval`	A boolean specifying whether zero p-values are allowed. Default is `FALSE`; in that case a threshold of `0.5 / (numperm+1)` is used, and if the computed p-value is less than this threshold, it is set to the threshold. This avoids the possibility of zero p-values.
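
The thresholding rule described for `allowzeropval` can be sketched in a few lines of R. The helper name `floor_pvalue` is hypothetical and not part of eummd; it only illustrates the rule as stated above, not how the package implements it internally.

```r
# Hypothetical helper (not part of eummd): apply the lower bound
# 0.5 / (numperm + 1) to a p-value unless zero p-values are allowed.
floor_pvalue <- function(pval, numperm, allowzeropval = FALSE) {
  if (allowzeropval) {
    return(pval)
  }
  threshold <- 0.5 / (numperm + 1)
  # A p-value below the threshold is replaced by the threshold itself.
  max(pval, threshold)
}

floor_pvalue(0, numperm = 200)                        # 0.5 / 201
floor_pvalue(0.3, numperm = 200)                      # unchanged: 0.3
floor_pvalue(0, numperm = 200, allowzeropval = TRUE)  # stays 0
```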

Details

First checks that the numbers of columns (the dimension) are equal. Suppose matrix X has n rows and d columns, and matrix Y has m rows; the function checks that Y also has d columns and throws an error if it does not. It then flattens the matrices to vectors (if d=1, they are already vectors) and calls the C++ method. If the first sample has n d-dimensional observations and the second sample has m d-dimensional observations, then the algorithm computes the statistic in O((n+m)^2) time.
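
To make the quadratic cost concrete, here is a minimal plain-R sketch of a naive energy distance computation. The helper `naive_energy_distance` is hypothetical and not part of eummd. It assumes the common form 2 E|X-Y| - E|X-X'| - E|Y-Y'| with Euclidean distances, and within-sample means that include the zero diagonal, so its scaling may differ from the value returned by `energydist`.

```r
# Hypothetical helper (not part of eummd): naive energy distance in plain R.
naive_energy_distance <- function(X, Y) {
  X <- as.matrix(X)  # a vector becomes an n x 1 matrix
  Y <- as.matrix(Y)
  if (ncol(X) != ncol(Y)) {
    stop("X and Y must have the same number of columns")
  }
  n <- nrow(X)
  m <- nrow(Y)
  # All (n+m)^2 pairwise Euclidean distances of the pooled sample;
  # this is the quadratic step.
  D <- as.matrix(dist(rbind(X, Y)))
  ix <- seq_len(n)
  iy <- n + seq_len(m)
  mean_xy <- mean(D[ix, iy])  # between-sample distances
  mean_xx <- mean(D[ix, ix])  # within X, zero diagonal included
  mean_yy <- mean(D[iy, iy])  # within Y, zero diagonal included
  2 * mean_xy - mean_xx - mean_yy
}

naive_energy_distance(c(0, 1), c(2, 3))  # 2 * 2 - 0.5 - 0.5 = 3
```

Under this assumed form, two identical samples give a value of zero, and the value grows as the samples move apart.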

The random seed is set for std::mt19937, which is used with std::shuffle in C++ to generate the permutations.
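
The role of the seed and the permutations can be illustrated with a plain-R sketch of a one-sided ("greater") permutation test. This is not the eummd implementation: the helper `perm_pvalue_greater` is hypothetical, it uses R's own random number generator through `set.seed` rather than std::mt19937, and the p-value convention (the proportion of permuted statistics at least as large as the observed one) is an assumption.

```r
# Hypothetical sketch (not the eummd implementation) of a permutation test.
perm_pvalue_greater <- function(X, Y, numperm = 200, seednum = 0) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  n <- nrow(X)
  N <- n + nrow(Y)
  # Pairwise distances are computed once and reused for every permutation.
  D <- as.matrix(dist(rbind(X, Y)))
  # Statistic for the split "rows ix form sample 1, the rest form sample 2".
  stat_from_index <- function(ix) {
    iy <- setdiff(seq_len(N), ix)
    2 * mean(D[ix, iy]) - mean(D[ix, ix]) - mean(D[iy, iy])
  }
  # Mirror the documented convention: only seed when seednum > 0.
  # Note that set.seed changes the global RNG state of the R session.
  if (seednum > 0) {
    set.seed(seednum)
  }
  observed <- stat_from_index(seq_len(n))
  permuted <- replicate(numperm, stat_from_index(sample.int(N, n)))
  list(stat = observed, pval = mean(permuted >= observed))
}

res <- perm_pvalue_greater(c(0, 1, 0.5, 0.2), c(10, 11, 10.5, 10.2),
                           numperm = 100, seednum = 1)
res$stat
res$pval
```

With a positive `seednum`, repeated calls with the same inputs draw the same permutations and therefore return the same p-value.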

Value

A list with the following elements:

pval: The p-value of the test, if it is computed (pval=TRUE).
stat: The statistic of the test, which is always computed.

References

Baringhaus L. and Franz C. (2004) "On a new multivariate two-sample test." Journal of Multivariate Analysis 88(1):190-206.

Szekely G. J. and Rizzo M. L. (2004) "Testing for equal distributions in high dimension." InterStat 5(16.10):1249-1272.

Examples


library(eummd)

# two small 2-dimensional samples: 6 and 4 observations
X <- matrix(c(1:12), ncol=2, byrow=TRUE)
Y <- matrix(c(13:20), ncol=2, byrow=TRUE)

# computing the statistic only, without a p-value
energydistList <- energydist(X=X, Y=Y, pval=FALSE)

# computing p-value with the default 200 permutations
energydistList <- energydist(X=X, Y=Y)

# computing p-value
# using 1000 permutations and seed 1 for reproducibility.
energydistList <- energydist(X=X, Y=Y, numperm=1000, seednum=1)

[Package eummd version 0.1.9 Index]