eqdist.etest {energy} R Documentation

## Multisample E-statistic (Energy) Test of Equal Distributions

### Description

Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.

### Usage

eqdist.etest(x, sizes, distance = FALSE,
method=c("original","discoB","discoF"), R)
eqdist.e(x, sizes, distance = FALSE,
method=c("original","discoB","discoF"))
ksample.e(x, sizes, distance = FALSE,
method=c("original","discoB","discoF"), ix = 1:sum(sizes))


### Arguments

| Argument | Description |
|---|---|
| x | data matrix of pooled sample |
| sizes | vector of sample sizes |
| distance | logical: if TRUE, first argument is a distance matrix |
| method | use original (default) or distance components (discoB, discoF) |
| R | number of bootstrap replicates |
| ix | a permutation of the row indices of x |

### Details

The k-sample multivariate \mathcal{E}-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.
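The pooled layout can be illustrated with a small base R sketch. The sample matrices `s1`, `s2`, `s3` and the helper variables `first`, `last`, `rows` are hypothetical names introduced here for illustration; they are not part of the package.

```r
# Three hypothetical samples with 4, 3 and 5 observations in d = 2 dimensions
set.seed(1)
s1 <- matrix(rnorm(8), ncol = 2)
s2 <- matrix(rnorm(6), ncol = 2)
s3 <- matrix(rnorm(10), ncol = 2)

x <- rbind(s1, s2, s3)                    # pooled sample, one observation per row
sizes <- c(nrow(s1), nrow(s2), nrow(s3))  # 4 3 5

# Row indices of each sample, recovered from sizes alone
last <- cumsum(sizes)
first <- last - sizes + 1
rows <- Map(seq, first, last)

# Equivalent group labels, e.g. for functions that take a factor
g <- factor(rep(seq_along(sizes), times = sizes))
```

Here `x[rows[[2]], ]` recovers the second sample, and `sum(sizes)` must equal `nrow(x)`.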

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.
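The resampling idea can be sketched in base R for the two-sample case. This is a minimal illustration, not the package's implementation: `estat2` and `perm_test` are hypothetical names, and the p-value convention (1 + number of replicates at least as large as the observed value, divided by R + 1) is an assumption that may differ from what eqdist.etest computes internally. The `ix` argument plays the same role as the `ix` argument of ksample.e: it permutes the row indices before the samples are split.

```r
# Two-sample energy statistic from a full distance matrix D,
# evaluated on the rows selected by the permutation ix
estat2 <- function(D, sizes, ix = seq_len(sum(sizes))) {
  n1 <- sizes[1]; n2 <- sizes[2]
  i1 <- ix[seq_len(n1)]        # rows assigned to sample 1
  i2 <- ix[n1 + seq_len(n2)]   # rows assigned to sample 2
  m12 <- mean(D[i1, i2]); m11 <- mean(D[i1, i1]); m22 <- mean(D[i2, i2])
  n1 * n2 / (n1 + n2) * (2 * m12 - m11 - m22)
}

# Approximate permutation test with R random permutations of the pooled rows
perm_test <- function(x, sizes, R = 199) {
  D <- as.matrix(dist(x))
  observed <- estat2(D, sizes)
  N <- sum(sizes)
  reps <- replicate(R, estat2(D, sizes, ix = sample(N)))
  p <- (1 + sum(reps >= observed)) / (R + 1)   # assumed convention
  list(statistic = observed, p.value = p)
}

set.seed(42)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
res <- perm_test(x, sizes = c(20, 20), R = 199)
```

The distance matrix is computed once and only the index vector is permuted, so each replicate costs no more than the original statistic.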

The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.

The k-sample multivariate \mathcal{E}-statistic for testing equal distributions is returned. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

The two-sample \mathcal{E}-statistic proposed by Szekely and Rizzo (2004) is the e-distance e(S_i,S_j), defined for two samples S_i, S_j of size n_i, n_j by

e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}], 

where

M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,

\|\cdot\| denotes Euclidean norm, and X_{ip} denotes the p-th observation in the i-th sample.
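The definition above can be checked by hand on a tiny example. The following base R sketch computes e(S_i,S_j) directly from the formula; `edist2` is a hypothetical helper name, not a package function. For the one-dimensional samples S_1 = {0, 2} and S_2 = {5}, M_{12} = 4, M_{11} = 1 and M_{22} = 0, so e = (2/3)(8 - 1 - 0) = 14/3.

```r
# Two-sample e-distance computed directly from the definition
edist2 <- function(a, b) {
  a <- as.matrix(a); b <- as.matrix(b)
  n1 <- nrow(a); n2 <- nrow(b)
  D <- as.matrix(dist(rbind(a, b)))   # Euclidean distances of the pooled rows
  i1 <- seq_len(n1); i2 <- n1 + seq_len(n2)
  M12 <- mean(D[i1, i2])              # mean between-sample distance
  M11 <- mean(D[i1, i1])              # mean within-sample distance, sample 1
  M22 <- mean(D[i2, i2])              # mean within-sample distance, sample 2
  n1 * n2 / (n1 + n2) * (2 * M12 - M11 - M22)
}

e <- edist2(c(0, 2), 5)
```

Note that the within-sample means include the zero diagonal, matching the double sum over all p and q in M_{ii}.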

The original (default method) k-sample \mathcal{E}-statistic is defined by summing the pairwise e-distances over all k(k-1)/2 pairs of samples:

\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j). 

Large values of \mathcal{E} are significant, that is, they are evidence against the hypothesis of equal distributions.
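As an illustration of the sum over pairs, the sketch below computes the original k-sample statistic in base R. The function name `kstat` is hypothetical, and this is a direct transcription of the formulas above rather than the package's code. With S_1 = {0, 2}, S_2 = {5}, S_3 = {0, 2}, the pairwise e-distances are 14/3, 0 and 14/3, so the statistic is 28/3.

```r
# Original k-sample statistic: sum of pairwise e-distances
kstat <- function(x, sizes) {
  D <- as.matrix(dist(x))
  grp <- rep(seq_along(sizes), times = sizes)     # sample label of each row
  M <- function(i, j) mean(D[grp == i, grp == j]) # M_ij from the definition
  pairs <- combn(length(sizes), 2)                # all k(k-1)/2 pairs i < j
  e <- apply(pairs, 2, function(p) {
    i <- p[1]; j <- p[2]
    w <- sizes[i] * sizes[j] / (sizes[i] + sizes[j])
    w * (2 * M(i, j) - M(i, i) - M(j, j))
  })
  sum(e)
}

E <- kstat(c(0, 2, 5, 0, 2), sizes = c(2, 1, 2))
```

Identical samples contribute zero, which is why only the two pairs involving S_2 add to the total here.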

The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows. In the above equation, the weights \frac{n_i n_j}{n_i+n_j} are replaced with

\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} = \frac{n_i n_j}{2N}

where N is the total number of observations: N=n_1+...+n_k.
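The effect of the reweighting can be seen in a short base R sketch. `ksample_sketch` is a hypothetical name and the code only transcribes the two weight formulas above; it is not the package's implementation. For two samples, N = n_1 + n_2, so the factor (n_i + n_j)/(2N) equals 1/2 and the between-sample statistic is exactly half the original statistic.

```r
# k-sample statistic with either the original or the discoB pair weights
ksample_sketch <- function(x, sizes, method = c("original", "discoB")) {
  method <- match.arg(method)
  D <- as.matrix(dist(x))
  N <- sum(sizes)
  grp <- rep(seq_along(sizes), times = sizes)
  M <- function(i, j) mean(D[grp == i, grp == j])
  terms <- apply(combn(length(sizes), 2), 2, function(p) {
    ni <- sizes[p[1]]; nj <- sizes[p[2]]
    w <- if (method == "original") ni * nj / (ni + nj) else ni * nj / (2 * N)
    w * (2 * M(p[1], p[2]) - M(p[1], p[1]) - M(p[2], p[2]))
  })
  sum(terms)
}

x2 <- c(0, 2, 5)
E2 <- ksample_sketch(x2, sizes = c(2, 1), method = "original")  # 14/3
B2 <- ksample_sketch(x2, sizes = c(2, 1), method = "discoB")    # 7/3

B3 <- ksample_sketch(c(0, 2, 5, 0, 2), sizes = c(2, 1, 2), method = "discoB")
```

With three or more samples the two statistics are no longer proportional in general, because the factor (n_i + n_j)/(2N) varies from pair to pair when the sample sizes differ.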

The discoF method is based on the disco F ratio, while the discoB method is based on the between sample component.
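The relation between the two disco statistics can be sketched as follows. This is an assumption-laden illustration: it takes the one-way decomposition from the cited DISCO paper to be total = between + within, with within = \sum_j (n_j/2) M_{jj} and total = (N/2) times the mean of all pooled pairwise distances, and takes the F ratio to be (between/(k-1)) / (within/(N-k)). The name `disco_sketch` is hypothetical, and the package functions disco and disco.between remain the reference.

```r
# One-way distance components and F ratio (sketch under the stated assumptions)
disco_sketch <- function(x, sizes) {
  D <- as.matrix(dist(x))
  k <- length(sizes); N <- sum(sizes)
  grp <- rep(seq_len(k), times = sizes)
  M <- function(i, j) mean(D[grp == i, grp == j])
  # within-sample component: sum of (n_j / 2) * M_jj
  within <- sum(sapply(seq_len(k), function(j) sizes[j] / 2 * M(j, j)))
  # between-sample component: pair weights n_i n_j / (2N)
  between <- sum(apply(combn(k, 2), 2, function(p) {
    i <- p[1]; j <- p[2]
    sizes[i] * sizes[j] / (2 * N) * (2 * M(i, j) - M(i, i) - M(j, j))
  }))
  total <- N / 2 * mean(D)
  list(between = between, within = within, total = total,
       F = (between / (k - 1)) / (within / (N - k)))
}

out <- disco_sketch(c(0, 2, 5, 0, 2), sizes = c(2, 1, 2))
```

For this toy data the components are between = 2.8, within = 2 and total = 4.8, so the decomposition holds and the ratio is 1.4.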

Also see disco and disco.between functions.

### Value

A list with class htest containing

| Component | Description |
|---|---|
| method | description of test |
| statistic | observed value of the test statistic |
| p.value | approximate p-value of the test |
| data.name | description of data |

eqdist.e returns test statistic only.

### Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.

### Author(s)

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

### References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

M. L. Rizzo and G. J. Szekely (2010). DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, Vol. 4, No. 2, 1034-1055.
doi: 10.1214/09-AOAS245

Szekely, G. J. (2000) Technical Report 03-05: \mathcal{E}-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

### See Also

ksample.e, edist, disco, disco.between, energy.hclust.

### Examples

 data(iris)

## test if the 3 varieties of iris data (d=4) have equal distributions
eqdist.etest(iris[,1:4], c(50,50,50), R = 199)

## example that uses a distance matrix; compared below with edist and disco
x <- matrix(rnorm(100), nrow=20)
y <- matrix(rnorm(100), nrow=20)
X <- rbind(x, y)
d <- dist(X)

# should match edist default statistic
set.seed(1234)
eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)

# comparison with edist
edist(d, sizes=c(20, 20), distance=TRUE)

# for comparison
g <- as.factor(rep(1:2, c(20, 20)))
set.seed(1234)
disco(d, factors=g, distance=TRUE, R=199)

# between-sample component; should match the statistic from
# eqdist.e(d, sizes=c(20, 20), distance=TRUE, method="discoB")
set.seed(1234)
disco.between(d, factors=g, distance=TRUE, R=199)


[Package energy version 1.7-10 Index]