R: Multisample E-statistic (Energy) Test of Equal Distributions

eqdist.etest {energy}

R Documentation

Multisample E-statistic (Energy) Test of Equal Distributions

Description

Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.

Usage

eqdist.etest(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"), R)
eqdist.e(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"))
ksample.e(x, sizes, distance = FALSE,
    method=c("original","discoB","discoF"), ix = 1:sum(sizes))

Arguments

`x`	data matrix of pooled sample
`sizes`	vector of sample sizes
`distance`	logical: if TRUE, first argument is a distance matrix
`method`	use original (default) or distance components (discoB, discoF)
`R`	number of bootstrap replicates
`ix`	a permutation of the row indices of x

Details

The k-sample multivariate \mathcal{E}-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates.

The function eqdist.e returns the test statistic only; it simply passes the arguments through to eqdist.etest with R = 0.

The k-sample multivariate \mathcal{E}-statistic for testing equal distributions is returned. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

The two-sample \mathcal{E}-statistic proposed by Szekely and Rizzo (2004) is the e-distance e(S_i,S_j), defined for two samples S_i, S_j of size n_i, n_j by

e(S_i,S_j)=\frac{n_i n_j}{n_i+n_j}[2M_{ij}-M_{ii}-M_{jj}],

where

M_{ij}=\frac{1}{n_i n_j}\sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \|X_{ip}-X_{jq}\|,

\|\cdot\| denotes Euclidean norm, and X_{ip} denotes the p-th observation in the i-th sample.

The original (default method) k-sample \mathcal{E}-statistic is defined by summing the pairwise e-distances over all k(k-1)/2 pairs of samples:

\mathcal{E}=\sum_{1 \leq i < j \leq k} e(S_i,S_j).

Large values of \mathcal{E} are significant.

The discoB method computes the between-sample disco statistic. For a one-way analysis, it is related to the original statistic as follows. In the above equation, the weights \frac{n_i n_j}{n_i+n_j} are replaced with

\frac{n_i + n_j}{2N}\frac{n_i n_j}{n_i+n_j} = \frac{n_i n_j}{2N}

where N is the total number of observations: N=n_1+...+n_k.

The discoF method is based on the disco F ratio, while the discoB method is based on the between sample component.

Also see disco and disco.between functions.

Value

A list with class htest containing

`method`	description of test
`statistic`	observed value of the test statistic
`p.value`	approximate p-value of the test
`data.name`	description of data

eqdist.e returns test statistic only.

Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object.

Author(s)

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

M. L. Rizzo and G. J. Szekely (2010). DISCO Analysis: A Nonparametric Extension of Analysis of Variance, Annals of Applied Statistics, Vol. 4, No. 2, 1034-1055.
doi:10.1214/09-AOAS245

Szekely, G. J. (2000) Technical Report 03-05: \mathcal{E}-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

Examples

 data(iris)

 ## test if the 3 varieties of iris data (d=4) have equal distributions
 eqdist.etest(iris[,1:4], c(50,50,50), R = 199)

 ## example that uses method="disco"
  x <- matrix(rnorm(100), nrow=20)
  y <- matrix(rnorm(100), nrow=20)
  X <- rbind(x, y)
  d <- dist(X)

  # should match edist default statistic
  set.seed(1234)
  eqdist.etest(d, sizes=c(20, 20), distance=TRUE, R = 199)

  # comparison with edist
  edist(d, sizes=c(20, 10), distance=TRUE)

  # for comparison
  g <- as.factor(rep(1:2, c(20, 20)))
  set.seed(1234)
  disco(d, factors=g, distance=TRUE, R=199)

  # should match statistic in edist method="discoB", above
  set.seed(1234)
  disco.between(d, factors=g, distance=TRUE, R=199)

[Package energy version 1.7-11 Index]