R: Ball Divergence statistic

bd {Ball}

R Documentation

Ball Divergence statistic

Description

Compute the Ball Divergence statistic, a generic measure of the difference between probability distributions in Banach spaces.

Usage

bd(
  x,
  y = NULL,
  distance = FALSE,
  size = NULL,
  num.threads = 1,
  kbd.type = c("sum", "maxsum", "max")
)

Arguments

`x`	a numeric vector, matrix, data.frame, or a list containing at least two numeric vectors, matrices, or data.frames.
`y`	a numeric vector, matrix, or data.frame.
`distance`	if `distance = TRUE`, the elements of `x` will be considered as a distance matrix. Default: `distance = FALSE`.
`size`	a vector recording the sample size of each group.
`num.threads`	number of threads. If `num.threads = 0`, all available cores are used. Default: `num.threads = 1`.
`kbd.type`	a character string specifying the `K`-sample Ball Divergence test statistic; must be one of `"sum"`, `"maxsum"`, or `"max"`. Any unambiguous substring can be given. Default: `kbd.type = "sum"`.

Details

Given samples that contain no missing values, bd returns the Ball Divergence statistic. If distance = TRUE, the arguments x and y can be a dist object or a symmetric numeric matrix recording the distances between samples; otherwise, they are treated as data.
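As a hedged illustration of the distance = TRUE path, the sketch below pools two samples, builds a dist object with base R, and passes it to bd together with size. Supplying the pooled distance matrix as x with size = c(n, m) is an assumption about how the groups are delimited, inferred from the argument descriptions above; the variable names are illustrative only.

```r
set.seed(1)
n <- 30
m <- 40
x <- matrix(rnorm(n * 2), ncol = 2)            # first sample, 2-dimensional
y <- matrix(rnorm(m * 2, mean = 1), ncol = 2)  # second sample, shifted mean

dx <- dist(rbind(x, y))  # Euclidean distances of the pooled sample
sizes <- c(n, m)         # first n rows belong to group 1, the next m to group 2

# Only call the package if it is installed; the data preparation above is base R.
if (requireNamespace("Ball", quietly = TRUE)) {
  print(Ball::bd(dx, distance = TRUE, size = sizes))  # distance-matrix input (assumed usage)
  print(Ball::bd(x, y))                               # raw-data input, for comparison
}
```

The group sizes must add up to the number of observations behind the distance matrix, which is why sizes is built from the same n and m used to simulate the data.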

The Ball Divergence statistic measures the difference between the distributions of two datasets in Banach spaces. The Ball Divergence is proven to be zero if and only if the two distributions are identical.

The definition of the Ball Divergence statistic is as follows. Suppose we are given two independent samples \{X_{1}, \ldots, X_{n}\} with the associated probability measure \mu and \{Y_{1}, \ldots, Y_{m}\} with \nu, where the observations in each sample are i.i.d. Let \rho denote the distance, and let \delta(x,y,z)=I(z\in \bar{B}(x, \rho(x,y))) indicate whether z is located in the closed ball \bar{B}(x, \rho(x,y)) with center x and radius \rho(x, y). We denote:

A_{ij}^{X}=\frac{1}{n}\sum_{u=1}^{n}{\delta(X_i,X_j,X_u)}, \quad A_{ij}^{Y}=\frac{1}{m}\sum_{v=1}^{m}{\delta(X_i,X_j,Y_v)},

C_{kl}^{X}=\frac{1}{n}\sum_{u=1}^{n}{\delta(Y_k,Y_l,X_u)}, \quad C_{kl}^{Y}=\frac{1}{m}\sum_{v=1}^{m}{\delta(Y_k,Y_l,Y_v)}.

A_{ij}^X represents the proportion of the sample \{X_{1}, \ldots, X_{n}\} located in the closed ball \bar{B}(X_i,\rho(X_i,X_j)), and A_{ij}^Y represents the proportion of the sample \{Y_{1}, \ldots, Y_{m}\} located in the same ball. Meanwhile, C_{kl}^X and C_{kl}^Y represent the corresponding proportions located in the ball \bar{B}(Y_k,\rho(Y_k,Y_l)). The Ball Divergence statistic is defined as:

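D_{n,m}=A_{n,m}+C_{n,m},

where

A_{n,m}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}{(A_{ij}^{X}-A_{ij}^{Y})^{2}}, \quad C_{n,m}=\frac{1}{m^{2}}\sum_{k,l=1}^{m}{(C_{kl}^{X}-C_{kl}^{Y})^{2}}.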
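To make the definition concrete, here is a minimal base R sketch that evaluates D_{n,m} by brute force. It assumes Euclidean distance for \rho and takes A_{n,m} and C_{n,m} to be the averaged squared differences of the ball proportions, as in Pan et al. (2018). The function name ball_divergence_sketch is hypothetical; this is not the package's implementation and makes no attempt at efficiency.

```r
# Brute-force two-sample Ball Divergence from the definition (illustrative only).
ball_divergence_sketch <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)
  d <- as.matrix(dist(rbind(x, y)))  # pooled Euclidean distance matrix
  ix <- seq_len(n)                   # rows of the pooled sample that belong to X
  iy <- n + seq_len(m)               # rows of the pooled sample that belong to Y

  # Average, over all balls whose centre and radius-defining point come from
  # `centre`, of the squared difference between the X and Y ball proportions.
  part <- function(centre) {
    total <- 0
    for (i in centre) {
      for (j in centre) {
        r <- d[i, j]                 # radius rho(centre point i, point j)
        p_x <- mean(d[i, ix] <= r)   # proportion of X inside the closed ball
        p_y <- mean(d[i, iy] <= r)   # proportion of Y inside the closed ball
        total <- total + (p_x - p_y)^2
      }
    }
    total / length(centre)^2
  }

  part(ix) + part(iy)                # A_{n,m} + C_{n,m}
}

ball_divergence_sketch(c(0, 1), c(10, 11))  # 1.25
ball_divergence_sketch(1:5, 1:5)            # 0 for identical samples
```

In the first call the two samples are completely separated, so every ball centred in one sample contains no points of the other; in the second call the samples coincide, so every pair of proportions is equal and the statistic is zero.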

Ball Divergence can be generalized to the K-sample test problem. Suppose we have K groups of samples, where the k-th group contains n_{k} observations. One way to define the K-sample Ball Divergence statistic is to directly sum up the two-sample Ball Divergence statistics of all sample pairs (kbd.type = "sum")

\sum_{1 \leq k < l \leq K}{D_{n_{k},n_{l}}},

or to find one sample with the largest difference to the others (kbd.type = "maxsum")

\max_{t}{\sum_{s=1, s \neq t}^{K}{D_{n_{s}, n_{t}}}},

or to aggregate the K-1 most significantly different two-sample Ball Divergence statistics (kbd.type = "max")

\sum_{k=1}^{K-1}{D_{(k)}},

where D_{(1)}, \ldots, D_{(K-1)} are the K-1 largest two-sample Ball Divergence statistics among \{D_{n_s, n_t}| 1 \leq s < t \leq K\}. When K=2, all three types of K-sample Ball Divergence statistic reduce to the two-sample Ball Divergence statistic.
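The three aggregation rules can be sketched in base R as follows. The sketch assumes the pairwise two-sample statistics have already been collected in a symmetric K x K matrix D with a zero diagonal; the function name kbd_aggregate_sketch and the example values are made up for illustration and are not part of the package.

```r
# Combine pairwise two-sample statistics into a K-sample statistic (illustrative only).
kbd_aggregate_sketch <- function(D, type = c("sum", "maxsum", "max")) {
  type <- match.arg(type)          # accepts any unambiguous substring
  K <- nrow(D)
  pairwise <- D[upper.tri(D)]      # each pair (s, t) with s < t counted once
  switch(type,
    sum    = sum(pairwise),                     # sum over all pairs
    maxsum = max(rowSums(D) - diag(D)),         # group most different from the rest
    max    = sum(sort(pairwise, decreasing = TRUE)[seq_len(K - 1)])  # K-1 largest pairs
  )
}

# Hypothetical pairwise statistics for K = 4 groups.
D <- matrix(1, 4, 4)
diag(D) <- 0
D[1, 2] <- D[2, 1] <- 5
D[3, 4] <- D[4, 3] <- 4

kbd_aggregate_sketch(D, "sum")     # 13
kbd_aggregate_sketch(D, "maxsum")  # 7
kbd_aggregate_sketch(D, "max")     # 10
```

With K = 4 the three rules give different values, which shows that they weight the pairwise differences differently; for K = 2 the matrix holds a single pair and all three return the same number.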

See bd.test for a test of distribution equality based on the Ball Divergence.

Value

`bd`	Ball Divergence statistic

Author(s)

Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang

References

Wenliang Pan, Yuan Tian, Xueqin Wang, Heping Zhang. Ball Divergence: Nonparametric two sample test. Ann. Statist. 46 (2018), no. 3, 1109–1137. doi:10.1214/17-AOS1579. https://projecteuclid.org/euclid.aos/1525313077

Examples

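############# Ball Divergence #############
library(Ball)
set.seed(1)     # make the simulated samples reproducible
x <- rnorm(50)  # first sample
y <- rnorm(50)  # second sample, drawn from the same distribution
bd(x, y)        # two-sample Ball Divergence statistic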

[Package Ball version 1.3.13 Index]