sdcov {semidist} R Documentation

Semi-distance covariance and correlation statistics

Description

Compute the statistics (or sample estimates) of semi-distance covariance and correlation. The semi-distance correlation is a standardized version of semi-distance covariance, and it can measure the dependence between a multivariate continuous variable and a categorical variable. See Details for the definition of semi-distance covariance and semi-distance correlation.

Usage

sdcov(X, y, type = "V", return_mat = FALSE)

sdcor(X, y)

Arguments

X

Data of multivariate continuous variables, which should be an $n$-by-$p$ matrix or a vector of length $n$ (for a univariate variable).

y

Data of a categorical variable, which should be a factor of length $n$.

type

Type of statistic: "V" (the default) or "U". See Details.

return_mat

A boolean. If FALSE (the default), only the calculated statistic is returned. If TRUE, also return the matrix of the distances of X and the divergences of y, which is useful for the permutation test.

Details

For $\bm{X} \in \mathbb{R}^{p}$ and $Y \in \{1, 2, \cdots, R\}$, the (population-level) semi-distance covariance is defined as

$$\mathrm{SDcov}(\bm{X}, Y) = \mathrm{E}\left[\|\bm{X}-\widetilde{\bm{X}}\|\left(1-\sum_{r=1}^R I(Y=r,\widetilde{Y}=r)/p_r\right)\right],$$

where $p_r = P(Y = r)$ and $(\widetilde{\bm{X}}, \widetilde{Y})$ is an iid copy of $(\bm{X}, Y)$. The (population-level) semi-distance correlation is defined as

$$\mathrm{SDcor}(\bm{X}, Y) = \dfrac{\mathrm{SDcov}(\bm{X}, Y)}{\mathrm{dvar}(\bm{X})\sqrt{R-1}},$$

where $\mathrm{dvar}(\bm{X})$ is the distance variance (Szekely, Rizzo, and Bakirov 2007) of $\bm{X}$.
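
As a quick sanity check of the definition (a worked step added for illustration, assuming $\mathrm{E}\|\bm{X}\| < \infty$): when $\bm{X}$ and $Y$ are independent, the expectation factorizes and $P(Y = r, \widetilde{Y} = r) = p_r^2$, so the semi-distance covariance vanishes:

$$\mathrm{SDcov}(\bm{X}, Y) = \mathrm{E}\|\bm{X}-\widetilde{\bm{X}}\| \left(1-\sum_{r=1}^R \frac{p_r^2}{p_r}\right) = \mathrm{E}\|\bm{X}-\widetilde{\bm{X}}\| \left(1-\sum_{r=1}^R p_r\right) = 0.$$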

With $n$ observations $\{(\bm{X}_i, Y_i)\}_{i=1}^{n}$, sdcov() and sdcor() can compute the sample estimates for the semi-distance covariance and correlation.

If type = "V", the semi-distance covariance statistic is computed as a V-statistic, which takes a form very similar to the energy-based statistic with double centering and is always non-negative. Specifically,

$$\text{SDcov}_n(\bm{X}, y) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl},$$

where

$$A_{kl} = a_{kl} - \bar{a}_{k.} - \bar{a}_{.l} + \bar{a}_{..}$$

is the double centering (Szekely, Rizzo, and Bakirov 2007) of $a_{kl} = \| \bm{X}_k - \bm{X}_l \|$, and

$$B_{kl} = 1 - \sum_{r=1}^{R} I(Y_k = r) I(Y_l = r) / \hat{p}_r$$

with $\hat{p}_r = n_r / n = n^{-1}\sum_{i=1}^{n} I(Y_i = r)$. The semi-distance correlation statistic is

$$\text{SDcor}_n(\bm{X}, y) = \dfrac{\text{SDcov}_n(\bm{X}, y)}{\text{dvar}_n(\bm{X})\sqrt{R - 1}},$$

where $\text{dvar}_n(\bm{X})$ is the V-statistic of distance variance of $\bm{X}$.
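
The V-type formulas above can be sketched in base R as follows. This is an illustrative reading of the formulas, not the package's implementation: the function names sdcov_v_sketch() and sdcor_v_sketch() are hypothetical, R is taken to be the number of factor levels, and the distance variance V-statistic is assumed to be the square root of the mean of the squared double-centered distances (the choice that makes the ratio scale-invariant).

```r
# Illustrative base-R sketch of the V-type statistic (hypothetical helper names).
sdcov_v_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- as.factor(y)
  n <- nrow(X)
  yi <- as.integer(y)

  # a_kl = ||X_k - X_l||: Euclidean distances between rows of X
  a <- as.matrix(dist(X))
  # Double centering: subtract row means and column means, add the grand mean
  A <- sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)

  # p_hat_r = n_r / n
  p_hat <- tabulate(yi, nbins = nlevels(y)) / n
  # B_kl = 1 - 1 / p_hat_r when Y_k = Y_l = r, and 1 otherwise
  same <- outer(yi, yi, "==")
  B <- 1 - same / p_hat[yi]

  list(sdcov = sum(A * B) / n^2, A = A, R = nlevels(y))
}

sdcor_v_sketch <- function(X, y) {
  s <- sdcov_v_sketch(X, y)
  # Assumption: dvar_n(X) = sqrt((1 / n^2) * sum of A_kl^2)
  dvar_n <- sqrt(mean(s$A^2))
  s$sdcov / (dvar_n * sqrt(s$R - 1))
}

# Balanced, functionally dependent data: three groups at distinct unit vectors
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov_v_sketch(X, y)$sdcov)  # approximately 0.9428, i.e. 2 * sqrt(2) / 3
print(sdcor_v_sketch(X, y))        # 1, up to floating-point error
```

Under these assumptions the correlation reaches 1 for this balanced, perfectly separated design, which fits the normalization by $\sqrt{R - 1}$.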

If type = "U", then the semi-distance covariance statistic is computed as an “estimated U-statistic”, which is utilized in the independence test statistic and is not necessarily non-negative. Specifically,

$$\widetilde{\text{SDcov}}_n(\bm{X}, y) = \frac{1}{n(n-1)} \sum_{i \ne j} \| \bm{X}_i - \bm{X}_j \| \left(1 - \sum_{r=1}^{R} I(Y_i = r) I(Y_j = r) / \tilde{p}_r\right),$$

where $\tilde{p}_r = (n_r-1) / (n-1) = (n-1)^{-1}(\sum_{i=1}^{n} I(Y_i = r) - 1)$. Note that the test statistic of the semi-distance independence test is

$$T_n = n \cdot \widetilde{\text{SDcov}}_n(\bm{X}, y).$$
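
The U-type formula and the test statistic can likewise be sketched in base R. Again this is only an illustration of the formulas as written, with a hypothetical function name sdcov_u_sketch(); it is not the package's code.

```r
# Illustrative base-R sketch of the "estimated U-statistic" (hypothetical name).
sdcov_u_sketch <- function(X, y) {
  X <- as.matrix(X)
  y <- as.factor(y)
  n <- nrow(X)
  yi <- as.integer(y)

  # Pairwise Euclidean distances ||X_i - X_j||
  a <- as.matrix(dist(X))

  # p_tilde_r = (n_r - 1) / (n - 1)
  n_r <- tabulate(yi, nbins = nlevels(y))
  p_tilde <- (n_r - 1) / (n - 1)

  # Weight is 1 - 1 / p_tilde_r for pairs in the same class r, and 1 otherwise
  same <- outer(yi, yi, "==")
  w <- matrix(1, n, n)
  w[same] <- 1 - matrix(1 / p_tilde[yi], n, n)[same]
  diag(w) <- 0  # the sum runs over i != j only

  sum(a * w) / (n * (n - 1))
}

# Same functionally dependent design as in the Examples section
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
u <- sdcov_u_sketch(X, y)
print(u)             # 20 * sqrt(2) / 29 for this design
print(nrow(X) * u)   # T_n = n * U-type statistic
```

Setting the diagonal weights to zero also avoids a division by zero when a class has a single observation, since such a class contributes no off-diagonal same-class pairs.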

Value

The value of the corresponding sample statistic.

If the argument return_mat of sdcov() is set to TRUE, a list is returned that contains the statistic together with the matrix of the distances of X and the matrix of the divergences of y.
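
To illustrate why precomputed matrices help with a permutation test, here is a minimal base-R sketch. It is not the package's procedure: the function perm_test_sketch(), its argument num_perm, and the internal names dist_x and div_y are all hypothetical, and the p-value rule (reject for large $T_n$) is a common convention assumed here. The point is that permuting the labels only reorders the rows and columns of the divergence matrix, so neither matrix needs to be recomputed.

```r
# Illustrative permutation test built on the U-type statistic (hypothetical names).
perm_test_sketch <- function(X, y, num_perm = 199) {
  X <- as.matrix(X)
  y <- as.factor(y)
  n <- nrow(X)
  yi <- as.integer(y)

  # Distances of X, computed once
  dist_x <- as.matrix(dist(X))

  # Divergences of y, computed once; class counts do not change under permutation
  p_tilde <- (tabulate(yi, nbins = nlevels(y)) - 1) / (n - 1)
  same <- outer(yi, yi, "==")
  div_y <- matrix(1, n, n)
  div_y[same] <- 1 - matrix(1 / p_tilde[yi], n, n)[same]
  diag(div_y) <- 0

  # T_n for the labels reordered by idx: reindex rows and columns of div_y
  stat <- function(idx) n * sum(dist_x * div_y[idx, idx]) / (n * (n - 1))

  t_obs <- stat(seq_len(n))
  t_perm <- replicate(num_perm, stat(sample.int(n)))

  # Assumed convention: large values of T_n count as evidence of dependence
  p_value <- (1 + sum(t_perm >= t_obs)) / (1 + num_perm)
  list(statistic = t_obs, p_value = p_value)
}

set.seed(1)
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(perm_test_sketch(X, y))
```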

See Also

Examples

X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
print(sdcov(X, y))
print(sdcor(X, y))

# Simulated independent data (X and y generated separately) -----------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(sdcov(X, y))
print(sdcor(X, y))

# Simulated functionally dependent data (X is determined by y) --------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov(X, y))
print(sdcor(X, y))


[Package semidist version 0.1.0 Index]