R: Fast dCor and dCov for bivariate data only

dcov2d {energy}

R Documentation

Fast dCor and dCov for bivariate data only

Description

For bivariate data only, these are fast O(n log n) implementations of the distance correlation and distance covariance statistics. The U-statistic for dCov^2 is unbiased; the V-statistic is the original definition in Szekely, Rizzo, and Bakirov (2007). These algorithms do not store the distance matrices, so they are suitable for large samples.

Usage

dcor2d(x, y, type = c("V", "U"))
dcov2d(x, y, type = c("V", "U"), all.stats = FALSE)

Arguments

`x`	numeric vector
`y`	numeric vector
`type`	"V" or "U", for V- or U-statistics
`all.stats`	logical; set to TRUE internally when dcov2d is called from dcor2d

Details

The unbiased (squared) dCov is documented in dcovU, for multivariate data in arbitrary, not necessarily equal, dimensions. dcov2d and dcor2d provide a faster O(n log n) algorithm for the bivariate case (x, y) only, where X and Y are real-valued random variables. The O(n log n) algorithm was proposed by Huo and Szekely (2016). It is faster than the O(n^2) computation once the sample size n exceeds a certain threshold (roughly n > 50, according to the example below). It does not store the distance matrices, so the sample size can be very large.
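For comparison, the V-statistic can be computed directly from its definition by double-centering the two n x n distance matrices. The following is a minimal base R sketch of that O(n^2) computation, not the algorithm used by dcov2d; the name dcov2_naive is hypothetical and is not part of energy. It shows the storage cost that the fast algorithm avoids.

```r
# Hypothetical helper (not part of energy): O(n^2) reference computation
# of the V-statistic dCov_n^2(x, y) for numeric vectors x and y.
dcov2_naive <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  # Double-center a distance matrix: subtract row and column means,
  # then add back the grand mean.
  double_center <- function(d) {
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  A <- double_center(as.matrix(dist(x)))  # stores all n^2 distances
  B <- double_center(as.matrix(dist(y)))
  mean(A * B)                             # (1/n^2) * sum_ij A_ij * B_ij
}

set.seed(1)
x <- rnorm(200)
y <- x^2 + rnorm(200, sd = 0.1)  # nonlinear dependence
dcov2_naive(x, y)
```

Because A and B each hold n^2 entries, this approach becomes impractical for very large n, which is the case dcov2d is intended for.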

Value

By default, dcov2d returns the V-statistic V_n = dCov_n^2(x, y), and if type="U", it returns the U-statistic, unbiased for dCov^2(X, Y). The argument all.stats=TRUE is used internally when the function is called from dcor2d.

By default, dcor2d returns dCor_n^2(x, y); if type="U", it returns a bias-corrected estimator of squared dCor, equivalent to bcdcor.
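The relation between the two default return values can be sketched in base R: squared dCor is the V-statistic for (x, y) divided by the square root of the product of the V-statistics for (x, x) and (y, y). The helpers vstat and dcor2_naive below are hypothetical, use an O(n^2) computation, and are only meant to illustrate the definition, not how dcor2d is implemented.

```r
# Hypothetical helpers (not part of energy), base R only.
vstat <- function(x, y) {
  # Double-centered distance matrix of a numeric vector
  center <- function(v) {
    d <- as.matrix(dist(v))
    sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
  }
  mean(center(x) * center(y))  # V-statistic dCov_n^2
}

dcor2_naive <- function(x, y) {
  denom <- sqrt(vstat(x, x) * vstat(y, y))
  # Return 0 when either sample is constant
  if (denom > 0) vstat(x, y) / denom else 0
}

x <- c(-2, -1, 0, 1, 2, 4)
dcor2_naive(x, 3 * x + 1)  # exact linear relation
dcor2_naive(x, x^2)        # nonlinear relation
```

For an exact linear relation the first call returns 1 up to rounding, since the double-centered distance matrices are proportional.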

Because these functions do not store the distance matrices, they are useful when the sample size is large and the data are bivariate.

Note

The U-statistic U_n can be negative in the lower tail, so the square root of the U-statistic is not applied. Similarly, dcor2d(x, y, "U") is bias-corrected and can be negative in the lower tail, so we do not take the square root. The original definitions of dCov and dCor (SRB 2007, SR 2009) were based on V-statistics, which are non-negative, and were defined using the square root of the V-statistics.
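To see why no square root is taken, the U-statistic can be computed from U-centered distance matrices, as in the sketch below. This is an O(n^2) base R illustration under the usual U-centering definition; dcovU_naive is a hypothetical name, not a function in energy. Under independence, a noticeable share of the simulated values falls below zero.

```r
# Hypothetical helper (not part of energy): O(n^2) U-statistic for dCov^2.
dcovU_naive <- function(x, y) {
  n <- length(x)
  stopifnot(length(y) == n, n > 3)
  u_center <- function(v) {
    d <- as.matrix(dist(v))
    r <- rowSums(d)  # d is symmetric, so row sums equal column sums
    u <- d - outer(r, r, "+") / (n - 2) + sum(d) / ((n - 1) * (n - 2))
    diag(u) <- 0     # U-centered matrices have a zero diagonal
    u
  }
  sum(u_center(x) * u_center(y)) / (n * (n - 3))
}

set.seed(2)
u <- replicate(200, dcovU_naive(rnorm(20), rnorm(20)))
mean(u < 0)  # proportion of negative values under independence
range(u)
```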

It has been suggested that, instead of taking the square root of the U-statistic, one could take the square root of |U_n| and then apply the sign of U_n. That introduces more bias than the original dCor, and should never be used.

Author(s)

Maria L. Rizzo mrizzo@bgsu.edu and Gabor J. Szekely

References

Huo, X. and Szekely, G.J. (2016). Fast computing for distance covariance. Technometrics, 58(4), 435-447.

Szekely, G.J. and Rizzo, M.L. (2014). Partial distance correlation with methods for dissimilarities. Annals of Statistics, 42(6), 2382-2412.

Szekely, G.J., Rizzo, M.L., and Bakirov, N.K. (2007). Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6), 2769-2794. doi:10.1214/009053607000000505

Examples

  
    library(energy)

    ## these are equivalent, but 2d is faster for n > 50
    n <- 100
    x <- rnorm(n)
    y <- rnorm(n)
    all.equal(dcov(x, y)^2, dcov2d(x, y), check.attributes = FALSE)
    all.equal(bcdcor(x, y), dcor2d(x, y, "U"), check.attributes = FALSE)

    x <- rlnorm(400)
    y <- rexp(400)
    dcov.test(x, y, R=199)    #permutation test
    dcor.test(x, y, R=199)

[Package energy version 1.7-11 Index]