energy {twinning} | R Documentation |
Energy distance computation
Description
energy() computes the energy distance (Székely and Rizzo, 2013) between a given dataset and a set of points of the same dimension.
Usage
energy(data, points)
Arguments
data |
The dataset including both the predictors and response(s). A numeric matrix is expected. If the dataset has factor columns, the user is expected to convert them to numeric using a coding method. |
points |
The set of points for which the energy distance with respect to data is computed. |
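The documentation does not say which coding method to use for factor columns. As an illustration only, the sketch below shows two common base R options: integer codes via data.matrix() and dummy (treatment) coding via model.matrix(). The data frame df and its columns are made up for this example.

```r
# Hypothetical data frame with one factor column (not from the package).
df <- data.frame(
  x = c(1.2, 0.7, 3.1, 2.4),
  g = factor(c("a", "b", "a", "c")),
  y = c(0.5, 1.1, 0.3, 2.0)
)

# Option 1: replace each factor by its integer level codes.
m_int <- data.matrix(df)

# Option 2: dummy coding; drop the intercept column that model.matrix() adds.
m_dummy <- model.matrix(~ ., data = df)[, -1, drop = FALSE]
```

Either matrix is numeric and could be passed as data. Integer codes impose an ordering on the levels, whereas dummy coding adds one column per non-reference level, so the choice affects the Euclidean distances that the energy distance is built on.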
Details
The smaller the energy distance, the more statistically similar the set of points is to the given dataset. The minimizer of the energy distance is known as support points (Mak and Joseph, 2018), which form the basis of the twinning method. Computing the energy distance between data and points involves Euclidean distance calculations among the rows of data, among the rows of points, and between the rows of data and points. Since data serves as the reference, the distance calculations among the rows of data are ignored for efficiency. Before computing the energy distance, the columns of data are scaled to zero mean and unit standard deviation. The means and standard deviations of the columns of data are then used to scale the respective columns of points.
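The steps described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: energy_sketch is a hypothetical helper, and the normalization (twice the mean cross distance minus the mean distance among points) is an assumption taken from the usual definition of energy distance with the data-to-data term dropped. Values returned by energy() may differ by a constant or a scaling factor.

```r
# Hypothetical helper, not part of the twinning package.
energy_sketch <- function(data, points) {
  data <- as.matrix(data)
  points <- as.matrix(points)
  stopifnot(ncol(data) == ncol(points))

  # Scale both sets with the column means and standard deviations of data.
  mu <- colMeans(data)
  sigma <- apply(data, 2, sd)
  data_s <- scale(data, center = mu, scale = sigma)
  points_s <- scale(points, center = mu, scale = sigma)

  # Euclidean distances between rows of points and rows of data (n x N).
  sq <- outer(rowSums(points_s^2), rowSums(data_s^2), "+") -
    2 * tcrossprod(points_s, data_s)
  d_cross <- sqrt(pmax(sq, 0))  # guard against tiny negative round-off

  # Euclidean distances among the rows of points (n x n, zero diagonal).
  d_points <- as.matrix(dist(points_s))

  # Assumed form: the data-to-data term is omitted because it does not
  # depend on points.
  2 * mean(d_cross) - mean(d_points)
}

# Usage on simulated data, mirroring the package example.
set.seed(1)
X <- rnorm(100)
Y <- rnorm(100, mean = X^2)
dat <- cbind(X, Y)
energy_sketch(dat, dat[sample(100, 20), ])
```

Because the data-to-data term is the same for every candidate set of points, omitting it does not change which set attains the smallest value, so the result can still be used to compare candidate sets.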
Value
Energy distance.
References
Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.
Székely, G. J., & Rizzo, M. L. (2013). Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8), 1249-1272.
Mak, S., & Joseph, V. R. (2018). Support Points. Annals of Statistics, 46, 2562-2592.
Examples
## Energy distance between a dataset and a random sample
library(twinning)
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
## Use 20 randomly chosen rows of data as the set of points
energy(data, data[sample(100, 20), ])