R: Partition datasets into statistically similar twin sets

twin {twinning}

R Documentation

Partition datasets into statistically similar twin sets

Description

twin() implements the twinning algorithm presented in Vakayil and Joseph (2022). It returns a partition of the dataset into two disjoint sets, termed twins, that are distributed similarly to each other and to the whole dataset. Such a partition is an optimal training-testing split (Joseph and Vakayil, 2021) for training and testing statistical and machine learning models, and it is model-independent. The statistical similarity also allows either twin to be treated as a lossy compression of the dataset, which makes model building tractable on Big Data.

Usage

twin(data, r, u1 = NULL, format_data = TRUE, leaf_size = 8)

Arguments

`data`	The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed.
`r`	An integer representing the inverse of the splitting ratio, e.g., for an 80-20 partition, `r = 1 / 0.2 = 5`.
`u1`	Index of the data point from where twinning starts; if not provided, twinning starts from a random point in the dataset. Fixing `u1` makes twinning deterministic, i.e., the same twins are returned.
`format_data`	If set to `TRUE`, constant columns in `data` are removed, factor columns are converted to numerical using Helmert coding, and then the columns are scaled to zero mean and unit standard deviation. If set to `FALSE`, the user is expected to perform data pre-processing.
`leaf_size`	Maximum number of elements in the leaf nodes of the kd-tree.
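
To make the `r` and `format_data` arguments more concrete, here is a minimal base-R sketch. Both helpers, `ratio_to_r` and `preprocess_for_twin`, are hypothetical names that are not part of the package. The second only approximates the steps described for `format_data = TRUE` (drop constant columns, Helmert-code factors, scale); the package's internal implementation may differ in details.

```r
# Hypothetical helper (not part of twinning): convert the fraction of points
# wanted in the smaller twin into the integer r expected by twin().
ratio_to_r <- function(fraction) {
  stopifnot(is.numeric(fraction), length(fraction) == 1,
            fraction > 0, fraction <= 0.5)
  r <- round(1 / fraction)
  # Warn when the fraction cannot be represented exactly by an integer r.
  if (abs(1 / fraction - r) > 1e-8) {
    warning("1 / fraction is not an integer; using r = ", r)
  }
  as.integer(r)
}

# Hypothetical helper (not part of twinning): mimics the documented steps of
# format_data = TRUE using base R only. Assumes no missing values.
preprocess_for_twin <- function(data) {
  data <- as.data.frame(data)

  # 1. Remove constant columns (a single distinct value).
  varies <- vapply(data, function(col) length(unique(col)) > 1, logical(1))
  data <- data[, varies, drop = FALSE]

  # 2. Attach Helmert contrasts to every factor column.
  for (j in which(vapply(data, is.factor, logical(1)))) {
    data[[j]] <- droplevels(data[[j]])
    contrasts(data[[j]]) <- contr.helmert(nlevels(data[[j]]))
  }

  # Expand to a numeric matrix and drop the intercept column.
  X <- model.matrix(~ ., data = data)[, -1, drop = FALSE]

  # 3. Scale every column to zero mean and unit standard deviation.
  scale(X)
}

r <- ratio_to_r(0.2)                  # 80-20 partition
iris_num <- preprocess_for_twin(iris) # 4 numeric + 2 Helmert columns
```

Under these assumptions, the pre-processed matrix could then be passed on as `twin(iris_num, r = r, format_data = FALSE)`.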

Details

The twinning algorithm requires nearest neighbor queries that are performed using a kd-tree. The kd-tree implementation in the nanoflann (Blanco and Rai, 2014) C++ library is used.
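
For readers unfamiliar with nearest neighbor queries, the sketch below shows what such a query computes, using a brute-force scan in base R. The function `nearest_neighbors` is a hypothetical illustration and is not how the package works internally: a kd-tree answers the same kind of query without comparing the query point against every row.

```r
# Hypothetical helper (not part of twinning): indices of the k rows of X
# closest in Euclidean distance to row i, excluding row i itself.
nearest_neighbors <- function(X, i, k) {
  X <- as.matrix(X)
  # Squared Euclidean distance from row i to every row.
  d2 <- rowSums(sweep(X, 2, X[i, ])^2)
  d2[i] <- Inf  # exclude the query point
  order(d2)[seq_len(k)]
}

X <- cbind(c(0, 1, 2, 10), c(0, 0, 0, 0))
nearest_neighbors(X, i = 1, k = 2)  # rows 2 and 3
```

The brute-force scan costs time proportional to the number of rows for each query, which is the cost a kd-tree is meant to reduce on large datasets.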

Value

Indices (row numbers in `data`) of the smaller twin; the remaining rows form the larger twin.

References

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.

Joseph, V. R., & Vakayil, A. (2021). SPlit: An Optimal Method for Data Splitting. Technometrics, 1-11. doi:10.1080/00401706.2021.1921037.

Blanco, J. L., & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.

Examples

## 1. An 80-20 partition of a numeric dataset
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
twin1_indices = twin(data, r=5)    # indices of the smaller twin
twin1 = data[twin1_indices, ]      # smaller twin
twin2 = data[-twin1_indices, ]     # larger twin (all remaining rows)
plot(data, main="Smaller Twin")
points(twin1, col="green", cex=2)  # highlight the smaller twin

## 2. An 80-20 split of the iris dataset
twin1_indices = twin(iris, r=5)
twin1 = iris[twin1_indices, ]
twin2 = iris[-twin1_indices, ]

[Package twinning version 1.0 Index]