twin {twinning} | R Documentation |
Partition datasets into statistically similar twin sets
Description
twin() implements the twinning algorithm presented in Vakayil and Joseph (2022). It returns a partition of the dataset into two disjoint sets, termed twins, that are distributed similarly to each other and to the whole dataset. Such a partition is an optimal training-testing split (Joseph and Vakayil, 2021) for training and testing statistical and machine learning models, and it is model-independent. The statistical similarity also allows either twin to be treated as a lossy compression of the dataset for tractable model building on Big Data.
Usage
twin(data, r, u1 = NULL, format_data = TRUE, leaf_size = 8)
Arguments
data |
The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed. |
r |
An integer representing the inverse of the splitting ratio, e.g., for an 80-20 partition, r = 5 (that is, 1/0.2). |
u1 |
Index of the data point from where twinning starts; if not provided, twinning starts from a random point in the dataset. Fixing |
format_data |
If set to |
leaf_size |
Maximum number of elements in the leaf-nodes of the kd-tree. |
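As a rough illustration of how r relates to a splitting ratio, the base R sketch below converts the fraction of data wanted in the smaller twin into r. The helper name ratio_to_r is hypothetical and is not part of the twinning package; the restriction to fractions of at most one half is an assumption made for this sketch.

```r
# Hypothetical helper (not part of the twinning package): convert the
# fraction of data wanted in the smaller twin into the integer r.
ratio_to_r <- function(fraction) {
  # Assumption for this sketch: the smaller twin holds at most half the data
  stopifnot(is.numeric(fraction), length(fraction) == 1,
            fraction > 0, fraction <= 0.5)
  r <- 1 / fraction
  # r must be an integer, so only fractions such as 1/2, 1/4, 1/5 qualify
  if (abs(r - round(r)) > 1e-8) {
    stop("1 / fraction must be an integer")
  }
  as.integer(round(r))
}

r <- ratio_to_r(0.2)  # 80-20 partition, so r is 5
r
```

The resulting value can then be passed as the r argument, as in twin(data, r = r).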
Details
The twinning algorithm requires nearest neighbor queries, which are performed using a kd-tree. The kd-tree implementation in the nanoflann (Blanco and Rai, 2014) C++ library is used.
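To make the notion of a nearest neighbor query concrete, the base R sketch below answers one by brute force. It is a plain linear scan and not the kd-tree used by the package, which exists to answer the same kind of query faster on large datasets. The function name nearest_neighbor and the toy points are hypothetical.

```r
# Brute-force nearest neighbor query (illustration only; not a kd-tree).
# `points` is a numeric matrix with one point per row, `query` a numeric
# vector with one value per column of `points`.
nearest_neighbor <- function(points, query) {
  # Squared Euclidean distance from `query` to every row of `points`
  d2 <- rowSums(sweep(points, 2, query)^2)
  # Row index of the closest point
  which.min(d2)
}

pts <- rbind(c(0, 0), c(1, 1), c(5, 5))
nearest_neighbor(pts, c(0.9, 1.2))  # the second row is the closest point
```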
Value
Indices of the smaller twin.
References
Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.
Joseph, V. R., & Vakayil, A. (2021). SPlit: An Optimal Method for Data Splitting. Technometrics, 1-11. doi:10.1080/00401706.2021.1921037.
Blanco, J. L., & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.
Examples
## 1. An 80-20 partition of a numeric dataset
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
twin1_indices = twin(data, r=5)
twin1 = data[twin1_indices, ]
twin2 = data[-twin1_indices, ]
plot(data, main="Smaller Twin")
points(twin1, col="green", cex=2)
## 2. An 80-20 split of the iris dataset
twin1_indices = twin(iris, r=5)
twin1 = iris[twin1_indices, ]
twin2 = iris[-twin1_indices, ]
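One further sketch, not from the original examples: it passes an explicit starting index through the u1 argument shown in the usage line and compares column means of the two twins with those of the full dataset. The seed, the choice u1 = 1, and the use of column means as a similarity check are assumptions made for illustration. The code only runs when the twinning package is installed.

```r
## 3. Starting from a fixed point and comparing the twins (illustrative sketch)
if (requireNamespace("twinning", quietly = TRUE)) {
  set.seed(1)  # arbitrary seed so the simulated data can be regenerated
  X <- rnorm(n = 100, mean = 0, sd = 1)
  Y <- rnorm(n = 100, mean = X^2, sd = 1)
  data <- cbind(X, Y)

  # Start twinning from the first data point instead of a random one
  idx <- twinning::twin(data, r = 5, u1 = 1)

  # Column means of the full data and of each twin, side by side
  means <- rbind(full  = colMeans(data),
                 twin1 = colMeans(data[idx, , drop = FALSE]),
                 twin2 = colMeans(data[-idx, , drop = FALSE]))
  print(means)
}
```

Column means are only a coarse check; comparing plots or other summaries of the twins against the full dataset would follow the same pattern.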