twin {twinning}    R Documentation

Partition datasets into statistically similar twin sets

Description

twin() implements the twinning algorithm presented in Vakayil and Joseph (2022). It partitions the dataset into two disjoint sets, termed twins, that are distributed similarly to each other as well as to the whole dataset. Such a partition is an optimal training-testing split (Joseph and Vakayil, 2021) for training and testing statistical and machine learning models, and is model-independent. The statistical similarity also allows either twin to be treated as a lossy compression of the dataset for tractable model building on Big Data.

Usage

twin(data, r, u1 = NULL, format_data = TRUE, leaf_size = 8)

Arguments

data

The dataset, including both the predictors and the response(s). It must not contain missing values, and only numeric and/or factor columns are allowed.

r

An integer representing the inverse of the splitting ratio, e.g., for an 80-20 partition, r = 1 / 0.2 = 5.
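The conversion from a splitting ratio to r can be sketched as below. ratio_to_r is a hypothetical helper introduced only for illustration; it is not part of the twinning package, and it assumes the smaller twin holds at most half of the data.

```r
# Hypothetical helper (not part of twinning): convert the fraction of data
# wanted in the smaller twin into the integer r expected by twin().
ratio_to_r <- function(test_fraction) {
  stopifnot(test_fraction > 0, test_fraction <= 0.5)
  r <- 1 / test_fraction
  # r must be an integer, so fractions such as 0.3 (1 / 0.3 = 3.33...) are rejected
  if (abs(r - round(r)) > 1e-8) {
    stop("1 / test_fraction is not an integer")
  }
  as.integer(round(r))
}

r_80_20 <- ratio_to_r(0.2)  # 80-20 partition
r_90_10 <- ratio_to_r(0.1)  # 90-10 partition
r_50_50 <- ratio_to_r(0.5)  # two twins of equal size
```

The returned value would then be supplied as the r argument, e.g. twin(data, r = r_80_20).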

u1

Index of the data point at which twinning starts; if not provided, twinning starts from a random point in the dataset. Fixing u1 makes twinning deterministic, i.e., the same twins are returned on every call.
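A minimal sketch of checking this determinism follows. It assumes the twinning package is installed and that u1 = 1 (the first row) is a valid starting index; the check is skipped when the package is unavailable.

```r
# Compare two runs that start from the same point (u1 = 1).
if (requireNamespace("twinning", quietly = TRUE)) {
  set.seed(1)
  X <- rnorm(100)
  Y <- rnorm(100, mean = X^2)
  data <- cbind(X, Y)
  first  <- twinning::twin(data, r = 5, u1 = 1)
  second <- twinning::twin(data, r = 5, u1 = 1)
  # With u1 fixed, both calls are expected to select the same rows
  same_twins <- identical(sort(first), sort(second))
} else {
  same_twins <- NA  # package not installed, nothing to compare
}
```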

format_data

If set to TRUE, constant columns in data are removed, factor columns are converted to numeric columns using Helmert coding, and all columns are then scaled to zero mean and unit standard deviation. If set to FALSE, the user is expected to perform the data pre-processing.
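For users who set format_data = FALSE, comparable pre-processing might look like the sketch below. preprocess_for_twin is a hypothetical helper written in base R; it follows the three steps described above but is not the package's internal code, and its output may differ in detail from what twin() produces internally.

```r
# Hypothetical helper (not part of twinning): approximates the
# pre-processing described for format_data = TRUE using base R only.
preprocess_for_twin <- function(df) {
  df <- droplevels(as.data.frame(df))
  # 1. Remove constant columns
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df <- df[, keep, drop = FALSE]
  # 2. Convert factor columns to numeric columns with Helmert coding
  is_fac <- vapply(df, is.factor, logical(1))
  contrasts_arg <- if (any(is_fac)) {
    lapply(df[is_fac], function(col) "contr.helmert")
  } else {
    NULL
  }
  X <- model.matrix(~ ., data = df, contrasts.arg = contrasts_arg)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
  # 3. Scale every column to zero mean and unit standard deviation
  scale(X)
}

# iris plus an artificial constant column that should be dropped
iris_const <- cbind(iris, constant = 1)
Z <- preprocess_for_twin(iris_const)
```

The resulting matrix could then be passed on as twin(Z, r = 5, format_data = FALSE).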

leaf_size

Maximum number of elements in the leaf nodes of the kd-tree.

Details

The twinning algorithm requires nearest neighbor queries that are performed using a kd-tree. The kd-tree implementation in the nanoflann (Blanco and Rai, 2014) C++ library is used.
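To illustrate what a nearest neighbor query returns, here is a brute-force version in base R. nearest_neighbors is a hypothetical helper for illustration only; the package itself answers such queries with a kd-tree, which avoids computing the distance to every point.

```r
# Hypothetical helper (not part of twinning): brute-force k-nearest-neighbor
# query under Euclidean distance.
nearest_neighbors <- function(X, i, k) {
  X <- as.matrix(X)
  # Squared Euclidean distance from row i to every row of X
  d2 <- rowSums(sweep(X, 2, X[i, ])^2)
  d2[i] <- Inf  # exclude the query point itself
  order(d2)[seq_len(k)]
}

X <- rbind(c(0, 0), c(1, 0), c(0, 3), c(5, 5))
nn <- nearest_neighbors(X, i = 1, k = 2)  # two rows closest to row 1
```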

Value

Row indices of data that belong to the smaller twin; the remaining rows form the larger twin.

References

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.

Joseph, V. R., & Vakayil, A. (2021). SPlit: An Optimal Method for Data Splitting. Technometrics, 1-11. doi:10.1080/00401706.2021.1921037.

Blanco, J. L., & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.

Examples

## 1. An 80-20 partition of a numeric dataset
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
twin1_indices = twin(data, r=5)    # indices of the smaller twin
twin1 = data[twin1_indices, ]      # smaller twin
twin2 = data[-twin1_indices, ]     # larger twin
plot(data, main="Smaller Twin")
points(twin1, col="green", cex=2)  # mark the smaller twin in green

## 2. An 80-20 split of the iris dataset
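## The factor column Species is Helmert-coded internally (format_data = TRUE)
twin1_indices = twin(iris, r=5)
twin1 = iris[twin1_indices, ]      # smaller twin
twin2 = iris[-twin1_indices, ]     # larger twin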


[Package twinning version 1.0 Index]