multiplet {twinning} | R Documentation |
Partition datasets into multiple statistically similar disjoint sets
Description
multiplet() extends twin() to partition datasets into multiple statistically similar disjoint sets, termed multiplets, under the three strategies described in Vakayil and Joseph (2022).
Usage
multiplet(data, k, strategy = 1, format_data = TRUE, leaf_size = 8)
Arguments
data |
The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed. |
k |
The desired number of multiplets. |
strategy |
An integer, either 1, 2, or 3, selecting one of the three strategies for generating multiplets. Strategy 2 performs best, but requires |
format_data |
If set to |
leaf_size |
Maximum number of elements in the leaf-nodes of the kd-tree. |
Value
List with the multiplet id, ranging from 1 to k, for each row in data.
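The returned ids can be used to inspect and split the data. The following is a minimal sketch, assuming the twinning package is installed and that the result can be flattened to one id per row with unlist(); the names ids and parts are illustrative and not part of the package.

```r
library(twinning)

# Same call as in the Examples section: 4 multiplets of iris, strategy 2
multiplet_idx = multiplet(iris, k=4, strategy=2)

# Assumption: unlist() leaves a plain vector unchanged and flattens a list,
# so ids holds one multiplet id per row either way
ids = unlist(multiplet_idx)

# Number of rows assigned to each multiplet
table(ids)

# Split the data into one data frame per multiplet
parts = split(iris, ids)

# Compare the means of the numeric columns across multiplets;
# statistically similar multiplets should give similar values in each row
sapply(parts, function(d) colMeans(d[, 1:4]))
```

Comparing per-multiplet summaries in this way is only a quick sanity check, not the similarity measure used by the method itself.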
References
Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.
Blanco, J. L. & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.
Examples
## 1. Generating 10 multiplets of a numeric dataset
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
multiplet_idx = multiplet(data, k=10)
multiplet_1 = data[which(multiplet_idx == 1), ]
multiplet_10 = data[which(multiplet_idx == 10), ]
## 2. Generating 4 multiplets of the iris dataset using strategy 2
multiplet_idx = multiplet(iris, k=4, strategy=2)
multiplet_1 = iris[which(multiplet_idx == 1), ]
multiplet_4 = iris[which(multiplet_idx == 4), ]