multiplet {twinning}

R Documentation

Partition datasets into multiple statistically similar disjoint sets

Description

multiplet() extends twin() to partition a dataset into multiple statistically similar disjoint sets, termed multiplets, using one of the three strategies described in Vakayil and Joseph (2022).

Usage

multiplet(data, k, strategy = 1, format_data = TRUE, leaf_size = 8)

Arguments

`data`	The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed.
`k`	The desired number of multiplets.
`strategy`	An integer, either 1, 2, or 3, selecting one of the three strategies for generating multiplets. Strategy 2 performs best, but requires `k` to be a power of 2. Strategy 3 is computationally inexpensive, but performs worse than strategies 1 and 2.
`format_data`	If set to `TRUE`, constant columns in `data` are removed, factor columns are converted to numeric columns using Helmert coding, and then all columns are scaled to zero mean and unit standard deviation. If set to `FALSE`, the user is expected to perform data pre-processing.
`leaf_size`	Maximum number of elements in the leaf-nodes of the kd-tree.
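
When `format_data = FALSE`, the pre-processing described above has to be done by the caller. The following is a minimal base R sketch of those three steps (drop constant columns, Helmert-code factors, standardize). The helper name `preprocess_for_twinning` is hypothetical, and the code only approximates the description given here; it is not the package's internal implementation.

```r
# Hypothetical helper (not part of twinning): approximates the steps that the
# documentation describes for format_data = TRUE.
preprocess_for_twinning <- function(df) {
  df <- as.data.frame(df)

  # 1. Drop constant columns (only one distinct value).
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df <- df[, keep, drop = FALSE]

  # 2. Attach Helmert contrasts to every factor column.
  is_fac <- vapply(df, is.factor, logical(1))
  for (nm in names(df)[is_fac]) {
    contrasts(df[[nm]]) <- contr.helmert(nlevels(df[[nm]]))
  }

  # Expand to a numeric design matrix and discard the intercept column.
  X <- model.matrix(~ ., data = df)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

  # 3. Scale every column to zero mean and unit standard deviation.
  scale(X)
}

X <- preprocess_for_twinning(iris)
dim(X)
```

The resulting matrix could then, presumably, be passed as `multiplet(X, k = 4, strategy = 2, format_data = FALSE)`.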

Value

List with the multiplet id, ranging from 1 to k, for each row in data.
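
As an illustration of how the returned ids might be used, the sketch below splits a dataset into one data frame per multiplet and compares their sizes and column means. To keep it runnable without the package, `multiplet_idx` here is a stand-in vector built with `rep_len()`, not real output of `multiplet()`; it also assumes the ids can be treated as an integer vector with one entry per row.

```r
# Stand-in ids so the sketch runs without the twinning package; in practice
# this would be the result of multiplet(iris, k = 4, strategy = 2).
k <- 4
multiplet_idx <- rep_len(seq_len(k), nrow(iris))

# One data frame per multiplet, keyed by multiplet id.
parts <- split(iris, multiplet_idx)

# Multiplet sizes and per-multiplet means of the four numeric columns.
sizes <- vapply(parts, nrow, integer(1))
means <- t(vapply(parts, function(p) colMeans(p[, 1:4]), numeric(4)))

print(sizes)
print(round(means, 2))
```

With real multiplet ids, the rows of `means` would be expected to be close to one another, since the multiplets are meant to be statistically similar.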

References

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.

Blanco, J. L. & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.

Examples

## 1. Generating 10 multiplets of a numeric dataset
library(twinning)
X = rnorm(n=100, mean=0, sd=1)
Y = rnorm(n=100, mean=X^2, sd=1)
data = cbind(X, Y)
multiplet_idx = multiplet(data, k=10)
multiplet_1 = data[which(multiplet_idx == 1), ]
multiplet_10 = data[which(multiplet_idx == 10), ]

## 2. Generating 4 multiplets of the iris dataset using strategy 2
multiplet_idx = multiplet(iris, k=4, strategy=2)
multiplet_1 = iris[which(multiplet_idx == 1), ]
multiplet_4 = iris[which(multiplet_idx == 4), ]
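
Because strategy 2 requires `k` to be a power of 2, a small guard can pick the strategy before calling `multiplet()`. The helpers `is_power_of_two` and `choose_strategy` below are hypothetical and not part of twinning; falling back to strategy 1 is an assumption based on it being the default.

```r
# Hypothetical helper: TRUE when k is a positive integer power of 2.
is_power_of_two <- function(k) {
  k >= 1 && k == round(k) &&
    bitwAnd(as.integer(k), as.integer(k) - 1L) == 0L
}

# Use strategy 2 when k allows it; otherwise fall back to strategy 1.
choose_strategy <- function(k) if (is_power_of_two(k)) 2L else 1L

choose_strategy(4)
choose_strategy(10)
```

The result could then be supplied as the `strategy` argument, for example `multiplet(iris, k = 4, strategy = choose_strategy(4))`.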

[Package twinning version 1.0 Index]