multiplet {twinning} R Documentation

Partition datasets into multiple statistically similar disjoint sets

Description

multiplet() extends twin() to partition a dataset into multiple statistically similar disjoint sets, termed multiplets, using one of the three strategies described in Vakayil and Joseph (2022).

Usage

multiplet(data, k, strategy = 1, format_data = TRUE, leaf_size = 8)

Arguments

data

The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed.

k

The desired number of multiplets.

strategy

An integer, either 1, 2, or 3, selecting one of the three strategies for generating multiplets. Strategy 2 performs best, but requires k to be a power of 2. Strategy 3 is computationally inexpensive, but performs worse than strategies 1 and 2.

format_data

If set to TRUE, constant columns in data are removed, factor columns are converted to numeric columns using Helmert coding, and then all columns are scaled to zero mean and unit standard deviation. If set to FALSE, the user is expected to perform the data pre-processing.

leaf_size

Maximum number of elements in the leaf-nodes of the kd-tree.
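
For users who set format_data = FALSE, the pre-processing described under format_data can be approximated in base R. The following is only a sketch under stated assumptions: preprocess_manual is a hypothetical helper that is not part of the twinning package, and the package's internal formatting may differ in details such as column order or the handling of unused factor levels.

```r
# Hypothetical helper (not part of the twinning package): approximates the
# pre-processing described for format_data = TRUE using base R only.
preprocess_manual <- function(df) {
  df <- as.data.frame(df)

  # 1. Remove constant columns (columns with a single distinct value).
  keep <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
  df <- df[, keep, drop = FALSE]

  # 2. Encode factor columns with Helmert contrasts; numeric columns pass
  #    through unchanged.
  is_fac <- vapply(df, is.factor, logical(1))
  contrasts_arg <- NULL
  if (any(is_fac)) {
    contrasts_arg <- setNames(rep(list("contr.helmert"), sum(is_fac)),
                              names(df)[is_fac])
  }
  mm <- model.matrix(~ ., data = df, contrasts.arg = contrasts_arg)
  mm <- mm[, colnames(mm) != "(Intercept)", drop = FALSE]

  # 3. Scale every column to zero mean and unit standard deviation.
  scale(mm)
}

# iris has 4 numeric columns and one 3-level factor (2 Helmert columns).
Z <- preprocess_manual(iris)
dim(Z)
```

A matrix prepared this way could then be passed on with format_data = FALSE, for example multiplet(Z, k = 4, format_data = FALSE).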

Value

List with the multiplet id, ranging from 1 to k, for each row in data.
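
Because the returned ids run from 1 to k, all k subsets can be extracted in one step rather than one id at a time. The sketch below assumes the ids can be flattened to a plain vector with unlist(). Here split_by_multiplet is a hypothetical helper, and ids is a hand-written stand-in for the output of multiplet(), used so that the snippet runs without the package installed.

```r
# Hypothetical helper (not part of the twinning package): split the rows of
# `data` into k subsets according to their multiplet ids.
split_by_multiplet <- function(data, ids, k) {
  ids <- as.integer(unlist(ids))           # flatten in case a list is returned
  stopifnot(length(ids) == nrow(data))     # one id per row is expected
  split(as.data.frame(data), factor(ids, levels = seq_len(k)))
}

# Stand-in inputs: in practice `ids` would come from multiplet(data, k = 3).
data <- data.frame(x = 1:6, y = c(2.1, 3.9, 6.2, 8.0, 9.8, 12.1))
ids  <- c(1, 2, 3, 1, 2, 3)

parts <- split_by_multiplet(data, ids, k = 3)
sizes <- vapply(parts, nrow, integer(1))   # number of rows in each multiplet
```

Each element of parts is a data frame holding one multiplet, and sizes gives a quick check that the subsets are of comparable size.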

References

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal, to appear. arXiv preprint arXiv:2110.02927.

Blanco, J. L. & Rai, P. K. (2014). nanoflann: a C++ header-only fork of FLANN, a library for nearest neighbor (NN) with kd-trees. https://github.com/jlblancoc/nanoflann.

Examples

## 1. Generating 10 multiplets of a numeric dataset
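X = rnorm(n=100, mean=0, sd=1)         # predictor
Y = rnorm(n=100, mean=X^2, sd=1)       # response whose mean is quadratic in X
data = cbind(X, Y)                     # 100 x 2 numeric matrix
multiplet_idx = multiplet(data, k=10)  # multiplet id (1 to 10) for each row
multiplet_1 = data[which(multiplet_idx == 1), ]    # rows in multiplet 1
multiplet_10 = data[which(multiplet_idx == 10), ]  # rows in multiplet 10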

## 2. Generating 4 multiplets of the iris dataset using strategy 2
multiplet_idx = multiplet(iris, k=4, strategy=2)  # k = 4 is a power of 2, as strategy 2 requires
multiplet_1 = iris[which(multiplet_idx == 1), ]   # rows in multiplet 1
multiplet_4 = iris[which(multiplet_idx == 4), ]   # rows in multiplet 4
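
Example 2 works because k = 4 is a power of 2, which strategy 2 requires. When k is chosen programmatically, a check like the one below can be made before calling multiplet(). This is a sketch only: is_power_of_two and choose_strategy are hypothetical helpers, not part of the twinning package, and falling back to strategy 1 is one reasonable choice given that strategy 3 is described as performing worse than strategies 1 and 2.

```r
# Hypothetical helpers (not part of the twinning package).

# TRUE if k is a positive whole number with exactly one bit set.
is_power_of_two <- function(k) {
  k >= 1 && k == round(k) &&
    bitwAnd(as.integer(k), as.integer(k) - 1L) == 0L
}

# Prefer strategy 2 when k allows it; otherwise fall back to strategy 1.
choose_strategy <- function(k) {
  if (is_power_of_two(k)) 2L else 1L
}

choose_strategy(4)   # strategy 2 is allowed
choose_strategy(10)  # not a power of 2, so strategy 1
```

The chosen value could then be passed on as multiplet(data, k = k, strategy = choose_strategy(k)).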


[Package twinning version 1.0 Index]