R: Split a dataset for training and testing

SPlit {SPlit}

R Documentation

Split a dataset for training and testing

Description

SPlit() implements the optimal data splitting procedure described in Joseph and Vakayil (2021). SPlit can be applied to both regression and classification problems, and is model-independent. As a preprocessing step, the nominal categorical columns in the dataset must be declared as factors, and the ordinal categorical columns must be converted to numeric using scoring.

Usage

SPlit(
  data,
  splitRatio = 0.2,
  kappa = NULL,
  maxIterations = 500,
  tolerance = 1e-10,
  nThreads
)

Arguments

`data`	The dataset including both the predictors and response(s); should not contain missing values, and only numeric and/or factor column(s) are allowed.
`splitRatio`	The ratio in which the dataset is to be split; should be in (0, 1). For example, for an 80-20 split, `splitRatio` is either 0.8 or 0.2.
`kappa`	If provided, stochastic majorization-minimization is used to compute the support points, drawing a random sample of size `ceiling(kappa * splitRatio * nrow(data))` from the dataset in every iteration.
`maxIterations`	The maximum number of iterations allowed for support points optimization; the optimization stops at this limit if the tolerance level has not been reached.
`tolerance`	The tolerance level for support points optimization; measured in terms of the maximum point-wise difference in distance between successive solutions.
`nThreads`	Number of threads to be used for parallel computation; if not supplied, `nThreads` defaults to the maximum number of available threads.
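The requirements on `data` can be checked before calling SPlit(). The following is a minimal sketch in base R; `checkSplitInput()` is a hypothetical helper written for illustration and is not part of the SPlit package.

```r
# Hypothetical helper (not part of the SPlit package): checks the documented
# requirements on `data`, i.e. no missing values and only numeric/factor columns.
checkSplitInput <- function(data) {
  data <- as.data.frame(data)
  # Requirement 1: no missing values in any column.
  hasMissing <- any(vapply(data, anyNA, logical(1)))
  # Requirement 2: every column is numeric or a factor.
  okType <- vapply(data, function(col) is.numeric(col) || is.factor(col), logical(1))
  list(
    hasMissing = hasMissing,
    badColumns = names(data)[!okType],
    ok = !hasMissing && all(okType)
  )
}

# A dataset that meets the requirements.
good <- data.frame(x = c(1.5, 2.0, 3.2), g = factor(c("a", "b", "a")))
# A dataset with a missing value and a character column.
bad <- data.frame(x = c(1.5, NA, 3.2), s = c("a", "b", "a"), stringsAsFactors = FALSE)

print(checkSplitInput(good)$ok)
print(checkSplitInput(bad)$badColumns)
```

A character column flagged this way can usually be fixed by declaring it as a factor (nominal) or by replacing it with numeric scores (ordinal).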

Details

Support points are defined only for continuous variables, so categorical variables are handled as follows. SPlit() automatically converts a nominal categorical variable with m levels into m-1 continuous variables using Helmert coding. Ordinal categorical variables should be converted to numeric columns using a scoring method before calling SPlit(). For example, if the three levels of an ordinal variable are poor, good, and excellent, the user may choose 1, 2, and 5 to represent them. Because suitable scores depend on the problem and the data collection method, SPlit() does not assign them automatically. The columns of the resulting numeric dataset are then standardized to have mean zero and variance one. SPlit() then computes the support points and calls the provided subsample() function to perform nearest neighbor subsampling. The indices of this subsample are returned.
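The preprocessing described above can be sketched in base R. This is an illustration only and not the package's internal code: the toy data frame, the column names, and the use of `model.matrix()` and `scale()` are assumptions made for the example, and `scale()` standardizes with the n-1 variance denominator, which may differ from what SPlit() does internally.

```r
# Toy dataset (hypothetical): one numeric column, one nominal column,
# and one ordinal column stored as text.
toy <- data.frame(
  x = c(0.5, 1.2, 2.8, 3.1, 4.7, 5.0),
  colour = factor(c("red", "green", "blue", "red", "blue", "green")),
  rating = c("poor", "good", "excellent", "good", "poor", "excellent"),
  stringsAsFactors = FALSE
)

# Ordinal column: the user picks the scores (1, 2, 5 as in the example above).
scores <- c(poor = 1, good = 2, excellent = 5)
toy$rating <- unname(scores[toy$rating])

# Nominal column: a factor with m = 3 levels becomes m - 1 = 2 numeric
# columns under Helmert coding; the intercept column is dropped.
helmert <- model.matrix(
  ~ colour, data = toy,
  contrasts.arg = list(colour = "contr.helmert")
)[, -1, drop = FALSE]

# Combine into one numeric matrix and standardize each column.
numericData <- cbind(x = toy$x, helmert, rating = toy$rating)
standardized <- scale(numericData)

print(round(colMeans(standardized), 10))
print(apply(standardized, 2, var))
```

Only the scoring step needs to be done by the user; the Helmert coding and standardization are shown here to make the transformation concrete.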

SPlit can be time-consuming for large datasets. The computational time can be reduced by using the stochastic majorization-minimization technique, at some cost to the quality of the split. For example, setting kappa = 2 uses a random sample twice the size of the smaller subset in the split, instead of the whole dataset, in every iteration of the support points optimization. Another option for large datasets is data twinning (Vakayil and Joseph, 2022), implemented in the R package twinning. Twinning is extremely fast, but for small datasets its results may not be as good as those of SPlit.
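To get a feel for how much data each iteration touches when kappa is set, the sample size formula from the Arguments section can be evaluated directly. The helper name `perIterationSampleSize()` and the dataset size below are hypothetical; only the formula itself comes from this page.

```r
# Hypothetical helper: per-iteration sample size under stochastic
# majorization-minimization, using the formula from the `kappa` argument.
perIterationSampleSize <- function(kappa, splitRatio, n) {
  ceiling(kappa * splitRatio * n)
}

n <- 1000          # assumed number of rows in the dataset
splitRatio <- 0.2  # 80-20 split; the smaller subset has about 200 rows

# Sample sizes for a few choices of kappa, next to the full dataset size.
kappas <- c(1, 2, 4)
sizes <- vapply(kappas, perIterationSampleSize, numeric(1),
                splitRatio = splitRatio, n = n)
print(data.frame(kappa = kappas, sampleSize = sizes, fullData = n))
```

Smaller values of kappa reduce the work per iteration, while larger values move the computation closer to using the whole dataset.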

Value

Indices of the smaller subset in the split.
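As an illustration of how indices like these can arise from a nearest neighbor subsampling step, the sketch below greedily matches each point to its closest unused row. It is an assumption-laden toy: `nearestNeighborSubsample()` is a hypothetical helper, the matching rule (greedy, without replacement, Euclidean distance) is assumed for the example, and it is not the package's subsample() function.

```r
# Hypothetical helper: for each row of `points`, pick the closest row of
# `data` that has not been picked yet, and return the picked row indices.
nearestNeighborSubsample <- function(data, points) {
  data <- as.matrix(data)
  points <- as.matrix(points)
  available <- rep(TRUE, nrow(data))
  indices <- integer(nrow(points))
  for (i in seq_len(nrow(points))) {
    # Squared Euclidean distance from point i to every row of data.
    d2 <- rowSums(sweep(data, 2, points[i, ])^2)
    # Rows already selected cannot be selected again.
    d2[!available] <- Inf
    indices[i] <- which.min(d2)
    available[indices[i]] <- FALSE
  }
  indices
}

# Four data rows and two target points (toy values).
toyData <- cbind(c(0, 1, 2, 10), c(0, 1, 2, 10))
toyPoints <- cbind(c(0.9, 1.1), c(0.9, 1.1))

toyIndices <- nearestNeighborSubsample(toyData, toyPoints)
print(toyIndices)
print(toyData[toyIndices, ])   # the selected subset
print(toyData[-toyIndices, ])  # the remaining rows
```

The returned vector is used the same way as the output of SPlit(): positive indexing gives the smaller subset and negative indexing gives the rest.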

References

Joseph, V. R., & Vakayil, A. (2021). SPlit: An Optimal Method for Data Splitting. Technometrics, 1-11. https://doi.org/10.1080/00401706.2021.1921037.

Vakayil, A., & Joseph, V. R. (2022). Data Twinning. Statistical Analysis and Data Mining: The ASA Data Science Journal. https://doi.org/10.1002/sam.11574.

Mak, S., & Joseph, V. R. (2018). Support points. The Annals of Statistics, 46(6A), 2562-2592.

Examples

## 1. An 80-20 split of a numeric dataset
library(SPlit)
X = rnorm(n = 100, mean = 0, sd = 1)
Y = rnorm(n = 100, mean = X^2, sd = 1)
data = cbind(X, Y)
# Default splitRatio = 0.2: the returned indices select the smaller (20%) subset
SPlitIndices = SPlit(data, tolerance = 1e-6, nThreads = 2)
dataTest = data[SPlitIndices, ]
dataTrain = data[-SPlitIndices, ]
# Plot all points and mark the testing set with larger green circles
plot(data, main = "SPlit testing set")
points(dataTest, col = 'green', cex = 2)

## 2. An 80-20 split of the iris dataset
# Species is already a factor, so it is treated as a nominal categorical column
SPlitIndices = SPlit(iris, nThreads = 2)
irisTest = iris[SPlitIndices, ]
irisTrain = iris[-SPlitIndices, ]

[Package SPlit version 1.2 Index]