partition {splitTools}

R Documentation

Split Data into Partitions

Description

This function provides row indices for data splitting, e.g., to split data into training, validation, and test sets. Different split strategies are supported, see Details. The partition indices are returned either as a list with one element per partition (the default) or as a vector of partition IDs.

Usage

partition(
  y,
  p,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  split_into_list = TRUE,
  use_names = TRUE,
  shuffle = FALSE,
  seed = NULL
)

Arguments

`y`	The variable used for stratified or grouped splits. For other split types, any vector of the same length as the data to be split.
`p`	A vector with split probabilities per partition, e.g., `c(train = 0.7, valid = 0.3)`. Names are passed to the output.
`type`	Split type. One of "stratified" (default), "basic", "grouped", "blocked".
`n_bins`	Approximate number of bins for numeric `y` (only for `type = "stratified"`).
`split_into_list`	Should the resulting partition vector be split into a list? Default is `TRUE`.
`use_names`	Should names of `p` be used as partition names? Default is `TRUE`.
`shuffle`	Should row indices be randomly shuffled within each partition? Default is `FALSE`. Shuffling is only possible when `split_into_list = TRUE`.
`seed`	Integer random seed.

Details

By default, the function uses stratified splitting, which balances the partitions as well as possible with respect to the distribution of the input vector `y`. (Numeric input is first binned into approximately `n_bins` quantile groups.) If `type = "grouped"`, groups specified by `y` are kept together when splitting, which is relevant for clustered or panel data. In contrast to basic splitting, `type = "blocked"` does not sample indices at random but keeps them in consecutive blocks: e.g., with `p = c(0.8, 0.2)`, the first 80% of observations form the training set and the remaining 20% are used for testing.
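The three non-basic strategies can be illustrated with a small base R sketch. This is not the splitTools implementation; the variable names (`blocked`, `grouped`, `stratified`, `bins`) and the choice of four quantile bins are assumptions made for illustration only.

```r
# Illustrative base R sketch of the split strategies; NOT splitTools internals.
set.seed(1)
n <- 20
p <- c(train = 0.8, test = 0.2)

# Blocked: consecutive rows in their original order, no random sampling.
n_train <- round(n * p[["train"]])
blocked <- list(train = seq_len(n_train), test = seq(n_train + 1, n))

# Grouped: sample whole groups, so no group is spread over two partitions.
g <- rep(c("a", "b", "c", "d", "e"), each = 4)
groups <- unique(g)
train_groups <- sample(groups, size = round(length(groups) * p[["train"]]))
grouped <- list(
  train = which(g %in% train_groups),
  test  = which(!g %in% train_groups)
)

# Stratified with numeric y: bin y into quantile groups (four bins assumed
# here), then sample the training share within each bin.
y <- rnorm(n)
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = 5)))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)
in_train <- logical(n)
for (b in levels(bins)) {
  idx <- which(bins == b)
  pick <- sample.int(length(idx), size = round(length(idx) * p[["train"]]))
  in_train[idx[pick]] <- TRUE
}
stratified <- list(train = which(in_train), test = which(!in_train))
```

In each case the result has the same shape as the default list output: one integer vector of row indices per partition.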

Value

A list with row indices per partition (if `split_into_list = TRUE`) or a vector of partition IDs.
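As a minimal sketch of how the list output is typically used, the returned indices can subset a data frame directly. The example assumes that splitTools is installed and uses the built-in `iris` data; the object names `inds`, `train`, `valid`, and `test` are illustrative.

```r
library(splitTools)  # assumes the package is installed

# Stratify on Species so each partition keeps a similar class balance.
inds <- partition(
  iris$Species,
  p = c(train = 0.6, valid = 0.2, test = 0.2),
  seed = 1
)

# Each list element holds row indices, so it can be used for subsetting.
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test  <- iris[inds$test, ]
```

Because the names of `p` are passed to the output, the partitions can be addressed as `inds$train`, `inds$valid`, and `inds$test`.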

Examples

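# Four classes with five observations each
y <- rep(letters[1:4], each = 5)
# Stratified split (default): as list of indices and as vector of partition IDs
partition(y, p = c(0.7, 0.3), seed = 1)
partition(y, p = c(0.7, 0.3), split_into_list = FALSE, seed = 1)
# Named partitions: names of p are passed to the output unless use_names = FALSE
p <- c(train = 0.8, valid = 0.1, test = 0.1)
partition(y, p, seed = 1)
partition(y, p, split_into_list = FALSE, seed = 1)
partition(y, p, split_into_list = FALSE, use_names = FALSE, seed = 1)
# Grouped split (values of y kept together) and blocked split (no random sampling)
partition(y, p = c(0.7, 0.3), type = "grouped")
partition(y, p = c(0.7, 0.3), type = "blocked")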

[Package splitTools version 1.0.1 Index]