partition {splitTools} R Documentation

Split Data into Partitions

Description

This function provides row indices for data splitting, e.g., to split data into training, validation, and test sets. Several split strategies are supported; see Details. The partition indices are returned either as a list with one element per partition (the default) or as a vector of partition IDs.

Usage

partition(
  y,
  p,
  type = c("stratified", "basic", "grouped", "blocked"),
  n_bins = 10L,
  split_into_list = TRUE,
  use_names = TRUE,
  shuffle = FALSE,
  seed = NULL
)

Arguments

y

The variable used for "stratified" or "grouped" splits. For other split types, any vector with the same length as the data to be split.

p

A vector with split probabilities per partition, e.g., c(train = 0.7, valid = 0.3). Names are passed to the output.

type

Split type. One of "stratified" (default), "basic", "grouped", "blocked".

n_bins

Approximate number of bins for numeric y (only for type = "stratified").

split_into_list

Should the resulting partition vector be split into a list? Default is TRUE.

use_names

Should names of p be used as partition names? Default is TRUE.

shuffle

Should row indices be randomly shuffled within each partition? Default is FALSE. Shuffling is only possible when split_into_list = TRUE.

seed

Integer random seed.

Details

By default, the function uses stratified splitting. This balances the partitions as well as possible with respect to the distribution of the input vector y. (Numeric input is first binned into n_bins quantile groups.) If type = "grouped", groups specified by y are kept together when splitting. This is relevant for clustered or panel data. In contrast to basic splitting, type = "blocked" does not sample indices at random, but rather keeps them in consecutive blocks: e.g., the first 80% of observations form a training set and the remaining 20% are used for testing.
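
The four strategies described above can be sketched in base R as follows. This is only an illustration of the ideas and not the splitTools implementation: the helpers blocked_ids(), basic_ids(), grouped_ids(), and stratified_ids() are hypothetical names introduced here, and the rounding of partition sizes is an assumption.

```r
# Illustrative base R sketch; NOT the splitTools implementation.
# All helper names below are hypothetical.

# Blocked: consecutive runs of rows get the same partition ID.
blocked_ids <- function(n, p) {
  cuts <- round(n * cumsum(p / sum(p)))  # cumulative cut points as row counts
  rep(seq_along(p), times = diff(c(0, cuts)))
}

# Basic: same partition sizes, but rows are assigned at random.
basic_ids <- function(n, p) blocked_ids(n, p)[sample.int(n)]

# Grouped: draw one partition per group, then map it back to the rows.
grouped_ids <- function(y, p) {
  groups <- unique(y)
  basic_ids(length(groups), p)[match(y, groups)]
}

# Stratified: bin numeric y by quantiles, then split within each stratum.
# (Assumes the quantile breaks leave at least two distinct values.)
stratified_ids <- function(y, p, n_bins = 10L) {
  if (is.numeric(y)) {
    probs <- seq(0, 1, length.out = n_bins + 1)
    breaks <- unique(quantile(y, probs = probs, names = FALSE))
    y <- cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  }
  ids <- integer(length(y))
  for (rows in split(seq_along(y), y)) {
    ids[rows] <- basic_ids(length(rows), p)
  }
  ids
}

set.seed(1)
y <- rep(letters[1:4], each = 5)
blocked <- blocked_ids(length(y), c(0.8, 0.2))
grouped <- grouped_ids(y, c(0.5, 0.5))
stratified <- stratified_ids(y, c(0.8, 0.2))
```

In this sketch, table(y, stratified) shows the same 4:1 ratio within every level of y, while table(y, grouped) shows that each level of y falls entirely into one partition.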

Value

A list with row indices per partition (if split_into_list = TRUE) or a vector of partition IDs.
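
The two output forms carry the same information. The following base R sketch converts between them, assuming partition IDs are consecutive integers starting at 1. The ID values are made up for illustration, and the exact return types of partition() (for instance, when partition names are used) are not reproduced here.

```r
# Partition IDs for six rows (made-up values for illustration)
ids <- c(1, 2, 1, 1, 2, 1)

# ID vector -> list of row indices per partition
idx_list <- split(seq_along(ids), ids)

# List of row indices -> ID vector
ids_back <- integer(length(ids))
for (k in seq_along(idx_list)) {
  ids_back[idx_list[[k]]] <- k
}
```

A list of row indices is convenient for subsetting, e.g., data[idx_list[[1]], ] for a hypothetical data frame named data, whereas an ID vector can be stored as an extra column.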

See Also

create_folds()

Examples

# Stratified split (the default) on a character vector with four levels
y <- rep(letters[1:4], each = 5)
partition(y, p = c(0.7, 0.3), seed = 1)
partition(y, p = c(0.7, 0.3), split_into_list = FALSE, seed = 1)

# Named proportions give named partitions
p <- c(train = 0.8, valid = 0.1, test = 0.1)
partition(y, p, seed = 1)
partition(y, p, split_into_list = FALSE, seed = 1)
partition(y, p, split_into_list = FALSE, use_names = FALSE, seed = 1)

# Grouped and blocked splits
partition(y, p = c(0.7, 0.3), type = "grouped")
partition(y, p = c(0.7, 0.3), type = "blocked")

[Package splitTools version 1.0.1 Index]