partition {splitTools} | R Documentation |
Split Data into Partitions
Description
This function provides row indices for data splitting, e.g., to split data into training, validation, and test sets. Several split strategies are supported; see Details. The partition indices are returned either as a list with one element per partition (the default) or as a vector of partition IDs.
Usage
partition(
y,
p,
type = c("stratified", "basic", "grouped", "blocked"),
n_bins = 10L,
split_into_list = TRUE,
use_names = TRUE,
shuffle = FALSE,
seed = NULL
)
Arguments
y |
For "stratified" or "grouped" splits, the variable to stratify or group by. For other split types, any vector with the same length as the data to be split. |
p |
A vector with split probabilities per partition, e.g., c(train = 0.8, valid = 0.1, test = 0.1) as used in the Examples. |
type |
Split type. One of "stratified" (default), "basic", "grouped", "blocked". |
n_bins |
Approximate number of bins for numeric y (used for stratified splitting). |
split_into_list |
Should the resulting partition vector be split into a list? Default is TRUE. |
use_names |
Should names of p be used as partition names? Default is TRUE. |
shuffle |
Should row indices be randomly shuffled within each partition? Default is FALSE. |
seed |
Integer random seed. |
Details
By default, the function uses stratified splitting. This balances the partitions as well as possible with respect to the distribution of the input vector y. (Numeric input is first binned into n_bins quantile groups.)
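The binning step for numeric input can be pictured with base R. This is a minimal sketch of the idea only, not the package's actual implementation; the simulated y and the use of quantile() and cut() are assumptions made for illustration.

```r
set.seed(1)
y <- rexp(1000)   # simulated, skewed numeric variable (illustrative only)
n_bins <- 10

# Break points at equally spaced quantiles; unique() guards against ties
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = n_bins + 1)))

# Each observation is assigned to one quantile group
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# The groups have roughly equal size and can serve as strata
table(bins)
```

Sampling within each such group is what keeps the distribution of y similar across partitions.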
If type = "grouped", groups specified by y are kept together when splitting. This is relevant for clustered or panel data.
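The idea behind grouped splitting can be sketched in base R by sampling at the group level and mapping the result back to rows. The cluster IDs, partition names, and probabilities below are hypothetical, and the code is an illustration of the principle rather than the package's implementation.

```r
set.seed(1)
id <- rep(seq_len(20), each = 5)   # hypothetical cluster IDs: 20 groups, 5 rows each
groups <- unique(id)

# Draw one partition label per group, not per row
group_part <- sample(
  c("train", "test"),
  size = length(groups),
  replace = TRUE,
  prob = c(0.7, 0.3)
)

# Map the group-level assignment back to the rows
row_part <- group_part[match(id, groups)]
parts <- split(seq_along(id), row_part)

# No group should appear in more than one partition
intersect(id[parts$train], id[parts$test])
```

Because assignment happens per group, the realized partition sizes only approximate the requested proportions.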
In contrast to basic splitting, type = "blocked" does not sample indices at random but keeps them in consecutive blocks in their original order: e.g., the first 80% of observations form a training set and the remaining 20% are used for testing.
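A blocked split of this kind can be sketched in base R as follows. The sample size and the partition names are assumptions chosen for illustration, and the sketch relies on the rounded block sizes adding up to n; it is not the package's implementation.

```r
n <- 10
p <- c(train = 0.8, test = 0.2)

# Block sizes; this sketch assumes the rounded sizes sum to n
sizes <- round(n * p)

# Consecutive runs of rows get the same partition label, in original order
ids <- rep(names(p), times = sizes)
blocks <- split(seq_len(n), factor(ids, levels = names(p)))
blocks
```

Here rows 1 to 8 form the training block and rows 9 and 10 the test block, so the temporal or positional order of the data is preserved.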
Value
A list with row indices per partition (if split_into_list = TRUE) or a vector of partition IDs.
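As a sketch of how the two return forms might be used, the following applies partition() to the built-in iris data. The choice of data set, the partition names, and the object names are assumptions for illustration.

```r
library(splitTools)

p <- c(train = 0.8, valid = 0.1, test = 0.1)

# List form: one vector of row indices per partition
inds <- partition(iris$Species, p = p, seed = 1)
train <- iris[inds$train, ]
valid <- iris[inds$valid, ]
test  <- iris[inds$test, ]

# Vector form: one partition ID per row, convenient for tabulation
ids <- partition(iris$Species, p = p, split_into_list = FALSE, seed = 1)
table(ids, iris$Species)
```

The list form is convenient for subsetting, while the vector form can be stored as an extra column next to the data.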
See Also
Examples
# Stratified split (the default) of a character vector with four levels
y <- rep(letters[1:4], each = 5)
partition(y, p = c(0.7, 0.3), seed = 1)
partition(y, p = c(0.7, 0.3), split_into_list = FALSE, seed = 1)
# Named split probabilities
p <- c(train = 0.8, valid = 0.1, test = 0.1)
partition(y, p, seed = 1)
partition(y, p, split_into_list = FALSE, seed = 1)
partition(y, p, split_into_list = FALSE, use_names = FALSE, seed = 1)
# Grouped and blocked splits
partition(y, p = c(0.7, 0.3), type = "grouped")
partition(y, p = c(0.7, 0.3), type = "blocked")