ResamplingVariableSizeTrainCV {mlr3resampling}    R Documentation

Resampling with variable size train sets

Description

ResamplingVariableSizeTrainCV defines how a task is partitioned for resampling, for example in resample() or benchmark().

Resampling objects can be instantiated on a Task.

After instantiation, the train and test sets can be accessed via $train_set(i) and $test_set(i), respectively.
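The following is a minimal sketch of that workflow, assuming the mlr3 and mlr3resampling packages are installed. The iris task is only an illustrative choice and the default hyper-parameter values are used as shipped.

```r
library(mlr3)
library(mlr3resampling)

# A small built-in classification task shipped with mlr3 (illustrative choice).
task <- tsk("iris")

# Construct the resampling with its default hyper-parameters,
# then fix the train/test splits for this task.
var_sizes <- ResamplingVariableSizeTrainCV$new()
var_sizes$instantiate(task)

# Row ids used for training and testing in the first iteration.
train_ids <- var_sizes$train_set(1)
test_ids <- var_sizes$test_set(1)
length(train_ids)
length(test_ids)
```

Since the train set is drawn from observations outside the test fold, the two vectors of row ids should not overlap.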

Details

A supervised learning algorithm inputs a train set, and outputs a prediction function, which can be used on a test set. How many train samples are required to get accurate predictions on a test set? Cross-validation can be used to answer this question, with variable size train sets.

Stratification

ResamplingVariableSizeTrainCV supports stratified sampling. The stratification variables are assumed to be discrete, and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.
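As a sketch of how a stratification variable can be declared, the example below assigns the "stratum" column role on an mlr3 task. It assumes the mlr3 package is installed; the iris task and the choice of Species as the stratum are illustrative only.

```r
library(mlr3)

task <- tsk("iris")

# Mark the discrete target column as a stratification variable.
task$col_roles$stratum <- "Species"

# One row per stratum, with its size N and the row ids it contains.
task$strata
```

With several columns in the "stratum" role, each observed combination of their values would appear as its own row in this table.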

Grouping

ResamplingVariableSizeTrainCV does not support grouping of observations.

Hyper-parameters

The number of cross-validation folds is set by the folds parameter.

For each fold ID, the corresponding observations are considered the test set, and a variable number of other observations are considered the train set.

The random_seeds parameter controls the number of random orderings of the train set that are considered.

For each random order of the train set, the min_train_data parameter controls the size of the smallest stratum in the smallest train set considered.

To determine the other train set sizes, we use an equally spaced grid on the log scale, from min_train_data to the largest train set size (all data not in test set). The number of train set sizes in this grid is determined by the train_sizes parameter.
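The grid of train set sizes can be illustrated in base R. This is a sketch for the unstratified case with hypothetical values, not the package's actual implementation; the variable names n_rows, n_folds, max_train and size_grid are introduced here for illustration, and rounding to whole observations is an assumption.

```r
# Hypothetical values, chosen only for illustration.
n_rows <- 150L          # observations in the task
n_folds <- 3L           # number of cross-validation folds
min_train_data <- 10L   # smallest train set size
train_sizes <- 5L       # number of train set sizes in the grid

# Largest train set: all data not in the test fold.
max_train <- n_rows - n_rows %/% n_folds

# Equally spaced on the log scale, then rounded to whole observations
# (the rounding step is an assumption of this sketch).
size_grid <- unique(as.integer(round(exp(seq(
  log(min_train_data), log(max_train), length.out = train_sizes)))))
size_grid
# 10 18 32 56 100
```

On the log scale, consecutive sizes differ by a roughly constant ratio (here about 1.78), so small train sets are sampled more densely than large ones.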

Methods

Public methods


Method new()

Creates a new instance of this R6 class.

Usage
Resampling$new(
  id,
  param_set = ps(),
  duplicated_ids = FALSE,
  label = NA_character_,
  man = NA_character_
)
Arguments
id

(character(1))
Identifier for the new instance.

param_set

(paradox::ParamSet)
Set of hyperparameters.

duplicated_ids

(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.

label

(character(1))
Label for the new instance.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().


Method train_set()

Returns the row ids of the i-th training set.

Usage
Resampling$train_set(i)
Arguments
i

(integer(1))
Iteration.

Returns

(integer()) of row ids.


Method test_set()

Returns the row ids of the i-th test set.

Usage
Resampling$test_set(i)
Arguments
i

(integer(1))
Iteration.

Returns

(integer()) of row ids.

Examples

(var_sizes <- mlr3resampling::ResamplingVariableSizeTrainCV$new())

[Package mlr3resampling version 2024.7.7 Index]