R: Resampling for comparing training on same or other groups

ResamplingSameOtherCV {mlr3resampling}

R Documentation

Resampling for comparing training on same or other groups

Description

ResamplingSameOtherCV defines how a task is partitioned for resampling, for example in resample() or benchmark().

Resampling objects can be instantiated on a Task, which should define at least one group variable.

After instantiation, the train and test sets for iteration i can be accessed via $train_set(i) and $test_set(i), respectively.

Details

A supervised learning algorithm takes a train set as input and outputs a prediction function, which can then be applied to a test set. If each data point belongs to a group (such as a geographic region or a year), is it possible to train on one group and predict accurately on another? Cross-validation can measure the extent to which this is possible. First, every data point is assigned a fold ID from 1 to K (possibly using stratification, usually by group and label). Then, for each test set (a group/fold combination) and each choice of train set (same group, other groups, or all groups), we train a model and compute its prediction accuracy on the test set.

Comparing test accuracy between same and other shows how well training transfers between groups. Transfer is perfect if same and other have similar test accuracy for each group. Usually, other is somewhat less accurate than same. When the groups have different patterns, other can be as bad as a featureless baseline.
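The comparison described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the simulated data, the group names "A" and "B", and the use of lm() with mean squared error are all assumptions made for the example.

```r
# Illustrative sketch in base R; not the package's implementation.
set.seed(1)
n_per_group <- 60
K <- 3  # number of cross-validation folds

# Hypothetical data: two groups whose x -> y relationship has opposite sign.
dat <- data.frame(
  group = rep(c("A", "B"), each = n_per_group),
  x = runif(2 * n_per_group, -1, 1)
)
slope <- ifelse(dat$group == "A", 2, -2)
dat$y <- slope * dat$x + rnorm(nrow(dat), sd = 0.1)

# Assign fold IDs 1..K separately within each group.
dat$fold <- ave(seq_len(nrow(dat)), dat$group, FUN = function(i) {
  sample(rep(seq_len(K), length.out = length(i)))
})

# One row per combination of test group, test fold, and train groups.
results <- expand.grid(
  test_group = c("A", "B"),
  test_fold = seq_len(K),
  train_groups = c("same", "other", "all"),
  stringsAsFactors = FALSE
)
results$mse <- NA_real_

for (r in seq_len(nrow(results))) {
  is_test <- dat$group == results$test_group[r] &
    dat$fold == results$test_fold[r]
  in_train_groups <- switch(
    results$train_groups[r],
    same = dat$group == results$test_group[r],
    other = dat$group != results$test_group[r],
    all = rep(TRUE, nrow(dat))
  )
  # The test fold is never used for training, whichever groups are used.
  is_train <- in_train_groups & dat$fold != results$test_fold[r]
  fit <- lm(y ~ x, data = dat[is_train, ])
  pred <- predict(fit, newdata = dat[is_test, ])
  results$mse[r] <- mean((dat$y[is_test] - pred)^2)
}

# Average test error for each choice of train groups.
mean_mse <- aggregate(mse ~ train_groups, data = results, FUN = mean)
print(mean_mse)
```

Because the two simulated groups have opposite slopes, training on the other group should give a much larger test error than training on the same group, which is the "different patterns" case described above.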

Stratification

ResamplingSameOtherCV supports stratified sampling. The stratification variables are assumed to be discrete, and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.
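To illustrate how several stratification variables combine into strata, here is a small base R sketch. The column names label and region and their values are hypothetical, and interaction() is used only to show the combination logic; it is not claimed to be how the package computes strata.

```r
# Hypothetical stratification variables for six observations.
strat <- data.frame(
  label = c("pos", "pos", "neg", "neg", "pos", "neg"),
  region = c("north", "south", "north", "south", "north", "north")
)

# Each observed combination of values forms one stratum.
stratum <- interaction(strat$label, strat$region, drop = TRUE)
print(table(stratum))

# In a Task, the equivalent declaration would be (assuming a task object
# with these two columns):
# task$col_roles$stratum <- c("label", "region")
```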

Grouping

ResamplingSameOtherCV supports grouping of observations. The grouping variable is assumed to be discrete, and must be stored in the Task with column role "group".

The number of cross-validation folds K should be defined as the folds parameter.

In each group, approximately the same number of observations is assigned to each of the K folds. The assignments are stored in $instance$id.dt. The train/test splits are defined by all possible combinations of test group, test fold, and train groups (same/other/all). The splits are stored in $instance$iteration.dt.
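The following sketch shows one way these pieces might fit together on a toy regression task. The data, the column names person, x and y, and the task id "toy" are hypothetical. The example assumes the mlr3, mlr3resampling and data.table packages are installed and that the task is instantiated with the standard mlr3 $instantiate() method.

```r
library(data.table)

# Hypothetical toy data: two groups ("person" values), one feature, one target.
set.seed(1)
toy <- data.table(
  person = rep(c("Alice", "Bob"), each = 30),
  x = runif(60)
)
toy[, y := sin(x) + rnorm(.N, sd = 0.1)]

task <- mlr3::TaskRegr$new("toy", toy, target = "y")
task$col_roles$group <- "person"  # grouping variable, column role "group"
task$col_roles$feature <- "x"     # keep only x as a feature

same_other <- mlr3resampling::ResamplingSameOtherCV$new()
same_other$param_set$values$folds <- 3
same_other$instantiate(task)

# Fold assignments, then the table of train/test splits.
print(same_other$instance$id.dt)
print(same_other$instance$iteration.dt)

# Row ids of the first split; the test fold is held out from training.
train_ids <- same_other$train_set(1)
test_ids <- same_other$test_set(1)
length(intersect(train_ids, test_ids))
```

Printing $instance$iteration.dt is a convenient way to see which test group, test fold, and train groups each iteration number i refers to before calling $train_set(i) or $test_set(i).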

Methods

Method `new()`

Creates a new instance of this R6 class.

Usage

Resampling$new(
  id,
  param_set = ps(),
  duplicated_ids = FALSE,
  label = NA_character_,
  man = NA_character_
)

Arguments

id: (character(1))
Identifier for the new instance.
param_set: (paradox::ParamSet)
Set of hyperparameters.
duplicated_ids: (logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
label: (character(1))
Label for the new instance.
man: (character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().

Method `train_set()`

Returns the row ids of the i-th training set.

Usage

Resampling$train_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.

Method `test_set()`

Returns the row ids of the i-th test set.

Usage

Resampling$test_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.

Examples

# Create the resampling object, then set the number of folds K.
same_other <- mlr3resampling::ResamplingSameOtherCV$new()
same_other$param_set$values$folds <- 5

[Package mlr3resampling version 2024.7.7 Index]