ResamplingSameOtherSizesCV {mlr3resampling} | R Documentation |
Resampling for comparing train subsets and sizes
Description
ResamplingSameOtherSizesCV
defines how a task is partitioned for
resampling, for example in
resample()
or
benchmark()
.
Resampling objects can be instantiated on a
Task
,
which should define at least one group variable.
After instantiation, sets can be accessed via
$train_set(i)
and
$test_set(i)
, respectively.
Details
A supervised learning algorithm inputs a train set and outputs a prediction function, which can be used on a test set. If each data point belongs to a group (such as a geographic region or a year), how do we know whether it is possible to train on one group and predict accurately on another group?
Cross-validation can be used to determine the extent to which this is possible. First, fold IDs from 1 to K are assigned to all data (possibly using stratification, usually by group and label). Then we loop over test sets (group/fold combinations) and train sets (same group, other groups, all groups), and compute test/prediction accuracy for each combination.
Comparing test/prediction accuracy between same and other shows the extent to which training on one group and predicting on another is possible: it is perfect if same and other have similar test accuracy for each group; other is usually somewhat less accurate than same; and other can be just as bad as a featureless baseline when the groups have different patterns.
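The loop described above can be sketched in base R, without this package. This is only an illustration under stated assumptions: the simulated data, the column names (x, y, group, fold), and the use of lm() as the learner are all hypothetical, and the "other" and "all" train sets are assumed to exclude the test fold.

```r
set.seed(1)
n <- 200
K <- 3

# Hypothetical data: two groups that share the same pattern y = x^2 + noise.
dat <- data.frame(
  x = runif(n, -2, 2),
  group = rep(c("A", "B"), each = n / 2))
dat$y <- dat$x^2 + rnorm(n, sd = 0.5)

# Assign fold IDs 1..K separately within each group (stratified by group).
dat$fold <- ave(seq_len(n), dat$group, FUN = function(i) {
  sample(rep(1:K, length.out = length(i)))
})

results <- list()
for (test_group in unique(dat$group)) {
  for (test_fold in 1:K) {
    # One test set per group/fold combination.
    is_test <- dat$group == test_group & dat$fold == test_fold
    # Three train sets, none of which use rows from the test fold.
    train_sets <- list(
      same  = dat$group == test_group & dat$fold != test_fold,
      other = dat$group != test_group & dat$fold != test_fold,
      all   = dat$fold != test_fold)
    for (train_name in names(train_sets)) {
      fit <- lm(y ~ poly(x, 2), data = dat[train_sets[[train_name]], ])
      pred <- predict(fit, newdata = dat[is_test, ])
      results[[length(results) + 1]] <- data.frame(
        test_group, test_fold, train = train_name,
        mse = mean((dat$y[is_test] - pred)^2))
    }
  }
}
results <- do.call(rbind, results)

# Mean test error for each test group and train set type.
aggregate(mse ~ test_group + train, data = results, FUN = mean)
```

Because both simulated groups follow the same pattern, same and other should give similar test error here; changing the pattern for one group would be expected to make other worse than same.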
This class has more parameters/potential applications than
ResamplingSameOtherCV
and
ResamplingVariableSizeTrainCV
,
which are older and should only be preferred
for visualization purposes.
Stratification
ResamplingSameOtherSizesCV
supports stratified sampling.
The stratification variables are assumed to be discrete,
and must be stored in the Task with column role "stratum"
.
In case of multiple stratification variables,
each combination of the values of the stratification variables forms a stratum.
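As a small illustration of how several discrete variables combine into strata, the base R sketch below uses interaction(). The data and column names (region, label) are hypothetical, and this shows only the idea, not necessarily how the package computes strata internally.

```r
# Hypothetical data with two discrete stratification variables.
dat <- data.frame(
  region = c("north", "north", "south", "south", "south"),
  label  = c("pos", "neg", "pos", "pos", "neg"))

# Each observed combination of values defines one stratum.
dat$stratum <- interaction(dat$region, dat$label, drop = TRUE)
table(dat$stratum)
```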
Grouping
ResamplingSameOtherSizesCV
supports grouping of observations.
The grouping variable is assumed to be discrete,
and must be stored in the Task with column role "group"
.
Subsets
ResamplingSameOtherSizesCV
supports training on different
subsets of observations.
The subset variable is assumed to be discrete,
and must be stored in the Task with column role "subset"
.
Parameters
The number of cross-validation folds K should be defined as the
folds
parameter, default 3.
The number of random seeds for down-sampling should be defined as the
seeds
parameter, default 1.
The ratio for down-sampling should be defined as the ratio
parameter, default 0.5. The minimum size of the same and other sets is
repeatedly multiplied by this ratio, to obtain smaller sample sizes.
The number of down-sampling sizes/multiplications should be defined as
the sizes
parameter, which can also take two special values:
default -1 means no down-sampling at all, and 0 means only down-sampling
to the sizes of the same/other sets.
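The arithmetic behind the ratio and sizes parameters can be sketched in base R. The values below are hypothetical, and truncating to whole numbers is an assumption of this sketch; the package may round differently.

```r
# Hypothetical values: smallest same/other train set size, and parameters.
min_size <- 100
ratio <- 0.5
sizes <- 3

# Exponent 0 keeps the minimum size itself (the sizes = 0 case);
# each further step multiplies by ratio once more.
# Truncation to integers is an assumption of this sketch.
down_sample_sizes <- as.integer(min_size * ratio^(0:sizes))
down_sample_sizes
# 100 50 25 12
```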
The ignore_subset
parameter should be either TRUE
or
FALSE
(default), and controls whether to ignore the subset
role. TRUE
creates splits for the same subset only (even if the task
defines a subset
role), and is useful for subtrain/validation
splits (hyper-parameter learning). Note that this feature works on a
task with stratum
and group
roles (unlike
ResamplingCV
).
In each subset, there will be about an equal number of observations
assigned to each of the K folds.
The train/test splits are defined by all possible combinations of
test subset, test fold, train subsets (same/other/all), down-sampling
sizes, and random seeds.
The splits are stored in
$instance$iteration.dt
.
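A minimal end-to-end sketch follows, assuming the mlr3, mlr3resampling, and data.table packages are installed, and that loading mlr3resampling makes the "subset" column role available. The simulated data and the names reg_dt, reg_task, and person are hypothetical.

```r
library(data.table)
library(mlr3)
library(mlr3resampling)  # assumed to make the "subset" column role available

set.seed(1)
N <- 300
# Hypothetical regression data with a discrete column to use as subset.
reg_dt <- data.table(
  x = runif(N, -2, 2),
  person = rep(1:2, each = N / 2))
reg_dt[, y := x^2 + rnorm(N, sd = 0.5)]

reg_task <- TaskRegr$new("sketch", reg_dt, target = "y")
reg_task$col_roles$subset <- "person"
reg_task$col_roles$stratum <- "person"
reg_task$col_roles$feature <- "x"

same_other_sizes <- ResamplingSameOtherSizesCV$new()
same_other_sizes$param_set$values$folds <- 3
same_other_sizes$instantiate(reg_task)

# One row per train/test split.
split_dt <- same_other_sizes$instance$iteration.dt
names(split_dt)
nrow(split_dt)

# Row ids for the first split.
train_ids <- same_other_sizes$train_set(1)
test_ids <- same_other_sizes$test_set(1)
length(intersect(train_ids, test_ids))
```

The instantiated object can then be passed to resample() or benchmark() like any other mlr3 resampling.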
Methods
Public methods
Method new()
Creates a new instance of this R6 class.
Usage
Resampling$new( id, param_set = ps(), duplicated_ids = FALSE, label = NA_character_, man = NA_character_ )
Arguments
id
(character(1))
Identifier for the new instance.
param_set
(paradox::ParamSet)
Set of hyperparameters.
duplicated_ids
(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
label
(character(1))
Label for the new instance.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help package can be opened via method $help().
Method train_set()
Returns the row ids of the i-th training set.
Usage
Resampling$train_set(i)
Arguments
i
(
integer(1)
)
Iteration.
Returns
(integer()
) of row ids.
Method test_set()
Returns the row ids of the i-th test set.
Usage
Resampling$test_set(i)
Arguments
i
(
integer(1)
)
Iteration.
Returns
(integer()
) of row ids.
See Also
Blog post https://tdhock.github.io/blog/2023/R-gen-new-subsets/
Package mlr3 for standard
Resampling
, which does not support comparing train on same or other groups.
score
and Simulations vignette for more detailed examples.
Examples
same_other_sizes <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes$param_set$values$folds <- 5