R: Task Preprocessing Base Class

PipeOpTaskPreproc {mlr3pipelines}

R Documentation

Task Preprocessing Base Class

Description

Base class for handling most "preprocessing" operations. These are operations that have exactly one Task input and one Task output, and expect the column layout of these Tasks during input and output to be the same.

Prediction behavior of preprocessing operations should always be independent for each row in the input Task. This means that the prediction operation of preprocessing PipeOps should commute with rbind(): running prediction on an n-row Task should give the same result as rbind()-ing the prediction results from n 1-row Tasks with the same content. In the large majority of cases, the number and order of rows should also not be changed during prediction.
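
The commutation property can be checked empirically. The following is a minimal sketch, assuming the mlr3 and mlr3pipelines packages are installed; the "scale" PipeOp and the "iris" Task are used purely for illustration. It compares prediction on three rows at once with prediction on three 1-row Tasks.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
op = po("scale")
op$train(list(task))

# Predict on the first three rows in a single call ...
batch = op$predict(list(task$clone()$filter(1:3)))[[1]]$data()

# ... and on each of those rows as its own 1-row Task.
single = lapply(1:3, function(i) {
  op$predict(list(task$clone()$filter(i)))[[1]]$data()
})

# If prediction is independent for each row, both routes agree.
same = isTRUE(all.equal(batch, data.table::rbindlist(single)))
same
```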

Users must implement private$.train_task() and private$.predict_task(), which have a Task input and should return that Task. The Task should, if possible, be manipulated in-place, and should not be cloned.
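
As an illustration of the Task-level interface, the sketch below defines a hypothetical PipeOpDropAllNA class (not part of mlr3pipelines) that removes features consisting only of missing values. The class name, the id and the state member keep are assumptions made for this example.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class: drops features that are entirely NA in training.
PipeOpDropAllNA = R6::R6Class("PipeOpDropAllNA",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "drop_all_na") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .train_task = function(task) {
      dt = task$data(cols = task$feature_names)
      all_na = vapply(dt, function(x) all(is.na(x)), logical(1))
      # The state must be a list; PipeOpTaskPreproc adds its own members to it.
      self$state = list(keep = names(dt)[!all_na])
      # $select() modifies the Task in place and returns it.
      task$select(self$state$keep)
    },
    .predict_task = function(task) {
      # Keep the same features that were kept during training.
      task$select(self$state$keep)
    }
  )
)

# Small made-up data set in which feature b is entirely missing.
toy = as_task_regr(data.frame(y = c(1, 2, 3), a = c(4, 5, 6), b = NA_real_),
  target = "y")
op = PipeOpDropAllNA$new()
trained = op$train(list(toy))[[1]]
trained$feature_names
```

Because the Task is changed through its own methods, no copy of the underlying data is made in either function.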

Alternatively, the private$.train_dt() and private$.predict_dt() functions can be implemented, which operate on data.table objects instead. This should generally only be done if all data is altered in some way (e.g. PCA changing all columns to principal components), and not if only a few columns are added or removed (e.g. feature selection), because the latter should be done at the Task level with private$.train_task(). The private$.select_cols() function can be overloaded so that private$.train_dt() and private$.predict_dt() operate only on subsets of the Task's data, e.g. only on numerical columns.
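
The data.table-level interface might be used as in the following sketch. PipeOpCenterOnly is a hypothetical class written for this example; it subtracts the training mean from every numeric feature and restricts itself to numeric columns through private$.select_cols().

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class: centers numeric features at their training mean.
PipeOpCenterOnly = R6::R6Class("PipeOpCenterOnly",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "center_only") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .select_cols = function(task) {
      # Only numeric features are handed to .train_dt() / .predict_dt().
      ft = task$feature_types
      ft$id[ft$type == "numeric"]
    },
    .train_dt = function(dt, levels, target) {
      center = vapply(dt, mean, numeric(1))
      self$state = list(center = center)
      # A matrix is returned; it can be converted with as.data.table().
      sweep(as.matrix(dt), 2, center)
    },
    .predict_dt = function(dt, levels) {
      # Reuse the means stored during training.
      sweep(as.matrix(dt), 2, self$state$center)
    }
  )
)

op = PipeOpCenterOnly$new()
out = op$train(list(tsk("iris")))[[1]]
colMeans(out$data(cols = out$feature_names))
```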

If the can_subset_cols argument of the constructor is TRUE (the default), then the hyperparameter affect_columns is added, which can limit the columns of the Task that are modified by the PipeOpTaskPreproc using a Selector function. Note that this functionality is entirely independent of the private$.select_cols() functionality.

PipeOpTaskPreproc is useful for operations that behave differently during training and prediction. For operations that do essentially the same thing in both phases and only need extra work to build a $state during training, the PipeOpTaskPreprocSimple class can be used instead.

Format

Abstract R6Class inheriting from PipeOp.

Construction

PipeOpTaskPreproc$new(id, param_set = ps(), param_vals = list(), can_subset_cols = TRUE,
  packages = character(0), task_type = "Task", tags = NULL, feature_types = mlr_reflections$task_feature_types)

id :: character(1)
Identifier of resulting object. See $id slot of PipeOp.
param_set :: ParamSet
Parameter space description. This should be created by the subclass and given to super$initialize().
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
can_subset_cols :: logical(1)
Whether the affect_columns parameter should be added which lets the user limit the columns that are modified by the PipeOpTaskPreproc. This should generally be FALSE if the operation adds or removes rows from the Task, and TRUE otherwise. Default is TRUE.
packages :: character
Set of all required packages for the PipeOp's private$.train() and private$.predict() methods. See $packages slot. Default is character(0).
task_type :: character(1)
The class of Task that should be accepted as input and will be returned as output. This should generally be a character(1) identifying a type of Task, e.g. "Task", "TaskClassif" or "TaskRegr" (or another subclass introduced by other packages). Default is "Task".
tags :: character | NULL
Tags of the resulting PipeOp. These are added to the tag "data transform". Default NULL.
feature_types :: character
Feature types affected by the PipeOp. See private$.select_cols() for more information. Defaults to all available feature types.
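
The construction arguments are typically forwarded from a subclass's own initialize() method. The sketch below shows one possible arrangement; PipeOpShift and its offset hyperparameter are hypothetical names invented for this example, and the paradox package is assumed to provide ps() and p_dbl().

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class: adds a constant offset to numeric features.
PipeOpShift = R6::R6Class("PipeOpShift",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "shift", param_vals = list()) {
      # The subclass creates its own ParamSet and hands it to super$initialize().
      param_set = paradox::ps(
        offset = paradox::p_dbl(tags = c("train", "predict", "required"))
      )
      super$initialize(
        id = id,
        param_set = param_set,
        param_vals = param_vals,      # passed on as described above
        feature_types = "numeric"     # default .select_cols() uses this
      )
    }
  ),
  private = list(
    .train_dt = function(dt, levels, target) {
      self$state = list()
      as.matrix(dt) + self$param_set$values$offset
    },
    .predict_dt = function(dt, levels) {
      as.matrix(dt) + self$param_set$values$offset
    }
  )
)

op = PipeOpShift$new(param_vals = list(offset = 10))
shifted = op$train(list(tsk("iris")))[[1]]
head(shifted$data(cols = "Sepal.Length"))
```

Since can_subset_cols is left at its default here, the resulting object also has the affect_columns hyperparameter in addition to offset.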

Input and Output Channels

PipeOpTaskPreproc has one input channel named "input", taking a Task, or a subclass of Task if the task_type construction argument is given as such; both during training and prediction.

PipeOpTaskPreproc has one output channel named "output", producing a Task, or a subclass; the Task type is the same as for input; both during training and prediction.

The output Task is the input Task as modified by the overloaded private$.train_task()/private$.predict_task() or private$.train_dt()/private$.predict_dt() functions.

State

The $state is a named list; besides members added by inheriting classes, the members are:

affect_cols :: character
Names of features being selected by the affect_columns parameter, if present; names of all present features otherwise.
intasklayout :: data.table
Copy of the training Task's $feature_types slot. This is used during prediction to ensure that the prediction Task has the same features, feature layout, and feature types as during training.
outtasklayout :: data.table
Copy of the trained Task's $feature_types slot. This is used during prediction to ensure that the Task resulting from the prediction operation has the same features, feature layout, and feature types as after training.
dt_columns :: character
Names of features selected by the private$.select_cols() call during training. This is only present if the private$.train_dt() functionality is used, and not present if the private$.train_task() function is overloaded instead.
feature_types :: character
Feature types affected by the PipeOp. See private$.select_cols() for more information.
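
The state members can be inspected after training. The following sketch uses the "pca" PipeOp on the "iris" Task as an arbitrary example; it assumes that this PipeOp is implemented through private$.train_dt(), so that dt_columns is present.

```r
library(mlr3)
library(mlr3pipelines)

op = po("pca")
op$train(list(tsk("iris")))

# Members added by the inheriting class plus the ones described above.
names(op$state)

# affect_columns is not set, so all features are listed here.
op$state$affect_cols

# Feature names and types before and after the operation.
op$state$intasklayout
op$state$outtasklayout
```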

Parameters

affect_columns :: function | Selector | NULL
What columns the PipeOpTaskPreproc should operate on. This parameter is only present if the constructor is called with the can_subset_cols argument set to TRUE (the default).
The parameter must be a Selector function, which takes a Task as argument and returns a character vector of the features to use.
See Selector for example functions. Defaults to NULL, which selects all features.
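
A short usage sketch, assuming the "scale" PipeOp and the selector_grep() helper from mlr3pipelines: only the features whose names start with "Sepal" are scaled, while the remaining features pass through unchanged.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# Restrict the operation to features matching the pattern "^Sepal".
op = po("scale", affect_columns = selector_grep("^Sepal"))
out = op$train(list(task))[[1]]

# Features selected by affect_columns.
op$state$affect_cols

# A feature outside the selection is left as it was.
unchanged = isTRUE(all.equal(
  out$data(cols = "Petal.Length"),
  task$data(cols = "Petal.Length")
))
unchanged
```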

Internals

PipeOpTaskPreproc is an abstract class inheriting from PipeOp. It implements the private$.train() and private$.predict() functions. These functions perform checks and go on to call private$.train_task() and private$.predict_task(). A subclass of PipeOpTaskPreproc may implement these functions, or implement private$.train_dt() and private$.predict_dt() instead. This works by having the default implementations of private$.train_task() and private$.predict_task() call private$.train_dt() and private$.predict_dt(), respectively.

The affect_columns functionality works by removing the "feature" col_role from the columns that are not selected before processing, and restoring them afterwards by setting their col_role back to "feature".

Fields

Fields inherited from PipeOp.

Methods

Methods inherited from PipeOp, as well as:

.train_task
(Task) -> Task
Called by the PipeOpTaskPreproc's implementation of private$.train(). Takes a single Task as input and modifies it (ideally in-place without cloning) while storing information in the $state slot. Note that unlike private$.train(), the argument is not a list but a single Task, and the return object is also not a list but a single Task. Also, contrary to private$.train(), the $state being generated must be a list, to which the PipeOpTaskPreproc will add additional slots (see Section State). Care should be taken to avoid name collisions between $state elements added by private$.train_task() and PipeOpTaskPreproc.
By default this function calls the private$.train_dt() function, but it can be overloaded to perform operations on the Task directly.
.predict_task
(Task) -> Task
Called by the PipeOpTaskPreproc's implementation of private$.predict(). Takes a single Task as input and modifies it (ideally in-place without cloning) while using information in the $state slot. Works analogously to private$.train_task(). private$.predict_task() should only be overloaded if private$.train_task() is overloaded (i.e. private$.train_dt() is not used).
.train_dt(dt, levels, target)
(data.table, named list, any) -> data.table | data.frame | matrix
Train PipeOpTaskPreproc on dt, transform it and store a state in $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
The levels argument is a named list of factor levels for factor or character features. If the input Task inherits from TaskSupervised, the target argument contains the $truth() information of the training Task; its type depends on the Task type being trained on.
This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.predict_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
.predict_dt(dt, levels)
(data.table, named list) -> data.table | data.frame | matrix
Predict on new data in dt, possibly using the stored $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
The levels argument is a named list of factor levels for factor or character features.
This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
.select_cols(task)
(Task) -> character
Selects which columns the PipeOp operates on, if private$.train_dt() and private$.predict_dt() are overloaded. This function is not called if private$.train_task() and private$.predict_task() are overloaded. In contrast to the affect_columns parameter, private$.select_cols() is for the inheriting class to determine which columns the operator should function on, e.g. based on feature type, while affect_columns is a way for the user to limit the columns that a PipeOpTaskPreproc should operate on.
This method can optionally be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and private$.predict_dt(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
If this method is not overloaded, it defaults to selecting all features of the types indicated by the feature_types construction argument.