| PipeOpTaskPreproc {mlr3pipelines} | R Documentation |
Task Preprocessing Base Class
Description
Base class for handling most "preprocessing" operations. These
are operations that have exactly one Task input and one Task output,
and expect the column layout of these Tasks during input and output
to be the same.
Prediction behavior of preprocessing operations should always be independent for each row of the input Task.
This means that the prediction operation of preprocessing PipeOps should commute with rbind(): running prediction
on an n-row Task should give the same result as rbind()-ing the prediction results of n
1-row Tasks with the same content. In the large majority of cases, the number and order of rows
should also not be changed during prediction.
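The rbind() property can be checked by hand. The following is a minimal sketch, assuming the mlr3 and mlr3pipelines packages are installed and using the built-in "scale" PipeOp and the "iris" Task purely as illustration; data.table::rbindlist() stands in for rbind().

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
feats = task$feature_names
op = po("scale")
op$train(list(task))

# Predict on the whole Task at once, then look at the first five rows.
full = op$predict(list(task))[[1]]$data(rows = 1:5, cols = feats)

# Predict on five separate 1-row Tasks and bind the results together.
# Task$filter() works in-place, so each 1-row Task is made from a clone.
single = lapply(1:5, function(i) {
  one_row = task$clone()$filter(i)
  op$predict(list(one_row))[[1]]$data(cols = feats)
})
bound = data.table::rbindlist(single)

same = isTRUE(all.equal(as.matrix(full), as.matrix(bound),
  check.attributes = FALSE))
```

If the PipeOp predicts each row independently, `same` is expected to be TRUE.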
Users must implement private$.train_task() and private$.predict_task(), which have a Task
input and should return that Task. The Task should, if possible, be
manipulated in-place, and should not be cloned.
Alternatively, the private$.train_dt() and private$.predict_dt() functions can be implemented, which operate on
data.table objects instead. This should generally only be done if all
data is in some way altered (e.g. PCA changing all columns to principal components) and not if only
a few columns are added or removed (e.g. feature selection) because this should be done at the Task-level
with private$.train_task(). The private$.select_cols() function can be overloaded for private$.train_dt() and private$.predict_dt()
to operate only on subsets of the Task's data, e.g. only on numerical columns.
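As an illustration of the data.table-level interface, the following is a minimal sketch of a subclass that centers numeric features. The class name PipeOpCenterSketch, its id and the state element "means" are hypothetical and not part of mlr3pipelines; the sketch assumes mlr3, mlr3pipelines and R6 are installed.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class, not part of mlr3pipelines.
PipeOpCenterSketch = R6::R6Class("PipeOpCenterSketch",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "center_sketch", param_vals = list()) {
      # feature_types limits the default private$.select_cols() to numeric columns.
      super$initialize(id = id, param_vals = param_vals,
        feature_types = c("numeric", "integer"))
    }
  ),
  private = list(
    .train_dt = function(dt, levels, target) {
      # Learn the column means and store them as a list in $state.
      means = vapply(dt, mean, numeric(1))
      self$state = list(means = means)
      # A matrix is an accepted return type; it is converted to a data.table.
      sweep(as.matrix(dt), 2, means[colnames(dt)])
    },
    .predict_dt = function(dt, levels) {
      # Reuse the training means; nothing is re-estimated during prediction.
      sweep(as.matrix(dt), 2, self$state$means[colnames(dt)])
    }
  )
)

op = PipeOpCenterSketch$new()
task = tsk("iris")
trained = op$train(list(task))[[1]]
centered_means = colMeans(as.matrix(trained$data(cols = trained$feature_names)))
predicted = op$predict(list(task))[[1]]
```

After training, the feature columns of the returned Task should have means of (numerically) zero, and prediction subtracts the same stored means from new data.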
If the can_subset_cols argument of the constructor is TRUE (the default), then the hyperparameter affect_columns
is added, which can limit the columns of the Task that are modified by the PipeOpTaskPreproc
using a Selector function. Note that this functionality is entirely independent of the private$.select_cols() functionality.
PipeOpTaskPreproc is useful for operations that behave differently during training and prediction. For operations
that perform essentially the same operation and only need to perform extra work to build a $state during training,
the PipeOpTaskPreprocSimple class can be used instead.
Format
Abstract R6Class inheriting from PipeOp.
Construction
PipeOpTaskPreproc$new(id, param_set = ps(), param_vals = list(), can_subset_cols = TRUE, packages = character(0), task_type = "Task", tags = NULL, feature_types = mlr_reflections$task_feature_types)
- id :: character(1)
  Identifier of resulting object. See $id slot of PipeOp.
- param_set :: ParamSet
  Parameter space description. This should be created by the subclass and given to super$initialize().
- param_vals :: named list
  List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
- can_subset_cols :: logical(1)
  Whether the affect_columns parameter should be added which lets the user limit the columns that are modified by the PipeOpTaskPreproc. This should generally be FALSE if the operation adds or removes rows from the Task, and TRUE otherwise. Default is TRUE.
- packages :: character
  Set of all required packages for the PipeOp's private$.train() and private$.predict() methods. See $packages slot. Default is character(0).
- task_type :: character(1)
  The class of Task that should be accepted as input and will be returned as output. This should generally be a character(1) identifying a type of Task, e.g. "Task", "TaskClassif" or "TaskRegr" (or another subclass introduced by other packages). Default is "Task".
- tags :: character | NULL
  Tags of the resulting PipeOp. This is added to the tag "data transform". Default NULL.
- feature_types :: character
  Feature types affected by the PipeOp. See private$.select_cols() for more information. Defaults to all available feature types.
Input and Output Channels
PipeOpTaskPreproc has one input channel named "input", taking a Task, or a subclass of
Task if the task_type construction argument is given as such; both during training and prediction.
PipeOpTaskPreproc has one output channel named "output", producing a Task, or a subclass;
the Task type is the same as for input; both during training and prediction.
The output Task is the input Task, modified according to the overloaded
private$.train_task()/private$.predict_task() or private$.train_dt()/private$.predict_dt() functions.
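The channel layout can be inspected on any concrete subclass. A short sketch, assuming mlr3pipelines is installed and using the built-in "scale" PipeOp as a stand-in for a PipeOpTaskPreproc subclass constructed with the default task_type:

```r
library(mlr3)
library(mlr3pipelines)

op = po("scale")

# $input and $output are tables describing the channels: their names and the
# object classes accepted or produced during training and prediction.
input_channels = op$input
output_channels = op$output
print(input_channels)
print(output_channels)
```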
State
The $state is a named list; besides members added by inheriting classes, the members are:
- affect_cols :: character
  Names of features being selected by the affect_columns parameter, if present; names of all present features otherwise.
- intasklayout :: data.table
  Copy of the training Task's $feature_types slot. This is used during prediction to ensure that the prediction Task has the same features, feature layout, and feature types as during training.
- outtasklayout :: data.table
  Copy of the trained Task's $feature_types slot. This is used during prediction to ensure that the Task resulting from the prediction operation has the same features, feature layout, and feature types as after training.
- dt_columns :: character
  Names of features selected by the private$.select_cols() call during training. This is only present if the private$.train_dt() functionality is used, and not present if the private$.train_task() function is overloaded instead.
- feature_types :: character
  Feature types affected by the PipeOp. See private$.select_cols() for more information.
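To see these members next to the ones added by a subclass, the $state of a trained PipeOp can be inspected directly. A brief sketch, assuming mlr3pipelines is installed and again using the built-in "scale" PipeOp on the "iris" Task as an example:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
op = po("scale")
op$train(list(task))

# Names of all $state members: those of the subclass plus those added by
# PipeOpTaskPreproc.
state_names = names(op$state)
print(state_names)

# The stored input layout has one row per feature of the training Task.
in_layout = op$state$intasklayout
print(in_layout)
```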
Parameters
- affect_columns :: function | Selector | NULL
  What columns the PipeOpTaskPreproc should operate on. This parameter is only present if the constructor is called with the can_subset_cols argument set to TRUE (the default).
  The parameter must be a Selector function, which takes a Task as argument and returns a character of features to use.
  See Selector for example functions. Defaults to NULL, which selects all features.
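As a usage sketch, affect_columns can be set on a concrete subclass to restrict which features are touched. The example assumes mlr3pipelines is installed and uses the built-in "scale" PipeOp together with selector_grep(); the choice of the "iris" Task and of the "^Sepal" pattern is only illustrative.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# Only features whose names start with "Sepal" are passed to the operation.
op = po("scale", affect_columns = selector_grep("^Sepal"))
out = op$train(list(task))[[1]]

petal_before = task$data(cols = "Petal.Length")[[1]]
petal_after = out$data(cols = "Petal.Length")[[1]]
sepal_mean_after = mean(out$data(cols = "Sepal.Length")[[1]])
```

The Petal columns should come out unchanged, while the Sepal columns are scaled.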
Internals
PipeOpTaskPreproc is an abstract class inheriting from PipeOp. It implements the private$.train() and
private$.predict() functions. These functions perform checks and go on to call private$.train_task() and private$.predict_task().
A subclass of PipeOpTaskPreproc may implement these functions, or implement private$.train_dt() and private$.predict_dt() instead.
This works by having the default implementations of private$.train_task() and private$.predict_task() call private$.train_dt() and private$.predict_dt(),
respectively.
The affect_columns functionality works by temporarily hiding the columns that are not selected: their "feature"
col_role is removed before processing, and restored afterwards by setting the col_role to "feature" again.
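The column-role mechanism can be imitated directly on a Task. This is a sketch of the idea only, not the package's actual code; it assumes mlr3 is installed, and the variable names keep and hidden are made up for the example.

```r
library(mlr3)

task = tsk("iris")
all_features = task$feature_names

# Columns the (hypothetical) operation should see, and the ones to hide.
keep = c("Sepal.Length", "Sepal.Width")
hidden = setdiff(task$feature_names, keep)

# Hide the other columns by taking away their "feature" role.
task$col_roles$feature = setdiff(task$col_roles$feature, hidden)
visible_during_processing = task$feature_names

# ... the preprocessing operation would run here ...

# Restore the hidden columns by giving them the "feature" role again.
task$col_roles$feature = union(task$col_roles$feature, hidden)
visible_afterwards = task$feature_names
```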
Fields
Fields inherited from PipeOp.
Methods
Methods inherited from PipeOp, as well as:
- .train_task
  (Task) -> Task
  Called by the PipeOpTaskPreproc's implementation of private$.train(). Takes a single Task as input and modifies it (ideally in-place without cloning) while storing information in the $state slot. Note that unlike private$.train(), the argument is not a list but a single Task, and the return object is also not a list but a single Task. Also, contrary to private$.train(), the $state being generated must be a list, to which the PipeOpTaskPreproc will add additional slots (see Section State). Care should be taken to avoid name collisions between $state elements added by private$.train_task() and PipeOpTaskPreproc.
  By default this function calls the private$.train_dt() function, but it can be overloaded to perform operations on the Task directly.
- .predict_task
  (Task) -> Task
  Called by the PipeOpTaskPreproc's implementation of private$.predict(). Takes a single Task as input and modifies it (ideally in-place without cloning) while using information in the $state slot. Works analogously to private$.train_task(). private$.predict_task() should only be overloaded if private$.train_task() is overloaded (i.e. private$.train_dt() is not used).
- .train_dt(dt, levels, target)
  (data.table, named list, any) -> data.table | data.frame | matrix
  Train PipeOpTaskPreproc on dt, transform it and store a state in $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
  The levels argument is a named list of factor levels for factorial or character features. If the input Task inherits from TaskSupervised, the target argument contains the $truth() information of the training Task; its type depends on the Task type being trained on.
  This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.predict_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
- .predict_dt(dt, levels)
  (data.table, named list) -> data.table | data.frame | matrix
  Predict on new data in dt, possibly using the stored $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
  The levels argument is a named list of factor levels for factorial or character features.
  This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
- .select_cols(task)
  (Task) -> character
  Selects which columns the PipeOp operates on, if private$.train_dt() and private$.predict_dt() are overloaded. This function is not called if private$.train_task() and private$.predict_task() are overloaded. In contrast to the affect_columns parameter, private$.select_cols() is for the inheriting class to determine which columns the operator should function on, e.g. based on feature type, while affect_columns is a way for the user to limit the columns that a PipeOpTaskPreproc should operate on.
  This method can optionally be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and private$.predict_dt(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
  If this method is not overloaded, it defaults to selecting the features of the types indicated by the feature_types construction argument.
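For operations that add or remove columns, the Task-level methods are the natural place. The following is a minimal sketch of a subclass that overloads private$.train_task() and private$.predict_task() to drop features that contain missing values in the training data. The class name PipeOpDropNASketch, its id, the state element "keep" and the toy data are all hypothetical; mlr3, mlr3pipelines and R6 are assumed to be installed.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical example class, not part of mlr3pipelines.
PipeOpDropNASketch = R6::R6Class("PipeOpDropNASketch",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "drop_na_sketch", param_vals = list()) {
      super$initialize(id = id, param_vals = param_vals)
    }
  ),
  private = list(
    .train_task = function(task) {
      # Decide which features to keep and remember them in $state (a list).
      has_na = vapply(task$data(cols = task$feature_names), anyNA, logical(1))
      keep = names(has_na)[!has_na]
      self$state = list(keep = keep)
      # Task$select() modifies the Task in-place; no clone is needed here.
      task$select(keep)
      task
    },
    .predict_task = function(task) {
      # Apply the selection learned during training.
      task$select(self$state$keep)
      task
    }
  )
)

toy = data.frame(y = c(1, 2, 3, 4), a = c(0.1, 0.2, 0.3, 0.4), b = c(NA, 1, 2, 3))
task = as_task_regr(toy, target = "y", id = "toy")

op = PipeOpDropNASketch$new()
trained = op$train(list(task))[[1]]
predicted = op$predict(list(task))[[1]]
```

Here only feature "a" remains after training and prediction, because "b" contains a missing value in the training data. The Task passed in by the caller is left untouched.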
See Also
https://mlr-org.com/pipeops.html
Other mlr3pipelines backend related:
Graph,
PipeOp,
PipeOpTargetTrafo,
PipeOpTaskPreprocSimple,
mlr_graphs,
mlr_pipeops,
mlr_pipeops_updatetarget
Other PipeOps:
PipeOp,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_textvectorizer,
mlr_pipeops_threshold,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson