| makeCPO {mlrCPO} | R Documentation |
Create a Custom CPO Constructor
Description
makeCPO creates a Feature Operation CPOConstructor, i.e. a constructor for a CPO that will
operate on feature columns. makeCPOTargetOp creates a Target Operation CPOConstructor, which
creates CPOs that operate on the target column. makeCPORetrafoless creates a Retrafoless CPOConstructor,
which creates CPOs that may operate on both feature and target columns, but have no retrafo operation. See OperatingType for further
details on the distinction of these. makeCPOExtendedTrafo creates a Feature Operation CPOConstructor that
has slightly more flexibility in its data transformation behaviour than makeCPO (but is otherwise identical).
makeCPOExtendedTargetOp creates a Target Operation CPOConstructor that has slightly more flexibility in its
data transformation behaviour than makeCPOTargetOp but is otherwise identical.
See the Examples section for some simple custom CPOs.
Usage
makeCPO(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered",
    "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass",
    "twoclass", "multiclass"),
  packages = character(0),
  cpo.train,
  cpo.retrafo
)
makeCPOExtendedTrafo(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered",
    "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass",
    "twoclass", "multiclass"),
  packages = character(0),
  cpo.trafo,
  cpo.retrafo
)
makeCPORetrafoless(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.all", "task"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = c("cluster", "classif", "multilabel", "regr", "surv", "oneclass",
    "twoclass", "multiclass"),
  packages = character(0),
  cpo.trafo
)
makeCPOTargetOp(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered",
    "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = "cluster",
  task.type.out = NULL,
  predict.type.map = c(response = "response"),
  packages = character(0),
  constant.invert = FALSE,
  cpo.train,
  cpo.retrafo,
  cpo.train.invert,
  cpo.invert
)
makeCPOExtendedTargetOp(
  cpo.name,
  par.set = makeParamSet(),
  par.vals = NULL,
  dataformat = c("df.features", "split", "df.all", "task", "factor", "ordered",
    "numeric"),
  dataformat.factor.with.ordered = TRUE,
  export.params = TRUE,
  fix.factors = FALSE,
  properties.data = c("numerics", "factors", "ordered", "missings"),
  properties.adding = character(0),
  properties.needed = character(0),
  properties.target = "cluster",
  task.type.out = NULL,
  predict.type.map = c(response = "response"),
  packages = character(0),
  constant.invert = FALSE,
  cpo.trafo,
  cpo.retrafo,
  cpo.invert
)
Arguments
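cpo.name
The name of the resulting CPOConstructor and of the CPOs it creates. It is also used as the default ID of the CPO (see section ‘ID’).

par.set
The hyperparameters of the CPO, specified as a ParamSet (see section ‘Hyperparameters’). Default is the empty ParamSet, makeParamSet().

par.vals
Default values for the parameters in par.set, as an alternative to specifying them in the ParamSet itself (see section ‘Hyperparameters’). Default is NULL.

dataformat
The format in which data is presented to the cpo.* functions (see section ‘Data Format’). One of “df.features”, “split”, “df.all”, “task”, or [type].
[type] can be any one of “factor”, “numeric”, “ordered”; if these are given, only a subset of the total data present is seen by the CPO.
If the CPO is a Feature Operation CPO, then the return value of the cpo.retrafo function must be in the same format as the data it received.
Default is “df.features” for all functions except makeCPORetrafoless, for which it is “df.all”.

dataformat.factor.with.ordered
Used together with dataformat to request the format of the data (see section ‘Data Format’); as the name suggests, it concerns whether “ordered” columns are handled together with “factor” columns. Default is TRUE.

export.params
Determines which hyperparameters are exported by default, i.e. made available to setHyperPars (see section ‘Hyperparameters’). Default is TRUE.

fix.factors
Whether the framework should ensure that factor levels are the same during training and prediction (see section ‘Fix Factors’). Default is FALSE.
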
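properties.data
The kind of data that the CPO can handle (see section ‘Properties’). Default is c("numerics", "factors", "ordered", "missings").

properties.adding
The data handling capabilities that the CPO adds to a following CPO or Learner when it comes before it in a pipeline (see section ‘Properties’).
Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. This distinction is made by not putting them in the ‘minimal’ set of adding properties, adding.min. Default is character(0).

properties.needed
The kind of data that a CPO or Learner coming after this CPO must be able to handle (see section ‘Properties’).
Property names may be postfixed with “.sometimes”, to indicate that adherence should not be checked internally. Such properties are not declared externally, but are kept in the ‘maximal’ set of needed properties, needed.max. Default is character(0).

properties.target
The Task types and target properties that the CPO can handle (see section ‘Properties’). For Target Operation CPOs, this must contain exactly one of “cluster”, “classif”, “multilabel”, “regr”, “surv”.
This indicates the type of Task the CPO operates on. Default is “cluster” for Target Operation CPOs, and the full vector shown under ‘Usage’ otherwise.

packages
Packages that must be installed for the CPO to work. Their availability is checked during construction (see section ‘Packages’). Default is character(0).
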
cpo.train
The function that collects all relevant information from the training data (see section ‘Data Transformation’). The information is carried over either as a control object or, for functional CPOs, through functions created by cpo.train (see section ‘cpo.train-cpo.retrafo information transfer’).
The behaviour of this function differs slightly in Feature Operation and Target Operation CPOs.

cpo.retrafo
This function gets called during the “retransformation” step, where prediction data is given to the CPORetrafo object. It is also called when the CPO is applied to training data, directly after cpo.train (see section ‘Data Transformation’).
In Feature Operation CPOs, this function receives the data to be transformed and must return the transformed data in the same format as it received it.
In Target Operation CPOs, it transforms the target (see section ‘Inversion’). It may be NULL for functional CPOs.

cpo.trafo
This function's primary task is to transform the given data when the CPO is applied to training data.
For CPOs that are not Retrafoless, a unit of information to be carried over to the retrafo step needs to be created inside the cpo.trafo function (see section ‘Data Transformation’).

task.type.out
The Task type that a Target Operation CPO converts to (see section ‘Task Conversion’). Default is NULL.

predict.type.map
Determines, together with the requested predict.type, the predict.type of the prediction that is given to cpo.invert (see PredictType and section ‘Target Column Format’). Default is c(response = "response").

constant.invert
Whether the inversion step does not require information from the retrafo step (see section ‘Inversion’). Default is FALSE.

cpo.train.invert
This function receives the feature columns given for prediction, together with the control object created by cpo.train, and must return a control object that will be passed on to the cpo.invert function (see section ‘Inversion’).

cpo.invert
This function performs the inversion for a Target Operation CPO. It takes a control object, which summarizes information from the training and retrafo step, and the prediction as returned by a machine learning model, and undoes the operation done to the target column in the trafo step.
For example, if the trafo step consisted of taking the logarithm of a regression target, the cpo.invert function would exponentiate the prediction.
As a more elaborate example, a CPO could train a model on the training data and set the target values to the residuals of that trained model.
Value
[CPOConstructor]. A Constructor for CPOs.
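As a minimal sketch of how the returned CPOConstructor is typically used: calling the constructor (optionally with hyperparameter values) creates a CPO, applying that CPO to data with %>>% performs the training transformation, and retrafo retrieves the CPORetrafo for new data. The CPO xmpShift and its offset hyperparameter are made up for illustration and are not part of mlrCPO; the sketch assumes that mlrCPO (with mlr and ParamHelpers) is installed.

```r
library(mlrCPO)  # assumed to attach mlr and ParamHelpers as well

# Hypothetical CPO: centers numeric features on the training means,
# then shifts them by the hyperparameter 'offset'.
xmpShift = makeCPO("xmpShift",
  par.set = makeParamSet(makeNumericParam("offset", default = 0)),
  dataformat = "numeric",
  cpo.train = function(data, target, offset) {
    colMeans(data) - offset  # control object: amount to subtract per column
  },
  cpo.retrafo = function(data, control, offset) {
    # subtract the stored values column-wise; works on any data, old or new
    as.data.frame(sweep(as.matrix(data), 2, control))
  })

cpo = xmpShift(offset = 10)    # calling the constructor creates a CPO
train.data = iris[1:100, ]
new.data = iris[101:150, ]

trafo = train.data %>>% cpo    # training: cpo.train, then cpo.retrafo
retr = retrafo(trafo)          # the CPORetrafo attached to the result
pred = new.data %>>% retr      # prediction: cpo.retrafo with the stored control

mean(trafo$Sepal.Length)       # approximately 10, the chosen offset
```

The new data is shifted by the training means, not by its own means, which is the behaviour required of Feature Operation CPOs.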
CPO Internals
The mlrCPO package offers a powerful framework for handling the tasks necessary for preprocessing, so that the user, when creating custom CPOs, can focus on the actual data transformations to perform. It is, however, useful to understand what it is that the framework does, and how the process can be influenced by the user during CPO definition or application. Aspects of preprocessing that the user needs to influence are:
- Operating Type
The core of preprocessing is the actual transformation being performed. In the most general sense, there are three points in a machine learning pipeline that preprocessing can influence.
1. Transformation of training data before model fitting, done in mlr using train. In the CPO framework (when not using a CPOLearner, which makes all of these steps transparent to the user), this is done by a CPO.
2. Transformation of new validation or prediction data that is given to the fitted model for prediction, done using predict. This is done by a CPORetrafo retrieved using retrafo from the result of step 1.
3. Transformation of the predictions made, to invert the transformation of the target values done in step 1. This is done using the CPOInverter retrieved using inverter from the result of step 2.
The framework imposes restrictions on primitive (i.e. not compounded using composeCPO) CPOs to simplify internal operation: A CPO may be one of three OperatingTypes (see there). The Feature Operation CPO does not transform target columns and hence only needs to be involved in steps 1 and 2. The Target Operation CPO only transforms target columns, and therefore mostly concerns itself with steps 1 and 3. A Retrafoless CPO may change both feature and target columns, but may not perform a retrafo or inverter operation (and is therefore only concerned with step 1). Note that this is effectively a restriction on what kind of transformation a Retrafoless CPO may perform: it must not be a transformation of the data or target space, it may only add or subtract points within this space.
The Operating Type of a CPO is ultimately dependent on the function that was used to create the CPOConstructor: makeCPO / makeCPOExtendedTrafo, makeCPOTargetOp / makeCPOExtendedTargetOp, or makeCPORetrafoless.
- Data Transformation
At the core of a CPO is the modification of data it performs. For Feature Operation CPOs, the transformation of each row, during training and prediction, should happen in the same way, and it may only depend on the entirety of the training data, i.e. the value of a data row in a prediction data set may not influence the transformation of a different prediction data row. Furthermore, if a data row occurs in both training and prediction data, its transformation result should ideally be the same.
This property is ensured by makeCPO by splitting the transformation into two functions: one function that collects all relevant information from the training data (called cpo.train), and one that transforms given data, using this collected information and (potentially new, unseen) data to be transformed (called cpo.retrafo). The cpo.retrafo function should handle all data as if it were prediction data and unrelated to the data given to cpo.train.
Internally, when a CPO gets applied to a data set using applyCPO, the cpo.train function is called, and the resulting control object is used for a subsequent cpo.retrafo call which transforms the data. Before the result is given back from the applyCPO call, the control object is used to create a CPORetrafo object, which is attached to the result as an attribute. Target Operation CPOs additionally create and add a CPOInverter object.
When a CPORetrafo is then applied to new prediction data, the control object previously returned by cpo.train is given, combined with this new data, to another cpo.retrafo call that performs the new transformation.
makeCPOExtendedTrafo gives more flexibility by calling only the cpo.trafo function in the training step, which both creates a control object and modifies the data. This can increase performance if the underlying operation creates a control object and the transformed data in one step, as for example PCA does. Note that the requirement that the same row in training and prediction data should result in the same transformation result still stands. The cpo.trafo function returns the transformed data and creates a local variable with the control information, which the CPO framework will access.
- Inversion
If a CPO performs transformations of the target column, the predictions made by a following machine learning process should ideally have this transformation undone, so that if the process makes a prediction that coincides with a target value after the transformation, the whole pipeline should return a prediction that equals the target value before this transformation.
This is done by the cpo.invert function given to makeCPOTargetOp. It has access to information from both the preceding training and prediction steps. During the training step, cpo.train creates a control object that is not only given to cpo.retrafo, but also to cpo.train.invert. This latter function is called before the prediction step, whenever new data is fed to the machine learning process. It takes the new data and the old control object and transforms it to a new control.invert object to include information about the prediction data. This object is then given to cpo.invert.
It is possible to have Target Operation CPOs that do not require information from the retrafo step. This is specified by setting constant.invert to TRUE. It has the advantage that the same CPOInverter can be used for inversion of predictions made with any new data. Otherwise, a new CPOInverter object must be obtained for each new data set after the retrafo step (using the inverter function on the retrafo result). Having constant.invert set to TRUE results in hybrid retrafo / inverter objects: The CPORetrafo object can then also be used for inversions. When defining a constant.invert Target Operation CPO, no cpo.train.invert function is given, and the same control object is given to both cpo.retrafo and cpo.invert.
makeCPOExtendedTargetOp gives more flexibility and allows more efficient implementation of Target Operation CPOs at the cost of more complexity. With this method, a cpo.trafo function is given that is executed during the first training step; it must return the transformed target column, as well as a control and control.invert object. The cpo.retrafo function not only transforms the target, but must also create a new control.invert object (unless constant.invert is TRUE). The semantics of cpo.invert are identical to those of the basic makeCPOTargetOp.
- cpo.train-cpo.retrafo information transfer
One possibility to transfer information from cpo.train to cpo.retrafo is to have cpo.train return a control object (a list) that is then given to cpo.retrafo. The CPO is then called an object based CPO.
Another possibility is to not give the cpo.retrafo argument (set it to NULL in the makeCPO call) and have cpo.train return a function instead. This function is then used as the cpo.retrafo function, and should have access to all relevant information about the training data as a closure. This is called a functional CPO. To save memory, the actual data (including target) given to cpo.train is removed from the environment of its return value in this case (i.e. the environment of the cpo.retrafo function). This means the cpo.retrafo function may not reference a “data” variable.
There are similar possibilities of functional information transfer for other types of CPOs: cpo.trafo in makeCPOExtendedTrafo may create a cpo.retrafo function instead of a control object. cpo.train in makeCPOTargetOp has the option of creating a cpo.retrafo and cpo.train.invert (cpo.invert if constant.invert is TRUE) function (and returning NULL) instead of returning a control object. Similarly, cpo.train.invert may return a cpo.invert function instead of a control.invert object. In makeCPOExtendedTargetOp, cpo.trafo may create a cpo.retrafo or a cpo.invert function, each optionally instead of a control or control.invert object (one or both may be functional). cpo.retrafo similarly may create a cpo.invert function instead of giving a control.invert object. Functional information transfer may be more parsimonious and elegant than control object information transfer.
- Hyperparameters
The action performed by a CPO may be influenced using hyperparameters, during its construction as well as afterwards (then using setHyperPars). Hyperparameters must be specified as a ParamSet and given as argument par.set. Default values for each parameter may be specified in this ParamSet or optionally as another argument par.vals.
Hyperparameters given are made part of the CPOConstructor function and can thus be given during construction. Parameter default values function as the default values for the CPOConstructor function parameters (which are thus made optional function parameters of the CPOConstructor function). The CPO framework handles storage and changing of hyperparameter values. When the cpo.train and cpo.retrafo functions are called to transform data, the hyperparameter values are given to them as arguments, so cpo.train and cpo.retrafo functions must be able to accept these parameters, either directly, or with a ... argument.
Note that with functional CPOs, the cpo.retrafo function does not take hyperparameter arguments (and instead can usually refer to them by its environment).
Hyperparameters may be exported (or not), thus making them available for setHyperPars. Not exporting a parameter has the advantage that it does not clutter the ParamSet of a big CPO or CPOLearner pipeline with many hyperparameters. Which hyperparameters are exported is chosen during the constructing call of a CPOConstructor, but the default exported hyperparameters can be chosen with the export.params parameter.
- Properties
Similarly to Learners, CPOs may specify what kind of data they are and are not able to handle. This is done by specifying properties.* arguments. The names of possible properties are the same as possible LearnerProperties, but since CPOs mostly concern themselves with data, only the properties indicating column and task types are relevant.
For each CPO one must specify
1. which kind of data the CPO handles,
2. which kind of data the CPO or Learner that comes after the given CPO must be able to handle, and
3. which kind of data handling capability the given CPO adds to a following CPO or Learner if coming before it in a pipeline.
The specification of (1) is done with properties.data and properties.target, (2) is specified using properties.needed, and (3) is specified using properties.adding. Internally, properties.data and properties.target are concatenated and treated as one vector; they are specified separately in makeCPO etc. for convenience. See CPOProperties for details.
The CPO framework checks the cpo.retrafo etc. functions for adherence to these properties, so it e.g. throws an error if a cpo.retrafo function adds missing values to some data but didn't declare “missings” in properties.needed. It may be desirable to have this internal checking happen to a laxer standard than the property checking when composing CPOs (e.g. when a CPO adds missings only with certain hyperparameters, one may still want to compose this CPO to another one that can't handle missings). Therefore it is possible to postfix listed properties with “.sometimes”. The internal CPO checking will ignore these when listed in properties.adding (it uses the ‘minimal’ set of adding properties, adding.min), and it will not declare them externally when listed in properties.needed (but keeps them internally in the ‘maximal’ set of needed properties, needed.max). The adding.min and needed.max can be retrieved using getCPOProperties with get.internal = TRUE.
- Data Format
Different CPOs may want to change different aspects of the data, e.g. they may only care about numeric columns, they may or may not care about the target column values, sometimes they might need the actual task used as input. The CPO framework offers to present the data in a specified format to the cpo.train, cpo.retrafo and other functions, to reduce the need for boilerplate data subsetting on the user's part. The format is requested using the dataformat and dataformat.factor.with.ordered parameters. A cpo.retrafo function is expected to return data in the same format as it requested, so if it requested a Task, it must return one, while if it only requested the feature data.frame, a data.frame must be returned.
- Task Conversion
Target Operation CPOs can be used for conversion between Tasks. For this, the task.type.out value must be given. Task conversion works with all values of dataformat and is handled by the CPO framework. The cpo.trafo function must take care to return the target data in a proper format (see section ‘Target Column Format’). Note that for conversion, not only does the Task type need to be changed during cpo.trafo, but also the prediction format (see section ‘Target Column Format’) needs to change.
- Fix Factors
Some preprocessing for factorial columns needs the factor levels to be the same during training and prediction. This is usually not guaranteed by mlr, so the framework offers to do this if the fix.factors flag is set.
- ID
To prevent parameter name clashes when CPOs are concatenated, the parameters are prefixed with the CPO's id. The ID can be set during CPO construction, but will default to the CPO's name if not given. The name is set using the cpo.name parameter.
- Packages
Whenever a CPO needs certain packages to be installed to work, it can specify these in the packages parameter. The framework will check for the availability of the packages and throw an error if not found during construction. This means that loading a CPO from a savefile will omit this check, but in most cases it is a sufficient measure to make the user aware of missing packages in time.
- Target Column Format
Different Task types have the target in different formats. They are listed here for reference. Target data is in this format when given to the target argument of some functions, and must be returned in this format by cpo.trafo in Target Operation CPOs. Target values are always in the format of a data.frame, even when only one column.
| Task type | target format |
| “classif” | one column of factor |
| “cluster” | data.frame with zero columns |
| “multilabel” | several columns of logical |
| “regr” | one column of numeric |
| “surv” | two columns of numeric |
When inverting, the format of the target argument, as well as of the return value, of the cpo.invert function depends on the Task type as well as the predict.type. The requested return value predict.type is given to the cpo.invert function as a parameter; the predict.type of the target parameter depends on this and the predict.type.map (see PredictType). The format of the prediction, depending on the task type and predict.type, is:
| Task type | predict.type | target format |
| “classif” | “response” | factor |
| “classif” | “prob” | matrix with nclass cols |
| “cluster” | “response” | integer cluster index |
| “cluster” | “prob” | matrix with nclustr cols |
| “multilabel” | “response” | logical matrix |
| “multilabel” | “prob” | matrix with nclass cols |
| “regr” | “response” | numeric |
| “regr” | “se” | 2-col matrix |
| “surv” | “response” | numeric |
| “surv” | “prob” | [NOT YET SUPPORTED] |
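The target formats listed above, together with the inversion mechanism, can be sketched with a Target Operation CPO that centers a regression target. This is an illustration under assumptions, not a definitive implementation: the CPO xmpRegrCenter is made up, and the parameter names of the cpo.* functions (data, target, control, control.invert, predict.type) follow one reading of the descriptions on this page, so they should be checked against the installed mlrCPO version. During training the target arrives as a one-column data.frame; during inversion the “response” prediction arrives as a numeric vector.

```r
library(mlrCPO)  # assumed to attach mlr as well

# Hypothetical Target Operation CPO: subtracts the training mean from the target.
xmpRegrCenter = makeCPOTargetOp("xmpRegrCenter",
  properties.target = "regr",
  constant.invert = TRUE,      # inversion needs no information about new data
  cpo.train = function(data, target) {
    mean(target[[1]])          # control object: mean of the training target
  },
  cpo.train.invert = NULL,     # not used when constant.invert is TRUE
  cpo.retrafo = function(data, target, control) {
    target[[1]] = target[[1]] - control
    target                     # a one-column data.frame, as for "regr" Tasks
  },
  cpo.invert = function(target, control.invert, predict.type) {
    target + control.invert    # undo the centering on the numeric prediction
  })

# Attached to a Learner, trafo and inversion are handled by the framework.
lrn = xmpRegrCenter() %>>% makeLearner("regr.lm")
pred = predict(train(lrn, bh.task), bh.task)

# A linear model with intercept is unaffected by centering the target, so the
# inverted predictions should agree with those of the plain Learner.
plain = predict(train(makeLearner("regr.lm"), bh.task), bh.task)
all.equal(pred$data$response, plain$data$response)
```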
Headless function definitions
In the place of all cpo.* arguments, it is possible to make a headless function definition, consisting only of the function body.
This function body must always begin with a ‘{’. For example, instead of
cpo.retrafo = function(data, control) data[-1], it is possible to use
cpo.retrafo = { data[-1] }. The necessary function head is then added automatically by the CPO framework.
This will always contain the necessary parameters (e.g. “data”, “target”, hyperparameters as defined in par.set)
under the names as required. This can declutter the definition of a CPOConstructor and is recommended if the CPO consists of
few lines.
Note that if this is used when writing an R package, inside a function, this may lead the automatic R correctness checker to print warnings.
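A headless definition might look as follows. The CPO xmpDropFirst is a made-up example; both cpo.train and cpo.retrafo are given as bare bodies, and the framework is assumed to add the heads with the parameters data and target, and data and control, respectively.

```r
library(mlrCPO)

# Hypothetical CPO that drops the first feature column seen during training.
xmpDropFirst = makeCPO("xmpDropFirst",
  dataformat = "df.features",
  cpo.train = { colnames(data)[-1] },  # control: names of the columns to keep
  cpo.retrafo = { data[control] })     # 'control' is supplied by the added head

res = iris %>>% xmpDropFirst()
colnames(res)
```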
See Also
Other CPOConstructor related:
CPOConstructor,
getCPOClass(),
getCPOConstructor(),
getCPOName(),
identicalCPO(),
print.CPOConstructor()
Other CPO lifecycle related:
CPOConstructor,
CPOLearner,
CPOTrained,
CPO,
NULLCPO,
%>>%(),
attachCPO(),
composeCPO(),
getCPOClass(),
getCPOConstructor(),
getCPOTrainedCPO(),
identicalCPO()
Examples
# an example constant feature remover CPO
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    names(Filter(function(x) {  # names of columns to keep
      length(unique(x)) > 1
    }, data))
  }, cpo.retrafo = function(data, control) {
    data[control]
  })
# alternatively:
constFeatRem = makeCPO("constFeatRem",
  dataformat = "df.features",
  cpo.train = function(data, target) {
    cols.keep = names(Filter(function(x) {
      length(unique(x)) > 1
    }, data))
    # the following function will do both the trafo and retrafo
    result = function(data) {
      data[cols.keep]
    }
    result
  }, cpo.retrafo = NULL)
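A further sketch, in the style of the examples above, for makeCPOExtendedTrafo: a PCA that computes the rotation and the transformed data in a single step. The name xmpPca is made up, and the sketch assumes that the control information is handed over through a local variable named control inside cpo.trafo.

```r
library(mlrCPO)

# Hypothetical PCA CPO (no centering or scaling, to keep the sketch short).
xmpPca = makeCPOExtendedTrafo("xmpPca",
  dataformat = "numeric",
  cpo.trafo = function(data, target) {
    pcr = prcomp(as.matrix(data), center = FALSE, scale. = FALSE)
    control = pcr$rotation   # local variable picked up by the framework
    as.data.frame(pcr$x)     # the transformed training data
  },
  cpo.retrafo = function(data, control) {
    # apply the stored rotation to new data
    as.data.frame(as.matrix(data) %*% control)
  })

trafo = iris %>>% xmpPca()
again = iris %>>% retrafo(trafo)

# The same rows must give the same result in training and in retrafo.
pcs = c("PC1", "PC2", "PC3", "PC4")
all.equal(unlist(trafo[pcs], use.names = FALSE),
  unlist(again[pcs], use.names = FALSE))
```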