transfo {cellWise}R Documentation

Robustly fit the Box-Cox or Yeo-Johnson transformation

Description

This function uses reweighted maximum likelihood to robustly fit the Box-Cox or Yeo-Johnson transformation to each variable in a dataset. Note that this function first calls checkDataSet to ensure that the variables to be transformed are not too discrete.

Usage

transfo(X, type = "YJ", robust = TRUE,
        standardize = TRUE,
        quant = 0.99, nbsteps = 2, checkPars = list())

Arguments

X

A data matrix of dimensions n x d. Its columns are the variables to be transformed.

type

The type of transformation to be fit. Should be one of:

  • "BC": Box-Cox power transformation. Only works for strictly positive variables. If this type is given but a variable is not strictly positive, the function stops with a message about that variable.

  • "YJ" Yeo-Johnson power transformation. The data may have positive as well as negative values.

  • "bestObj" for strictly positive variables both BC and YJ are run, and the solution with lowest objective is kept. On the other variables YJ is run.

robust

if TRUE the Reweighted Maximum Likelihood method is used, which first computes a robust initial estimate of the transformation parameter lambda. If FALSE the classical ML method is used.

standardize

whether to standardize the variables before and after the power transformation. See Details below.

quant

quantile for determining the weights in the reweighting step (ignored when robust=FALSE).

nbsteps

number of reweighting steps (ignored when robust=FALSE).

checkPars

Optional list of parameters used in the call to checkDataSet. The options are:

  • coreOnly
    If TRUE, skip the execution of checkDataset. Defaults to FALSE

  • numDiscrete
    A column that takes on numDiscrete or fewer values will be considered discrete and not retained in the cleaned data. Defaults to 5.

  • precScale
    Only consider columns whose scale is larger than precScale. Here scale is measured by the median absolute deviation. Defaults to 1e-12.

  • silent
    Whether or not the function progress messages should be printed. Defaults to FALSE.

Details

In case standardize = TRUE, the variables is standardized before and after transformation. For BC the variable is divided by its median before transformation. For YJ and robust = TRUE this subtracts its median and divides by its mad (median absolute deviation) before transformation. For YJ and robust = FALSE this subtracts the mean and divides by the standard deviation before transformation. For the standardization after the transformation, the classical mean and standard deviation are used in case robust = FALSE. If robust = TRUE, the mean and standard deviation are calculated robustly on a subset of inliers.

Value

A list with components:

Author(s)

J. Raymaekers and P.J. Rousseeuw

References

J. Raymaekers and P.J. Rousseeuw (2021). Transforming variables to central normality. Machine Learning. doi:10.1007/s10994-021-05960-5(link to open access pdf)

See Also

transfo_newdata, transfo_transformback

Examples


# find Box-Cox transformation parameter for lognormal data:
set.seed(123)
x <- exp(rnorm(1000))
transfo.out <- transfo(x, type = "BC")
# estimated parameter:
transfo.out$lambdahat
# value of the objective function:
transfo.out$objective
# the transformed variable:
transfo.out$Y
# the type of transformation used:
transfo.out$ttypes
# qqplot of the transformed variable:
qqnorm(transfo.out$Y); abline(0,1)

# For more examples, we refer to the vignette:
## Not run: 
vignette("transfo_examples")

## End(Not run)

[Package cellWise version 2.5.3 Index]