R: Data cleaning by winsorization

winsorize {robustHD}

R Documentation

Data cleaning by winsorization

Description

Clean data by means of winsorization, i.e., by shrinking outlying observations to the border of the main part of the data.

Usage

winsorize(x, ...)

## Default S3 method:
winsorize(
  x,
  standardized = FALSE,
  centerFun = median,
  scaleFun = mad,
  const = 2,
  return = c("data", "weights"),
  ...
)

## S3 method for class 'matrix'
winsorize(
  x,
  standardized = FALSE,
  centerFun = median,
  scaleFun = mad,
  const = 2,
  prob = 0.95,
  tol = .Machine$double.eps^0.5,
  return = c("data", "weights"),
  ...
)

## S3 method for class 'data.frame'
winsorize(x, ...)

Arguments

`x`	a numeric vector, matrix or data frame to be cleaned.
`...`	for the generic function, additional arguments to be passed down to methods. For the `"data.frame"` method, additional arguments to be passed down to the `"matrix"` method. For the other methods, additional arguments to be passed down to `robStandardize`.
`standardized`	a logical indicating whether the data are already robustly standardized.
`centerFun`	a function to compute a robust estimate for the center to be used for robust standardization (defaults to `median`). Ignored if `standardized` is `TRUE`.
`scaleFun`	a function to compute a robust estimate for the scale to be used for robust standardization (defaults to `mad`). Ignored if `standardized` is `TRUE`.
`const`	numeric; tuning constant to be used in univariate winsorization (defaults to 2).
`return`	character string; if `standardized` is `TRUE`, this specifies the type of return value. Possible values are `"data"` for returning the cleaned data, or `"weights"` for returning data cleaning weights.
`prob`	numeric; probability for the quantile of the chi-squared (χ²) distribution to be used in multivariate winsorization (defaults to 0.95).
`tol`	a small positive numeric value used to determine singularity issues in the computation of correlation estimates based on bivariate winsorization (see `corHuber`).

Details

The borders of the main part of the data are defined on the scale of the robustly standardized data. In the univariate case, the borders are given by ±const, so a symmetric distribution is assumed. In the multivariate case, a normal distribution is assumed and the data are shrunk towards the boundary of a tolerance ellipse with coverage probability prob. The boundary of this ellipse consists of all points whose squared Mahalanobis distance equals the quantile of the chi-squared (χ²) distribution given by prob.
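The two cases described above can be sketched in base R. This is an illustration under stated assumptions, not the package's actual code: the standardization uses `median` and `mad` directly, the degrees of freedom are taken to be the number of columns, and the ordinary sample correlation `cor()` stands in for a robust correlation estimate.

```r
## Univariate case: clamp robustly standardized values to [-const, const]
set.seed(1234)
x <- rnorm(10)
x[1] <- x[1] * 10                          # introduce an outlier
const <- 2
z <- (x - median(x)) / mad(x)              # robust standardization
z_clean <- pmin(pmax(z, -const), const)    # shrink to the borders +/- const
x_clean <- z_clean * mad(x) + median(x)    # back to the original scale

## Multivariate case: shrink points onto the tolerance ellipse
X <- cbind(rnorm(50), rnorm(50))
X[1, ] <- c(8, -8)                         # introduce a bivariate outlier
Z <- apply(X, 2, function(v) (v - median(v)) / mad(v))  # columnwise standardization
R <- cor(Z)                                # ASSUMPTION: stand-in for a robust correlation estimate
prob <- 0.95
cutoff <- qchisq(prob, df = ncol(Z))       # chi-squared quantile defining the ellipse
d2 <- mahalanobis(Z, center = rep(0, ncol(Z)), cov = R)  # squared Mahalanobis distances
w <- pmin(sqrt(cutoff / d2), 1)            # shrinkage factor; 1 for points inside the ellipse
Z_clean <- Z * w                           # scale each row by its factor
```

Points inside the ellipse are left unchanged, while points outside are moved along the ray through the origin until their squared distance equals the cutoff.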

Value

If standardized is TRUE and return is "weights", a set of data cleaning weights is returned. Multiplying each observation of the standardized data by the corresponding weight yields the cleaned standardized data.

Otherwise, an object of the same type as the original data x containing the cleaned data is returned.

Note

Data cleaning weights are only meaningful for standardized data. In the general case, the data must first be standardized; the data cleaning weights can then be computed and applied to the standardized data, after which the cleaned standardized data must be back-transformed to the original scale.
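The workflow in this note can be sketched for a numeric vector in base R. The weight formula `pmin(const / abs(z), 1)` is an assumption chosen so that multiplying by the weights reproduces clamping at ±const; the names `ctr`, `s`, `w`, and `z_clean` are illustrative only.

```r
set.seed(1234)
x <- rnorm(10)
x[1] <- x[1] * 10                 # introduce an outlier

## 1. robustly standardize the data
ctr <- median(x)
s <- mad(x)
z <- (x - ctr) / s

## 2. compute data cleaning weights on the standardized scale
const <- 2
w <- pmin(const / abs(z), 1)      # ASSUMED form: 1 inside the borders, < 1 outside

## 3. apply the weights to the standardized data
z_clean <- w * z

## 4. back-transform the cleaned data to the original scale
x_clean <- z_clean * s + ctr
```

Applying the weights to the unstandardized x instead would shrink values towards zero rather than towards the robust center, which is why the standardization step comes first.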

Author(s)

Andreas Alfons, based on code by Jafar A. Khan, Stefan Van Aelst and Ruben H. Zamar

References

Khan, J.A., Van Aelst, S. and Zamar, R.H. (2007) Robust linear model selection based on least angle regression. Journal of the American Statistical Association, 102(480), 1289–1299. doi:10.1198/016214507000000950

Examples

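## load the package
library("robustHD")

## generate data
set.seed(1234)     # for reproducibility
x <- rnorm(10)     # standard normal
x[1] <- x[1] * 10  # introduce outlier

## winsorize data
x             # original data, including the outlier
winsorize(x)  # cleaned data with the outlier shrunk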

[Package robustHD version 0.8.1 Index]