huberize {robustbase} | R Documentation |
Huberization – Bringing Outliers In
Description
Huberization (originally named after Peter Huber's M-estimation algorithm for
location) replaces outlying values in a sample x by their respective boundary:
when x_j < c_1 it is replaced by c_1, and when x_j > c_2 it is replaced by c_2.
Consequently, values inside the interval [c_1, c_2] remain unchanged.
Here, c_j = M \pm c \cdot s, i.e., c_1 = M - c \cdot s and c_2 = M + c \cdot s,
where s := s(x) is the robust scale estimate Qn(x) if that is positive, and by
default, M is the robust Huber estimate of location \mu (with tuning constant k).
In the degenerate case where Qn(x) == 0, trimmed means of abs(x - M) are tried
as scale estimate s, with decreasing trimming proportions specified by the
decreasing trim vector.
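The clamping step described above can be sketched in base R. This is only an illustration, not the robustbase implementation: `clamp_to_bounds` is a hypothetical helper, and M and s are supplied by hand here, whereas huberize() would obtain them from huberM() and Qn().

```r
# Sketch of the clamping step only.
# ASSUMPTION: `clamp_to_bounds` is a hypothetical helper, and M and s are
# given by hand instead of being estimated via huberM() and Qn().
clamp_to_bounds <- function(x, M, s, c = 1.5) {
  c1 <- M - c * s          # lower boundary c_1
  c2 <- M + c * s          # upper boundary c_2
  pmin(pmax(x, c1), c2)    # values inside [c_1, c_2] stay unchanged
}

x  <- c(-10, -1, 0, 0.5, 1, 12)
xh <- clamp_to_bounds(x, M = 0, s = 1, c = 1.5)
xh  # -1.5 -1.0  0.0  0.5  1.0  1.5
```

Only the two extreme values move, and they land exactly on the boundaries M - c*s and M + c*s.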
Usage
huberize(x, M = huberM(x, k = k)$mu, c = k,
         trim = (5:1)/16,
         k = 1.5,
         warn0 = getOption("verbose"), saveTrim = TRUE)
Arguments
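x | numeric vector which is to be huberized. |
M | a number; defaulting to huberM(x, k = k)$mu, the robust Huber estimate of location. |
c | a positive number, the tuning constant for huberization of the sample x. |
trim | a decreasing vector of trimming proportions, used for the trimmed means of abs(x - M) when Qn(x) is zero. |
k | used if M is not specified: the tuning constant passed to huberM() to compute the default M; also the default for c. |
warn0 | logical; if true, a warning is signalled when Qn(x) is zero and even the last trimming proportion in trim leads to s = 0. |
saveTrim | a logical; if true and Qn(x) was zero, the (last) trim proportion used is stored as attribute "trim" of the result. |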
Details
In regular cases, s = Qn(x) is positive and used to huberize values of x
outside [M - c*s, M + c*s].

In degenerate cases where Qn(x) == 0, we search for an s > 0 by trying the
trimmed mean s := mean(abs(x-M), trim = trim[j]) with less and less trimming
(as the trimming proportions trim[] must decrease). If even the last,
trim[length(trim)], leads to s = 0, a warning is printed when warn0 is true.
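The search for a positive scale in the degenerate case can be sketched in base R as follows. This is a minimal illustration of the logic described above, not the package's code: `fallback_scale` is a hypothetical helper, and M is passed in by hand.

```r
# Sketch of the fallback scale search for the degenerate case Qn(x) == 0.
# ASSUMPTION: `fallback_scale` is a hypothetical helper, not part of robustbase.
fallback_scale <- function(x, M, trim = (5:1)/16) {
  stopifnot(length(trim) >= 1, all(diff(trim) < 0))  # trim must decrease
  s <- 0
  used <- NA_real_
  for (tr in trim) {
    used <- tr
    s <- mean(abs(x - M), trim = tr)  # trimmed mean of absolute deviations
    if (s > 0) break                  # stop at the first positive scale
  }
  list(s = s, trim = used)
}

x   <- c(rep(0, 9), 10)   # nine tied values and one outlier
res <- fallback_scale(x, M = 0)
res$s     # 1
res$trim  # 0.0625
```

With nine of ten values tied, every trimmed mean except the last one (trim = 1/16) discards the single nonzero deviation, so a positive scale is only found at the smallest trimming proportion.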
Value
a numeric vector as x; in case Qn(x) was zero and saveTrim is true, also
containing the (last) trim proportion used (to compute the scale s) as
attribute "trim" (see attr(), attributes).
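Because the result may carry a "trim" attribute, comparisons against a plain vector can be affected. The following base R sketch shows how such an attribute can be read or dropped. The vector `y` is built by hand purely for illustration and is not actual huberize() output.

```r
# ASSUMPTION: `y` is constructed by hand to mimic a result that carries
# a "trim" attribute; it is not produced by huberize() here.
y <- structure(c(-0.075, 0, 0, 0.075), trim = 1/16)

tr      <- attr(y, "trim")   # the stored trimming proportion
y_plain <- as.vector(y)      # same values, attribute dropped

# Compare values only, ignoring the attribute:
same <- isTRUE(all.equal(y_plain, y, check.attributes = FALSE))
```

When the attribute is not wanted at all, calling huberize() with saveTrim = FALSE avoids storing it.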
Note
For the use in mc() and similar cases where mainly numerical stabilization is
necessary, a large c = 1e12 will lead to no huberization, i.e., all y == x
for y <- huberize(x, c) for typical non-degenerate samples.
Author(s)
Martin Maechler
See Also
huberM and mc, which is now stabilized by default via something like
huberize(*, c=1e11).
Examples
## For non-degenerate data and large c, nothing is huberized,
## as there are *no* really extreme outliers :
set.seed(101)
x <- rnorm(1000)
stopifnot(all.equal(x, huberize(x, c=100)))
## OTOH, the "extremes" are shrunken towards the boundaries for smaller c:
xh <- huberize(x, c = 2)
table(x != xh)
## 45 out of a 1000:
table(xh[x != xh]) # 26 on the left boundary -2.098 and 19 on the right = 2.081
## visualization:
stripchart(x); text(0,1, "x {original}", pos=3); yh <- 0.9
stripchart(xh, at = yh, add=TRUE, col=2)
text(0, yh, "huberize(x, c=2)", col=2, pos=1)
arrows( x[x != xh], 1,
       xh[x != xh], yh, length = 1/8, col = adjustcolor("pink", 1/2))