adjOutlyingness {robustbase} | R Documentation |
Compute (Skewness-adjusted) Multivariate Outlyingness
Description
For an n \times p
data matrix (or data frame) x
,
compute the “outlyingness” of all n
observations.
Outlyingness here is a generalization of the Donoho-Stahel
outlyingness measure, where skewness is taken into account via the
medcouple, mc()
.
Usage
adjOutlyingness(x, ndir = 250, p.samp = p, clower = 4, cupper = 3,
IQRtype = 7,
alpha.cutoff = 0.75, coef = 1.5,
qr.tol = 1e-12, keep.tol = 1e-12,
only.outlyingness = FALSE, maxit.mult = max(100, p),
trace.lev = 0,
mcReflect = n <= 100, mcScale = TRUE, mcMaxit = 2*maxit.mult,
mcEps1 = 1e-12, mcEps2 = 1e-15,
mcTrace = max(0, trace.lev-1))
Arguments
x |
a numeric |
ndir |
positive integer specifying the number of directions that should be searched. |
p.samp |
the sample size to use for finding good random
directions, must be at least |
clower , cupper |
the constant to be used for the lower and upper
tails, in order to transform the data towards symmetry. You can set
|
IQRtype |
a number from |
alpha.cutoff |
number in (0,1) specifying the quantiles
|
coef |
positive number specifying the factor with which the
interquartile range ( |
qr.tol |
positive tolerance to be used for |
keep.tol |
positive tolerance to determine which of the sample
direction should be kept, namely only those for which
|
only.outlyingness |
logical indicating if the final outlier determination should be skipped. In that case, a vector is returned, see ‘Value:’ below. |
maxit.mult |
integer factor; |
trace.lev |
an integer, if positive allows to monitor the direction search. |
mcReflect |
passed as |
mcScale |
passed as |
mcMaxit |
passed as |
mcEps1 |
passed as |
mcEps2 |
passed as |
mcTrace |
passed as |
Details
FIXME: Details in the comment of the Matlab code; also in the reference(s).
The method as described can be useful as preprocessing in FASTICA (http://research.ics.aalto.fi/ica/fastica/ see also the R package fastICA.
Value
If only.outlyingness
is true, a vector adjout
,
otherwise, as by default, a list with components
adjout |
numeric of |
cutoff |
cutoff for “outlier” with respect to the adjusted
outlyingnesses, and depending on |
nonOut |
logical of |
Note
If there are too many degrees of freedom for the projections, i.e., when
n \le 4p
, the current definition of adjusted outlyingness
is ill-posed, as one of the projections may lead to a denominator
(quartile difference) of zero, and hence formally an adjusted
outlyingness of infinity.
The current implementation avoids Inf
results, but will return
seemingly random adjout
values of around 10^{14} -- 10^{15}
which may
be completely misleading, see, e.g., the longley
data example.
The result is random as it depends on the sample of
ndir
directions chosen; specifically, to get sub samples the algorithm uses
sample.int(n, p.samp)
which from R version 3.6.0 depends on
RNGkind(*, sample.kind)
. Exact reproducibility of results
from R versions 3.5.3 and earlier, requires setting
RNGversion("3.5.0")
.
In any case, do use set.seed()
yourself
for reproducibility!
Till Aug/Oct. 2014, the default values for clower
and cupper
were
accidentally reversed, and the signs inside exp(.)
where swapped
in the (now corrected) two expressions
tup <- Q3 + coef * IQR * exp(.... + clower * tmc * (tmc < 0)) tlo <- Q1 - coef * IQR * exp(.... - cupper * tmc * (tmc < 0))
already in the code from Antwerpen (‘mcrsoft/adjoutlingness.R’), contrary to the published reference.
Further, the original algorithm had not been scale-equivariant in the direction construction, which has been amended in 2014-10 as well.
The results, including diagnosed outliers, therefore have changed, typically slightly, since robustbase version 0.92-0.
Author(s)
Guy Brys; help page and improvements by Martin Maechler
References
Brys, G., Hubert, M., and Rousseeuw, P.J. (2005) A Robustification of Independent Component Analysis; Journal of Chemometrics, 19, 1–12.
Hubert, M., Van der Veeken, S. (2008) Outlier detection for skewed data; Journal of Chemometrics 22, 235–246; doi:10.1002/cem.1123.
For the up-to-date reference, please consult https://wis.kuleuven.be/statdatascience/robust
See Also
the adjusted boxplot, adjbox
and the medcouple,
mc
.
Examples
## An Example with bad condition number and "border case" outliers
dim(longley) # 16 x 7 // set seed, as result is random :
set.seed(31)
ao1 <- adjOutlyingness(longley, mcScale=FALSE)
## which are outlying ?
which(!ao1$nonOut) ## for this seed, two: "1956", "1957"; (often: none)
## For seeds 1:100, we observe (Linux 64b)
if(FALSE) {
adjO <- sapply(1:100, function(iSeed) {
set.seed(iSeed); adjOutlyingness(longley)$nonOut })
table(nrow(longley) - colSums(adjO))
}
## #{outl.}: 0 1 2 3
## #{cases}: 74 17 6 3
## An Example with outliers :
dim(hbk)
set.seed(1)
ao.hbk <- adjOutlyingness(hbk)
str(ao.hbk)
hist(ao.hbk $adjout)## really two groups
table(ao.hbk$nonOut)## 14 outliers, 61 non-outliers:
## outliers are :
which(! ao.hbk$nonOut) # 1 .. 14 --- but not for all random seeds!
## here, they are(*) the same as found by (much faster) MCD:
## *) only "almost", since the 2023-05 change to covMcd()
cc <- covMcd(hbk)
table(cc = cc$mcd.wt, ao = ao.hbk$nonOut)# one differ..:
stopifnot(sum(cc$mcd.wt != ao.hbk$nonOut) <= 1)
## This is revealing: About 1--2 cases, where outliers are *not* == 1:14
## (2023: ~ 1/8 [sec] per call)
if(interactive()) {
for(i in 1:30) {
print(system.time(ao.hbk <- adjOutlyingness(hbk)))
if(!identical(iout <- which(!ao.hbk$nonOut), 1:14)) {
cat("Outliers:\n"); print(iout)
}
}
}
## "Central" outlyingness: *not* calling mc() anymore, since 2014-12-11:
trace(mc)
out <- capture.output(
oo <- adjOutlyingness(hbk, clower=0, cupper=0)
)
untrace(mc)
stopifnot(length(out) == 0)
## A rank-deficient case
T <- tcrossprod(data.matrix(toxicity))
try(adjOutlyingness(T, maxit. = 20, trace.lev = 2)) # fails and recommends:
T. <- fullRank(T)
aT <- adjOutlyingness(T.)
plot(sort(aT$adjout, decreasing=TRUE), log="y")
plot(T.[,9:10], col = (1:2)[1 + (aT$adjout > 10000)])
## .. (not conclusive; directions are random, more 'ndir' makes a difference!)