boxB {univOutl} | R Documentation |
BoxPlot based outlier detection
Description
Identifies univariate outliers using boxplot-based methods.
Usage
boxB(x, k=1.5, method='asymmetric', weights=NULL, id=NULL,
exclude=NA, logt=FALSE)
Arguments
x |
Numeric vector that will be searched for outliers. |
k |
Nonnegative constant that determines the extension of the 'whiskers'. Commonly used values are 1.5 (default), 2, or 3.
Note that when |
method |
Character, identifies the method to be used: |
weights |
Optional numeric vector with units' weights associated to the observations in |
id |
Optional vector with identifiers of units in |
exclude |
Values of |
logt |
Logical, if |
Details
When method="resistant" the outlying observations are those outside the interval:
[Q_1 - k \times IQR;\quad Q_3 + k \times IQR]
where Q_1 and Q_3 are respectively the 1st and the 3rd quartile of x, while IQR = (Q_3 - Q_1) is the Inter-Quartile Range. The value k=1.5 (the so-called 'inner fences') is commonly used when drawing a boxplot. Values k=2 and k=3 provide middle and outer fences, respectively.
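As an illustration only, the resistant interval can be sketched in base R. The helper name resistant_fences is hypothetical and not part of univOutl, and the sketch assumes the default quantile definition of quantile(), which may differ from the one boxB uses internally.

```r
# Hypothetical helper (not part of univOutl): resistant fences
# [Q1 - k * IQR; Q3 + k * IQR] using base R's default quantile type.
resistant_fences <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                   # drop missing values
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]                                  # inter-quartile range
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
f <- resistant_fences(x, k = 1.5)
# positions of the values falling outside the fences
out_pos <- which(x < f[["lower"]] | x > f[["upper"]])
```

With these toy data the quartiles are 3.25 and 7.75, so the fences are -3.5 and 14.5 and only the last value lies outside them.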
When method="asymmetric" the outlying observations are those outside the interval:
[Q_1 - 2k \times (Q_2-Q_1);\quad Q_3 + 2k \times (Q_3-Q_2)]
where Q_2 is the median; this modification accounts for slight skewness of the distribution.
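A minimal base R sketch of the asymmetric interval follows. The helper asymmetric_fences is a hypothetical name, and the quartiles come from the default quantile(), which is an assumption about the quantile definition rather than a statement of what boxB does.

```r
# Hypothetical helper (not part of univOutl): asymmetric fences
# [Q1 - 2k * (Q2 - Q1); Q3 + 2k * (Q3 - Q2)].
asymmetric_fences <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                   # drop missing values
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(lower = q[1] - 2 * k * (q[2] - q[1]),
    upper = q[3] + 2 * k * (q[3] - q[2]))
}

# right-skewed toy data: the upper half is more spread out than the lower half
x <- c(1, 2, 3, 4, 5, 7, 10, 15, 25, 60)
f <- asymmetric_fences(x, k = 1.5)
out_pos <- which(x < f[["lower"]] | x > f[["upper"]])
```

When the median lies exactly midway between the quartiles, Q_2 - Q_1 = Q_3 - Q_2 = IQR/2 and the interval coincides with the resistant one.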
Finally, when method="adjbox" the outlying observations are identified using the method proposed by Hubert and Vandervieren (2008), which is based on the medcouple measure of skewness; in practice the bounds are:
[Q_1 - 1.5 \times e^{aM} \times IQR;\quad Q_3 + 1.5 \times e^{bM} \times IQR]
where M is the medcouple; when M > 0 (positive skewness) then a = -4 and b = 3; conversely, a = -3 and b = 4 for negative skewness (M < 0). According to Hubert and Vandervieren (2008), this adjustment of the boxplot works with moderate skewness (-0.6 \leq M \leq 0.6). The bounds of the adjusted boxplot are derived by applying the function adjboxStats in the package robustbase.
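To make the role of a, b and M concrete, the bounds can be sketched in base R with the medcouple supplied by the caller. The helper adjbox_fences and its arguments are hypothetical; it is not the robustbase implementation, and in practice M would be computed separately (for example with mc() from robustbase).

```r
# Hypothetical helper (not part of univOutl or robustbase): adjusted-boxplot
# fences for a medcouple value M supplied by the caller.
adjbox_fences <- function(x, M, coef = 1.5) {
  x <- x[!is.na(x)]                                   # drop missing values
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  # exponents depend on the sign of the skewness (both choices agree at M = 0)
  if (M >= 0) { a <- -4; b <- 3 } else { a <- -3; b <- 4 }
  c(lower = q[1] - coef * exp(a * M) * iqr,
    upper = q[2] + coef * exp(b * M) * iqr)
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
f_sym  <- adjbox_fences(x, M = 0)     # no skewness: standard 1.5 * IQR fences
f_skew <- adjbox_fences(x, M = 0.3)   # assumed positive medcouple, for illustration
```

With M = 0 both exponential factors equal 1, so the interval reduces to the standard one; a positive M pulls the lower fence towards Q_1 and pushes the upper fence away from Q_3.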
When weights are available (passed via the argument weights) they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile in the package Hmisc.
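The idea of weighted quartiles can be illustrated with a simplified base R sketch that inverts the weighted empirical distribution function. The helper wtd_quartiles_sketch is hypothetical and deliberately simpler than wtd.quantile in Hmisc, so its results may differ from those boxB reports.

```r
# Hypothetical helper (NOT the Hmisc algorithm): weighted quartiles taken as
# the smallest x whose cumulative weight share reaches each probability.
wtd_quartiles_sketch <- function(x, w, probs = c(0.25, 0.5, 0.75)) {
  keep <- !is.na(x) & !is.na(w)                       # drop incomplete pairs
  x <- x[keep]
  w <- w[keep]
  ord <- order(x)                                     # sort values, carry weights along
  x <- x[ord]
  w <- w[ord]
  cw <- cumsum(w) / sum(w)                            # weighted empirical CDF
  vapply(probs, function(p) x[which(cw >= p)[1]], numeric(1))
}

x <- c(10, 20, 30, 40)
q_equal  <- wtd_quartiles_sketch(x, w = c(1, 1, 1, 1))  # equal weights
q_uneven <- wtd_quartiles_sketch(x, w = c(1, 1, 1, 5))  # last unit weighs more
```

Giving the largest observation a heavier weight shifts the quartiles upwards, which in turn moves the fences.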
Remember that when a log transformation is requested (argument logt=TRUE), all the estimates (quartiles, etc.) refer to log(x+1).
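The effect of working on log(x+1) can be sketched in base R by computing the standard k = 1.5 fences on both scales. This is only an illustration with toy data; the variable names are hypothetical and nothing here reproduces boxB's internal code.

```r
# Toy data with one very large value
x <- c(0, 1, 3, 7, 15, 31, 63, 1000)

# fences [Q1 - 1.5 * IQR; Q3 + 1.5 * IQR] for a numeric vector v
fences_1.5 <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  c(lower = q[1] - 1.5 * (q[2] - q[1]), upper = q[2] + 1.5 * (q[2] - q[1]))
}

lx <- log(x + 1)                        # the scale the estimates refer to
f_log <- fences_1.5(lx)                 # fences on the log(x + 1) scale
f_raw <- fences_1.5(x)                  # fences on the original scale

flag_log <- which(lx < f_log[["lower"]] | lx > f_log[["upper"]])
flag_raw <- which(x < f_raw[["lower"]] | x > f_raw[["upper"]])

# fences on the log scale mapped back to the original scale, for reading only
f_back <- exp(f_log) - 1
```

In this toy case the value 1000 lies outside the fences on the original scale but inside them on the log(x+1) scale, which shows why fences obtained with a log transformation should not be compared directly with the untransformed data.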
Value
The output is a list containing the following components:
quartiles |
The quartiles of |
fences |
The bounds of the interval, values outside the interval are detected as outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
lowOutl |
The identifiers or positions (when |
upOutl |
The identifiers or positions (when |
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) ‘Variations of box plots’. The American Statistician, 32, pp. 12-16.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.
See Also
Examples
library(univOutl)

# simulate data and plant two extreme values
set.seed(321)
x <- rnorm(30, 50, 10)
x[10] <- 1
x[20] <- 100

# asymmetric fences
out <- boxB(x = x, k = 1.5, method = 'asymmetric')
out$fences
out$outliers
x[out$outliers]

# adjusted boxplot
out <- boxB(x = x, k = 1.5, method = 'adjbox')
out$fences
out$outliers
x[out$outliers]

# introduce a missing value and supply unit identifiers
x[24] <- NA
x.ids <- paste0('obs',1:30)
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids)
out$excluded
out$fences
out$outliers

# same analysis with unit weights
set.seed(111)
w <- round(runif(n = 30, min=1, max=10))
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids, weights = w)
out$excluded
out$fences
out$outliers