| boxB {univOutl} | R Documentation |
BoxPlot based outlier detection
Description
Identifies univariate outliers using methods based on boxplots.
Usage
boxB(x, k=1.5, method='asymmetric', weights=NULL, id=NULL,
exclude=NA, logt=FALSE)
Arguments
x |
Numeric vector that will be searched for outliers. |
k |
Nonnegative constant that determines the extension of the 'whiskers'. Commonly used values are 1.5 (default), 2, or 3.
Note that when |
method |
Character, identifies the method to be used: |
weights |
Optional numeric vector with units' weights associated to the observations in |
id |
Optional vector with identifiers of units in |
exclude |
Values of |
logt |
Logical, if |
Details
When method="resistant" the outlying observations are those outside the interval:
[Q_1 - k \times IQR;\quad Q_3 + k \times IQR]
where Q_1 and Q_3 are respectively the 1st and the 3rd quartile of x, while IQR=(Q_3 - Q_1) is the Inter-Quartile Range. The value k=1.5 (the so-called 'inner fences') is commonly used when drawing a boxplot. Values k=2 and k=3 provide middle and outer fences, respectively.
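The interval above can be illustrated with a minimal base-R sketch. The helper name resistant_fences is hypothetical, and quantile() is used with its default settings, which is an assumption: the quartiles computed inside boxB may be obtained differently.

```r
# Illustrative sketch only; not the package's internal code.
# resistant_fences() is a hypothetical helper name.
resistant_fences <- function(x, k = 1.5) {
  # 1st and 3rd quartile (default quantile type, an assumption)
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

x <- c(1, 42, 45, 47, 50, 52, 55, 58, 100)
f <- resistant_fences(x, k = 1.5)
print(f)

# Positions of the observations falling outside the interval
flagged <- which(x < f[["lower"]] | x > f[["upper"]])
print(flagged)
```

Increasing k (for example to 2 or 3) widens the interval, so fewer observations are flagged.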
When method="asymmetric" the outlying observations are those outside the interval:
[Q_1 - 2k \times (Q_2-Q_1);\quad Q_3 + 2k \times (Q_3-Q_2)]
where Q_2 is the median; this modification accounts for slight skewness of the distribution.
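A minimal base-R sketch of the asymmetric interval follows. The helper name asymmetric_fences is hypothetical and the default quantile() type is an assumption. The sketch also shows that, when the median lies midway between Q_1 and Q_3, the bounds coincide with the resistant ones, since 2k(Q_2 - Q_1) = k IQR in that case.

```r
# Illustrative sketch only; not the package's internal code.
# asymmetric_fences() is a hypothetical helper name.
asymmetric_fences <- function(x, k = 1.5) {
  # Q1, median and Q3 (default quantile type, an assumption)
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(lower = q[1] - 2 * k * (q[2] - q[1]),
    upper = q[3] + 2 * k * (q[3] - q[2]))
}

# Right-skewed data: the upper bound stretches further than the lower one
x_skew <- c(1, 2, 3, 4, 5, 7, 10, 14, 30)
f_asym <- asymmetric_fences(x_skew)
print(f_asym)

# Symmetric quartiles: same bounds as Q1 - k*IQR and Q3 + k*IQR
x_sym <- 1:9
f_sym <- asymmetric_fences(x_sym)
print(f_sym)
```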
Finally, when method="adjbox" the outlying observations are identified using the method proposed by Hubert and Vandervieren (2008), which is based on the medcouple measure of skewness; in practice the bounds are:
[Q_1-1.5 \times e^{aM} \times IQR;\quad Q_3+1.5 \times e^{bM}\times IQR ]
where M is the medcouple. When M > 0 (positive skewness), a = -4 and b = 3; conversely, a = -3 and b = 4 for negative skewness (M < 0). According to Hubert and Vandervieren (2008), this adjustment of the boxplot works with moderate skewness (-0.6 \leq M \leq 0.6). The bounds of the adjusted boxplot are derived by applying the function adjboxStats in the package robustbase.
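The effect of the exponential factors can be seen in a small sketch that takes the quartiles and the medcouple as given inputs. The helper name adjbox_fences is hypothetical, and the case M = 0 is grouped with M > 0 here as an assumption (both choices of a and b give the same bounds at M = 0). In practice the medcouple could be obtained with the function mc in the package robustbase; this sketch does not compute it.

```r
# Illustrative sketch only; boxB relies on robustbase::adjboxStats instead.
# adjbox_fences() is a hypothetical helper; M is supplied by the caller.
adjbox_fences <- function(q1, q3, M) {
  iqr <- q3 - q1
  if (M >= 0) {        # positive skewness (M = 0 grouped here by assumption)
    a <- -4; b <- 3
  } else {             # negative skewness
    a <- -3; b <- 4
  }
  c(lower = q1 - 1.5 * exp(a * M) * iqr,
    upper = q3 + 1.5 * exp(b * M) * iqr)
}

f0    <- adjbox_fences(q1 = 45, q3 = 55, M = 0)     # no skewness
f_pos <- adjbox_fences(q1 = 45, q3 = 55, M = 0.3)   # positive skewness
f_neg <- adjbox_fences(q1 = 45, q3 = 55, M = -0.3)  # negative skewness
print(rbind(f0, f_pos, f_neg))
```

With M = 0 the bounds reduce to the usual ones with k = 1.5; a positive M shortens the lower whisker and lengthens the upper one, and a negative M does the opposite.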
When weights are available (passed via the argument weights) then they are used in the computation of the quartiles. In particular, the quartiles are derived using the function wtd.quantile in the package Hmisc.
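To give a rough idea of how weights can shift the quartiles, the sketch below treats integer weights as replication counts and uses base R only. This is an assumption made for illustration: it is not claimed to reproduce the estimator used by wtd.quantile, and it does not apply to non-integer weights.

```r
# Illustrative sketch only: integer weights treated as replication counts.
# This is not claimed to match Hmisc::wtd.quantile.
x <- c(10, 20, 30, 40)
w <- c(1, 1, 1, 5)   # the largest value represents five units

probs <- c(0.25, 0.5, 0.75)
q_unw <- quantile(x, probs = probs, names = FALSE)                 # unweighted
q_w   <- quantile(rep(x, times = w), probs = probs, names = FALSE) # replicated

print(rbind(unweighted = q_unw, weighted = q_w))
```

Because the heavily weighted unit sits at the top of the distribution, all three quartiles move upwards, and the fences derived from them move accordingly.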
Note that when a log transformation is requested (argument logt=TRUE), all the estimates (quartiles, etc.) refer to log(x+1).
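The following base-R sketch shows what working on the log(x+1) scale implies. It uses the resistant interval for simplicity and the default quantile() type, both of which are assumptions. It also assumes that bounds computed on the log scale can be mapped back to the original scale with exp(.) - 1; whether boxB itself performs this back-transformation is not stated here.

```r
# Illustrative sketch only; not the package's internal code.
x  <- c(0, 3, 5, 8, 12, 20, 400)
lx <- log(x + 1)                      # the scale the estimates refer to

k <- 1.5
q <- quantile(lx, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
fences_log <- c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)

# Back-transformation to the original scale (an assumption for illustration)
fences_orig <- exp(fences_log) - 1

print(fences_log)
print(fences_orig)

# Observations above the upper bound, judged on the log scale
flagged_log <- which(lx > fences_log[["upper"]])
print(flagged_log)
```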
Value
The output is a list containing the following components:
quartiles |
The quartiles of |
fences |
The bounds of the interval, values outside the interval are detected as outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
lowOutl |
The identifiers or positions (when |
upOutl |
The identifiers or positions (when |
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
McGill, R., Tukey, J. W. and Larsen, W. A. (1978) ‘Variations of box plots’. The American Statistician, 32, pp. 12-16.
Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.
See Also
Examples
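# Simulated data with two extreme values planted at positions 10 and 20
set.seed(321)
x <- rnorm(30, 50, 10)
x[10] <- 1
x[20] <- 100

# Asymmetric boxplot fences
out <- boxB(x = x, k = 1.5, method = 'asymmetric')
out$fences
out$outliers
x[out$outliers]

# Adjusted boxplot (Hubert and Vandervieren, 2008)
out <- boxB(x = x, k = 1.5, method = 'adjbox')
out$fences
out$outliers
x[out$outliers]

# Introduce a missing value (see the exclude argument) and unit identifiers
x[24] <- NA
x.ids <- paste0('obs',1:30)
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids)
out$excluded
out$fences
out$outliers

# Same call, with randomly generated weights
set.seed(111)
w <- round(runif(n = 30, min=1, max=10))
out <- boxB(x = x, k = 1.5, method = 'adjbox', id = x.ids, weights = w)
out$excluded
out$fences
out$outliers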