R: Identification of outliers using control charts

KRDetect.outliers.controlchart {envoutliers}

R Documentation

Identification of outliers using control charts

Description

Identification of outliers in environmental data using a two-step method based on kernel smoothing and control charts (Campulova et al., 2017). The outliers are identified as observations corresponding to segments of smoothing residuals that exceed control chart limits.

Usage

KRDetect.outliers.controlchart(x, perform.smoothing = TRUE,
  bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2,
  method = "range", group.size.x = 3, group.size.R = 3,
  group.size.s = 3, L.x = 3, L.R = 3, L.s = 3)

Arguments

`x`	data values. Supported data types: a numeric vector; a time series object `ts`; a time series object `xts`; a time series object `zoo`.
`perform.smoothing`	a logical value specifying if data smoothing is performed. If `TRUE` (default), data are smoothed.
`bandwidth.type`	a character string specifying the type of bandwidth. Possible options: `"local"` (default) to use a local bandwidth; `"global"` to use a global bandwidth.
`bandwidth.value`	a local bandwidth array (for `bandwidth.type = "local"`) or a global bandwidth value (for `bandwidth.type = "global"`) for kernel regression estimation. If `bandwidth.value = NULL` (default), a data-adaptive local plug-in bandwidth (Herrmann, 1997) (for `bandwidth.type = "local"`) or a data-adaptive global plug-in bandwidth (Gasser et al., 1991) (for `bandwidth.type = "global"`) is used instead.
`kernel.order`	a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options: `kernel.order = 2` (default); `kernel.order = 4`.
`method`	a character string specifying the preferred estimate of the standard deviation parameter. Possible options: `"range"` (default) for estimation based on sample ranges; `"sd"` for estimation based on sample standard deviations.
`group.size.x`	a positive integer giving the number of observations in individual segments used for computation of x chart control limits. If the data cannot be divided evenly into segments of this size, the first extra values are excluded from the analysis. Default is `group.size.x = 3`.
`group.size.R`	a positive integer giving the number of observations in individual segments used for computation of R chart control limits. If the data cannot be divided evenly into segments of this size, the first extra values are excluded from the analysis. Default is `group.size.R = 3`.
`group.size.s`	a positive integer giving the number of observations in individual segments used for computation of s chart control limits. If the data cannot be divided evenly into segments of this size, the first extra values are excluded from the analysis. Default is `group.size.s = 3`.
`L.x`	a positive numeric value giving parameter `L` specifying the width of x chart control limits. Default is `L.x = 3`.
`L.R`	a positive numeric value giving parameter `L` specifying the width of R chart control limits. Default is `L.R = 3`.
`L.s`	a positive numeric value giving parameter `L` specifying the width of s chart control limits. Default is `L.s = 3`.
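A call that overrides several of the defaults above might look like the following sketch. The synthetic series, the chosen argument values and the guard with `requireNamespace` are assumptions made for illustration; only the argument names and the `outlier` element come from this page.

```r
# Synthetic series (an assumption for illustration): a smooth signal plus
# noise, with a short shifted segment that should stand out.
set.seed(1)
tt <- seq(0, 4 * pi, length.out = 240)
x <- sin(tt) + rnorm(length(tt), sd = 0.2)
x[121:123] <- x[121:123] + 3

# Run only when the package is installed.
if (requireNamespace("envoutliers", quietly = TRUE)) {
  result <- envoutliers::KRDetect.outliers.controlchart(
    x,
    bandwidth.type = "global",  # global plug-in bandwidth instead of local
    method = "sd",              # estimate sigma from sample standard deviations
    group.size.x = 4,           # segments of 4 observations for the x chart
    group.size.s = 4,           # segments of 4 observations for the s chart
    L.x = 3.5,                  # wider x chart limits than the default 3
    L.s = 3.5                   # wider s chart limits than the default 3
  )
  print(which(result$outlier))  # positions flagged by at least one chart
}
```

Larger values of `L.x`, `L.R` and `L.s` widen the control limits, so fewer segments are flagged.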

Details

This function identifies outliers in environmental data using a two-step procedure (Campulova et al., 2017). The procedure consists of kernel smoothing followed by identification of observations that correspond to segments of smoothing residuals exceeding control chart limits. As a result, the method does not identify individual outliers but segments of observations in which outliers occur. The output consists of three logical vectors specifying the outliers identified by each of the three control charts, plus a logical vector specifying the outliers identified by at least one type of control limits. The choice of the parameters `L.x`, `L.R` and `L.s`, which specify the width of the control limits, is crucial: different values yield different criteria for outlier detection. For more information see Campulova et al. (2017).
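The segment-based flagging described above can be sketched in base R for the x chart alone. This is a minimal illustration under stated assumptions, not the package's internal code: the helper name `xchart_segments` is hypothetical, the limits are textbook Shewhart-type limits (centre plus or minus `L` times the estimated standard deviation of a segment mean), and sigma is estimated from segment standard deviations with the usual `c4` correction.

```r
# Illustrative x chart step on a vector of residuals (hypothetical helper).
xchart_segments <- function(res, group.size = 3, L = 3) {
  n <- group.size
  extra <- length(res) %% n              # values that do not fill a whole segment
  keep <- seq_along(res) > extra         # exclude the first `extra` values
  groups <- matrix(res[keep], nrow = n)  # one segment per column
  means <- colMeans(groups)
  sds <- apply(groups, 2, sd)
  # c4: unbiasing constant for the sample standard deviation (normal data)
  c4 <- sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
  sigma <- mean(sds) / c4
  center <- mean(means)
  LCL <- center - L * sigma / sqrt(n)
  UCL <- center + L * sigma / sqrt(n)
  seg.out <- means < LCL | means > UCL
  # Flag every observation of an out-of-limit segment; excluded values stay FALSE
  outlier <- logical(length(res))
  outlier[keep] <- rep(seg.out, each = n)
  list(LCL = LCL, UCL = UCL, outlier = outlier)
}

# 31 residuals: the first value is excluded, positions 14 to 16 form one segment
res <- c(0, rep(c(-1, 0, 1), 10))
res[14:16] <- res[14:16] + 10
out <- xchart_segments(res)
which(out$outlier)  # 14 15 16
```

All three observations of the shifted segment are flagged together, which is why the method reports segments rather than individual points.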

Value

A "KRDetect" object which contains a list with elements:

`method.type`	a character string giving the type of method used for outlier identification
`x`	a numeric vector of observations
`index`	a numeric vector of index design points assigned to individual observations
`smoothed`	a numeric vector of estimates of the kernel regression function (smoothed data)
`outlier.x`	a logical vector specifying the outliers identified based on the limits of control chart x. `TRUE` means that the corresponding observation in vector `x` is detected as an outlier
`outlier.R`	a logical vector specifying the outliers identified based on the limits of control chart R. `TRUE` means that the corresponding observation in vector `x` is detected as an outlier
`outlier.s`	a logical vector specifying the outliers identified based on the limits of control chart s. `TRUE` means that the corresponding observation in vector `x` is detected as an outlier
`outlier`	a logical vector specifying the outliers identified based on at least one type of control limits. `TRUE` means that the corresponding observation in vector `x` is detected as an outlier
`LCL.x`	a numeric value giving lower control limit of control chart x
`UCL.x`	a numeric value giving upper control limit of control chart x
`LCL.s`	a numeric value giving lower control limit of control chart s
`UCL.s`	a numeric value giving upper control limit of control chart s
`LCL.R`	a numeric value giving lower control limit of control chart R
`UCL.R`	a numeric value giving upper control limit of control chart R
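How the four logical vectors relate can be shown without the package. The list below is a hand-built stand-in that borrows only the documented element names; its values are invented for the illustration.

```r
# Hypothetical stand-in for a result; only the element names come from this page.
res <- list(
  outlier.x = rep(c(FALSE, TRUE,  FALSE), each = 3),
  outlier.R = rep(c(FALSE, FALSE, TRUE),  each = 3),
  outlier.s = rep(c(FALSE, FALSE, TRUE),  each = 3)
)

# An observation counts as an outlier if at least one chart flags it,
# as described for the `outlier` element.
res$outlier <- res$outlier.x | res$outlier.R | res$outlier.s
which(res$outlier)

# Show which chart(s) flagged each of those positions.
flags <- cbind(x = res$outlier.x, R = res$outlier.R, s = res$outlier.s)
flags[res$outlier, , drop = FALSE]
```

On a real result, the same indexing (`which(result$outlier)`) gives the positions in `x` that fall in flagged segments.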

References

Campulova M, Veselik P, Michalek J (2017). Control chart and Six sigma based algorithms for identification of outliers in experimental data, with an application to particulate matter PM10. Atmospheric Pollution Research. doi:10.1016/j.apr.2017.01.004.

Shewhart W (1931). Quality control chart. Bell System Technical Journal, 5, 593–603.

SAS/QC User's Guide, Version 8, 1999. SAS Institute, Cary, N.C.

Wild C, Seber G (2000). Chance encounters: A first course in data analysis and inference. New York: John Wiley.

Joglekar AM. Statistical methods for six sigma: in R&D and manufacturing. Hoboken, NJ: Wiley-Interscience. ISBN 0-471-20342-4.

Gasser T, Kneip A, Kohler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643–652.

Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6(1), 35–54.

Eva Herrmann; Packaged for R and enhanced by Martin Maechler (2016). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-8. https://CRAN.R-project.org/package=lokern

Examples

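# Example data set shipped with the openair package
data("mydata", package = "openair")
# Ozone concentrations observed in December 2002
x <- mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
# Detect outliers with the default settings
result <- KRDetect.outliers.controlchart(x)
summary(result)
# Plot the results: default view, then the x, R and s control charts
plot(result)
plot(result, plot.type = "x")
plot(result, plot.type = "R")
plot(result, plot.type = "s")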

[Package envoutliers version 1.1.0 Index]