R: Identification of outliers using changepoint analysis

KRDetect.outliers.changepoint {envoutliers}

R Documentation

Identification of outliers using changepoint analysis

Description

Identification of outliers in environmental data using a method based on kernel smoothing, changepoint analysis of the smoothing residuals, and subsequent analysis of the residuals on homogeneous segments (Campulova et al., 2018).

Usage

KRDetect.outliers.changepoint(x, perform.smoothing = TRUE,
  perform.cp.analysis = TRUE, bandwidth.type = "local",
  bandwidth.value = NULL, kernel.order = 2,
  cp.analysis.type = "parametric", pen.value = "5*log(n)",
  alpha.edivisive = 0.3, min.segment.length = 30,
  segment.length.for.merge = 15, method = "auto",
  prefer.grubbs = TRUE, alpha.default = NULL, L.default = NULL)

Arguments

`x`	data values. Supported data types: a numeric vector; a time series object `ts`; a time series object `xts`; a time series object `zoo`.
`perform.smoothing`	a logical value specifying if data smoothing is performed. If `TRUE` (default), data are smoothed.
`perform.cp.analysis`	a logical value specifying if changepoint analysis is performed. If `TRUE` (default), smoothing residuals are partitioned into homogeneous segments.
`bandwidth.type`	a character string specifying the type of bandwidth. Possible options are: `"local"` (default) to use a local bandwidth; `"global"` to use a global bandwidth.
`bandwidth.value`	a local bandwidth array (for `bandwidth.type = "local"`) or a global bandwidth value (for `bandwidth.type = "global"`) for kernel regression estimation. If `bandwidth.value = NULL` (default), a data-adaptive local plug-in bandwidth (Herrmann, 1997) (for `bandwidth.type = "local"`) or a data-adaptive global plug-in bandwidth (Gasser et al., 1991) (for `bandwidth.type = "global"`) is used instead.
`kernel.order`	a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are: `kernel.order = 2` (default); `kernel.order = 4`.
`cp.analysis.type`	a character string specifying the type of changepoint analysis. Possible options are: `"parametric"` (default) to perform changepoint analysis using the PELT algorithm (Killick et al., 2012); `"nonparametric"` to use a nonparametric approach for multiple changepoints (Matteson and James, 2014).
`pen.value`	a character string giving the formula for the manual penalty used in the PELT algorithm. Only required for `cp.analysis.type = "parametric"`. Default is `pen.value = "5*log(n)"`.
`alpha.edivisive`	a numeric value giving the moment index used for determining the distance between and within segments in the nonparametric changepoint model. Only used for `cp.analysis.type = "nonparametric"`. Default is `alpha.edivisive = 0.3`.
`min.segment.length`	a numeric value giving the minimal required number of observations on segments obtained from the changepoint analysis. If a segment contains fewer than `min.segment.length` observations and the variances of the data on the segment and on the previous one can be assumed equal (based on Levene's test (Fox, 2016) for homogeneity of variances), the segment is merged with the previous one. Analogously, the first segment can be merged with the second one. Default is `min.segment.length = 30`.
`segment.length.for.merge`	a numeric value giving the minimal required number of observations on a segment for performing the homogeneity test within the changepoint split control. A segment with fewer observations than `segment.length.for.merge` is merged with the previous one without testing the homogeneity of variances (the first segment is merged with the second one). Default is `segment.length.for.merge = 15`.
`method`	a character string specifying the method for identification of outlier residuals. Possible options are: `"auto"` (default) for automatic selection based on the structure of the residuals; `"grubbs.test"` for the Grubbs test; `"normal.distribution"` for quantiles of the normal distribution; `"chebyshev.inequality"` for the Chebyshev inequality.
`prefer.grubbs`	a logical value specifying whether the Grubbs test for identification of outlier residuals is preferred to quantiles of the normal distribution. `TRUE` (default) means that the Grubbs test is preferred. Only required for `method = "auto"`.
`alpha.default`	a numeric value from interval (0,1) of alpha parameter determining the criterion for (residual) outlier detection: the limits for outlier residuals on individual segments are set as `+/- (alpha/2-quantile of normal distribution with parameters corresponding to residuals on studied segment) * (sample standard deviation of residuals on corresponding segment)`. If `alpha.default = NULL` (default), its value on individual segments is estimated using Modified Algorithm A1 (Campulova et al., 2018).
`L.default`	a numeric value of L parameter determining the criterion for outlier (residual) detection: the limits for outlier residuals on individual segments are set as `+/- L * sample standard deviation of residuals on corresponding segment`. If `L.default = NULL` (default), its value on individual segments is estimated using Algorithm A1 (Campulova et al., 2018).
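The limits described for `alpha.default` and `L.default` can be illustrated with a small base R sketch. It is not the package's implementation: the residuals are simulated, the values of `L` and `alpha` are arbitrary choices, and the `alpha` criterion is read here as `+/- z * s`, where `z` is the upper `alpha/2` quantile of the standard normal distribution and `s` is the sample standard deviation of the residuals on the segment. That reading is an assumption.

```r
# Simulated residuals for one homogeneous segment (illustrative data only)
set.seed(1)
res <- c(rnorm(50, mean = 0, sd = 2), 15)  # the last value is an injected outlier

s <- sd(res)  # sample standard deviation of the residuals on the segment

# Limits in the style of L.default: +/- L * s
L <- 4        # arbitrary value chosen for illustration
limits.L <- c(-L * s, L * s)

# Limits in the style of alpha.default, assuming +/- z * s with z the
# upper alpha/2 quantile of the standard normal distribution
alpha <- 0.001  # arbitrary value chosen for illustration
z <- qnorm(1 - alpha / 2)
limits.alpha <- c(-z * s, z * s)

# Residuals outside the limits are flagged
outlier.L <- abs(res) > L * s
outlier.alpha <- abs(res) > z * s
which(outlier.L)
which(outlier.alpha)
```

By the Chebyshev inequality, at most a fraction `1/L^2` of any distribution with finite variance lies further than `L` standard deviations from its mean, which is why the `L` criterion needs no normality assumption, whereas the quantile-based limits do.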

Details

This function identifies outliers in time series using a procedure based on kernel smoothing, changepoint analysis of the smoothing residuals, and subsequent analysis of the residuals on homogeneous segments (Campulova et al., 2018). Outlier residuals can be detected by one of three approaches (Grubbs test, quantiles of the normal distribution, Chebyshev inequality), which is either selected automatically based on the structure of the data or specified by the user. The choice of the parameters alpha (for the quantiles of the normal distribution approach) and L (for the Chebyshev inequality approach) is crucial, because these parameters define the criterion for outlier detection. Their values can be specified by the user or estimated automatically using data-driven algorithms (Campulova et al., 2018).
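The three stages of the procedure can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the algorithm used by the package: `stats::ksmooth` with a fixed bandwidth stands in for the plug-in kernel regression estimators, the segment boundary is taken as known from the simulation instead of being estimated by changepoint analysis, and `L` is an arbitrary fixed value.

```r
set.seed(42)
n <- 200
idx <- seq_len(n)

# Simulated series: smooth trend plus noise whose variance changes at idx = 100
noise.sd <- ifelse(idx <= 100, 1, 3)
x <- 10 + 5 * sin(2 * pi * idx / n) + rnorm(n, sd = noise.sd)
x[50] <- x[50] + 15  # injected outlier in the low-variance segment

# Step 1: kernel smoothing (ksmooth used as a stand-in) and residuals
fit <- ksmooth(idx, x, kernel = "normal", bandwidth = 20, x.points = idx)
res <- x - fit$y

# Step 2: homogeneous segments of residuals (known here by construction)
segment <- ifelse(idx <= 100, 1L, 2L)

# Step 3: per-segment limits +/- L * sd(residuals on the segment)
L <- 4  # arbitrary value chosen for illustration
seg.sd <- ave(res, segment, FUN = sd)
outlier <- abs(res) > L * seg.sd
which(outlier)
```

Computing the limits separately on each segment is the point of the changepoint step: a single global standard deviation would be inflated by the high-variance segment and could hide outliers in the low-variance one.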

Value

A "KRDetect" object which contains a list with elements:

`method.type`	a character string giving the type of method used for outlier identification
`x`	a numeric vector of observations
`index`	a numeric vector of index design points assigned to individual observations
`smoothed`	a numeric vector of estimates of the kernel regression function (smoothed data)
`changepoints`	an integer membership vector for individual segments
`normality.results`	a data.frame of normality results of residuals on individual segments
`detection.method`	a character string giving the type of method used for identification of outlier residuals
`alpha`	a numeric vector of alpha parameters used for outlier identification on individual segments
`L`	a numeric vector of L parameters used for outlier identification on individual segments
`outlier`	a logical vector specifying the identified outliers; `TRUE` means that the corresponding observation from vector `x` is detected as an outlier
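A minimal sketch of how the listed elements might be used once a result is available. The list below is a hand-made stand-in with made-up values, not output of the function; it only borrows the element names documented above and assumes they can be accessed with `$` as for an ordinary list.

```r
# Hypothetical stand-in for a returned object (values are made up)
result <- list(
  x = c(21, 23, 22, 60, 24, 20),
  index = 1:6,
  changepoints = c(1L, 1L, 1L, 2L, 2L, 2L),
  outlier = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Positions and values of the observations flagged as outliers
out.idx <- result$index[result$outlier]
out.val <- result$x[result$outlier]

# Series with outliers masked as NA, keeping the original length
x.clean <- replace(result$x, result$outlier, NA)

# Number of flagged observations per segment
table(segment = result$changepoints, outlier = result$outlier)
```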

References

Campulova M, Michalek J, Mikuska P, Bokal D (2018). Nonparametric algorithm for identification of outliers in environmental data. Journal of Chemometrics, 32, 453-463.

Gasser T, Kneip A, Kohler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643–652.

Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6(1), 35–54.

Eva Herrmann; Packaged for R and enhanced by Martin Maechler (2016). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-8. https://CRAN.R-project.org/package=lokern.

Killick R, Fearnhead P, Eckley IA (2012). Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107(500), 1590–1598.

Killick R, Haynes K, Eckley IA (2016). changepoint: An R package for changepoint analysis. R package version 2.2.2. https://CRAN.R-project.org/package=changepoint.

Matteson D, James N (2014). A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data. Journal of the American Statistical Association, 109(505), 334–345.

James NA, Matteson DS (2014). ecp: An R Package for Nonparametric Multiple Change Point Analysis of Multivariate Data. Journal of Statistical Software, 62(7), 1-25. URL http://www.jstatsoft.org/v62/i07/.

Brys G, Hubert M, Struyf A (2008). Goodness-of-fit tests based on a robust measure of skewness. Computational Statistics, 23(3), 429–442.

Todorov V, Filzmoser P (2009). An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software, 32(3), 1-47. URL http://www.jstatsoft.org/v32/i03/.

Box G, Cox D (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B, 26, 211–234.

Venables WN, Ripley BD (2002). Modern Applied Statistics with S. New York, fourth edition. ISBN 0-387-95457-0, URL http://www.stats.ox.ac.uk/pub/MASS4.

Grubbs F (1950). Sample criteria for testing outlying observations. The Annals of Mathematical Statistics, 21(1), 27-58.

Fox J (2016). Applied regression analysis and generalized linear models. 3rd edition. Los Angeles: SAGE. ISBN 9781452205663.

Examples

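# Example air quality data from the openair package
data("mydata", package = "openair")
# Ozone concentrations observed in December 2002
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
# Identify outliers using the default settings
result = KRDetect.outliers.changepoint(x)
summary(result)
# Plot the results; the second call sets show.segments = FALSE
plot(result)
plot(result, show.segments = FALSE)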

[Package envoutliers version 1.1.0 Index]