R: Identification of outliers using extreme value theory

KRDetect.outliers.EV {envoutliers}

R Documentation

Identification of outliers using extreme value theory

Description

Identification of outliers in environmental data using a semiparametric method based on kernel smoothing and extreme value theory (Holesovsky et al., 2018). Outliers are identified as observations whose values are exceeded on average once in a given period, which is specified by the user.

Usage

KRDetect.outliers.EV(x, perform.smoothing = TRUE,
  bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2,
  gpd.fit.method = "mle", threshold.min = NULL, threshold.max = NULL,
  k.min = round(length(na.omit(x)) * 0.1),
  k.max = round(length(na.omit(x)) * 0.1), extremal.index.min = NULL,
  extremal.index.max = NULL, extremal.index.type = "block.maxima",
  block.length.min = round(sqrt(length(na.omit(x)))),
  block.length.max = round(sqrt(length(na.omit(x)))), D.min = NULL,
  D.max = NULL, K.min = NULL, K.max = NULL, r.min = NULL,
  r.max = NULL, return.period = 120)

Arguments

`x`	data values. Supported data types are: a numeric vector, a time series object `ts`, a time series object `xts`, or a time series object `zoo`.
`perform.smoothing`	a logical value specifying if data smoothing is performed. If `TRUE` (default), data are smoothed.
`bandwidth.type`	a character string specifying the type of bandwidth. Possible options are `"local"` (default) to use a local bandwidth, or `"global"` to use a global bandwidth.
`bandwidth.value`	a local bandwidth array (for `bandwidth.type = "local"`) or a global bandwidth value (for `bandwidth.type = "global"`) for kernel regression estimation. If `bandwidth.value = NULL` (default), a data-adaptive local plug-in bandwidth (Herrmann, 1997) (for `bandwidth.type = "local"`) or a data-adaptive global plug-in bandwidth (Gasser et al., 1991) (for `bandwidth.type = "global"`) is used instead.
`kernel.order`	a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are `kernel.order = 2` (default) and `kernel.order = 4`.
`gpd.fit.method`	a character string specifying the method used to estimate the scale and shape parameters of the GP distribution. Possible options are `"mle"` (default) for maximum likelihood estimates (Coles, 2001), or `"moment"` for moment estimates (de Haan and Ferreira, 2006).
`threshold.min`	a threshold value for residuals with low values, which is used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If `threshold.min = NULL` (default), the threshold is estimated as the 90% quantile of the smoothing residuals.
`threshold.max`	a threshold value for residuals with high values, which is used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If `threshold.max = NULL` (default), the threshold is estimated as the 90% quantile of the smoothing residuals.
`k.min`	a positive integer for residuals with low values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of the shape and scale parameters of the GP distribution. Default is `k.min = round(length(na.omit(x)) * 0.1)`.
`k.max`	a positive integer for residuals with high values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of the shape and scale parameters of the GP distribution. Default is `k.max = round(length(na.omit(x)) * 0.1)`.
`extremal.index.min`	a numeric value giving the extremal index for identification of outliers with extremely low value. If `extremal.index.min = NULL` (default), the extremal index is estimated using the method specified by the parameter `extremal.index.type`.
`extremal.index.max`	a numeric value giving the extremal index for identification of outliers with extremely high value. If `extremal.index.max = NULL` (default), the extremal index is estimated using the method specified by the parameter `extremal.index.type`.
`extremal.index.type`	a character string specifying the type of extremal index estimate. Possible options are: `"block.maxima"` (default) for block maxima estimator (Gomes, 1993); `"intervals"` for intervals estimator (Ferro and Segers, 2003); `"censored"` for censored estimator (Holesovsky and Fusek, 2020); `"Kgaps"` for K-gaps estimator (Suveges and Davison, 2010); `"sliding.blocks"` for sliding blocks estimator (Northrop, 2015); `"runs"` for runs estimator (Smith and Weissman, 1994).
`block.length.min`	a numeric value for residuals with low values giving the length of blocks for estimation of the extremal index. Only required for `extremal.index.type = "block.maxima"` and `extremal.index.type = "sliding.blocks"`. Default is `block.length.min = round(sqrt(length(na.omit(x))))`.
`block.length.max`	a numeric value for residuals with high values giving the length of blocks for estimation of the extremal index. Only required for `extremal.index.type = "block.maxima"` and `extremal.index.type = "sliding.blocks"`. Default is `block.length.max = round(sqrt(length(na.omit(x))))`.
`D.min`	a nonnegative integer for residuals with low values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for `extremal.index.type = "censored"`.
`D.max`	a nonnegative integer for residuals with high values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for `extremal.index.type = "censored"`.
`K.min`	a nonnegative integer for residuals with low values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for `extremal.index.type = "Kgaps"`.
`K.max`	a nonnegative integer for residuals with high values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for `extremal.index.type = "Kgaps"`.
`r.min`	a positive integer for residuals with low values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for `extremal.index.type = "runs"`.
`r.max`	a positive integer for residuals with high values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for `extremal.index.type = "runs"`.
`return.period`	a positive numeric value giving the return period. Default is `return.period = 120`, which means that observations whose values are exceeded on average once every 120 observations are detected as outliers.
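
The data-dependent defaults listed above can be reproduced with base R. The sketch below evaluates the default expressions from the Usage section on a made-up vector containing missing values; the vector `x` and the variable names `n`, `k_default` and `block_default` are illustrative assumptions, not part of the package.

```r
# Hypothetical input: 200 observations, 4 of them missing
set.seed(1)
x <- rnorm(200)
x[c(3, 50, 77, 120)] <- NA

# Missing values are dropped before the defaults are computed
n <- length(na.omit(x))

# Default for k.min and k.max: about 10% of the non-missing observations
k_default <- round(n * 0.1)

# Default for block.length.min and block.length.max: square root of the sample size
block_default <- round(sqrt(n))

print(c(n = n, k = k_default, block.length = block_default))
```

Because the defaults depend on the number of non-missing observations, series with many gaps get fewer order statistics and shorter blocks than their raw length would suggest.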

Details

This function identifies outliers in time series using a two-step procedure (Holesovsky et al., 2018). The procedure consists of kernel smoothing followed by extreme value estimation of high threshold exceedances of the smoothing residuals. Outliers with both extremely high and extremely low values are identified. The choice of return period, the parameter defining the criterion for outlier detection, is crucial for the method. Outliers with extremely high values are detected as observations whose values are exceeded on average once in every `return.period` observations. Outliers with extremely low values are identified analogously.
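
The first part of the two-step procedure can be sketched in base R. This is a minimal illustration on synthetic data, not the package's implementation: `stats::ksmooth` with a fixed bandwidth stands in for the plug-in bandwidth kernel smoother, and the series, the bandwidth of 20 and all variable names are assumptions made for the example.

```r
set.seed(42)
n <- 300
idx <- seq_len(n)

# Synthetic series: smooth trend plus noise, with two injected high spikes
x <- sin(2 * pi * idx / 100) + rnorm(n, sd = 0.2)
x[c(60, 210)] <- x[c(60, 210)] + 3

# Step 1: kernel smoothing (ksmooth is a stand-in, see the note above)
fit <- ksmooth(idx, x, kernel = "normal", bandwidth = 20, x.points = idx)
res <- x - fit$y                      # smoothing residuals

# Step 2 (preparation): threshold at the 90% quantile of the residuals
u_max <- quantile(res, 0.9, names = FALSE)
excess <- res[res > u_max] - u_max    # excesses over the threshold
lambda_u <- mean(res > u_max)         # relative frequency of exceedances

# For the low tail, the same steps can be applied to the negated residuals
u_min <- quantile(-res, 0.9, names = FALSE)
```

A generalized Pareto distribution would then be fitted to `excess`, and the fitted tail used to derive the level exceeded on average once per return period.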

Value

A "KRDetect" object which contains a list with elements:

`method.type`	a character string giving the type of method used for outlier identification
`x`	a numeric vector of observations
`index`	a numeric vector of index design points assigned to individual observations
`smoothed`	a numeric vector of estimates of the kernel regression function (smoothed data)
`GPD.fit.method`	the method used for the estimate of the scale and shape parameters of GP distribution
`extremal.index.type`	the type of extremal index estimate used for the identification of outliers
`sigma.min`	a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely low value
`sigma.max`	a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely high value
`xi.min`	a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely low value
`xi.max`	a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely high value
`lambda_u.min`	a numeric value giving the relative frequency of threshold exceedances, used for identification of outliers with extremely low value. The value is returned only for `gpd.fit.method = "mle"`.
`lambda_u.max`	a numeric value giving the relative frequency of threshold exceedances, used for identification of outliers with extremely high value. The value is returned only for `gpd.fit.method = "mle"`.
`extremal.index.min`	a numeric value giving extremal index used for identification of outliers with extremely low value
`extremal.index.max`	a numeric value giving extremal index used for identification of outliers with extremely high value
`threshold.min`	a numeric value giving threshold value used for identification of outliers with extremely low value.
`threshold.max`	a numeric value giving threshold value used for identification of outliers with extremely high value.
`return.level.min`	a numeric value giving return level used for identification of outliers with extremely low value
`return.level.max`	a numeric value giving return level used for identification of outliers with extremely high value
`outlier.min`	a logical vector specifying the identified outliers with extremely low value. `TRUE` means that the corresponding observation from vector `x` is detected as an outlier
`outlier.max`	a logical vector specifying the identified outliers with extremely high value. `TRUE` means that the corresponding observation from vector `x` is detected as an outlier
`outlier`	a logical vector specifying the identified outliers with either extremely low or extremely high value. `TRUE` means that the corresponding observation from vector `x` is detected as an outlier
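
To show how the returned quantities relate to each other, the sketch below computes a return level from a threshold, GPD scale and shape, exceedance frequency and extremal index, using the standard peaks-over-threshold formula from Coles (2001). Whether the package uses exactly this form is an assumption. The helper `return_level`, the parameter values and the residual vector are hypothetical, and treating the low tail via negated residuals and combining the flags with a logical OR are likewise assumptions.

```r
# Level exceeded on average once every m observations for a GPD tail above u
return_level <- function(u, sigma, xi, lambda_u, theta = 1, m = 120) {
  if (abs(xi) < 1e-8) {
    u + sigma * log(m * lambda_u * theta)          # limiting case xi -> 0
  } else {
    u + sigma / xi * ((m * lambda_u * theta)^xi - 1)
  }
}

# Hypothetical parameter values, chosen only for illustration
rl_max <- return_level(u = 1.5, sigma = 0.8, xi = 0.1, lambda_u = 0.1, theta = 0.7)
rl_neg <- return_level(u = 1.2, sigma = 0.6, xi = 0.05, lambda_u = 0.1, theta = 0.8)

# Hypothetical smoothing residuals
res <- c(0.2, 5.1, -0.4, 1.9, -6.0)

outlier_max <- res > rl_max        # extremely high values
outlier_min <- -res > rl_neg       # extremely low values, via negated residuals
outlier <- outlier_min | outlier_max

print(data.frame(res, outlier_min, outlier_max, outlier))
```

An extremal index below 1 reflects clustering of extremes and lowers the return level for a given return period, so more observations are flagged than under an independence assumption.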

References

Holesovsky J, Campulova M, Michalek J (2018). Semiparametric Outlier Detection in Nonstationary Times Series: Case Study for Atmospheric Pollution in Brno, Czech Republic. Atmospheric Pollution Research, 9(1).

Gasser T, Kneip A, Koehler W (1991). A flexible and fast method for automatic smoothing. Journal of the American Statistical Association, 86, 643-652. https://doi.org/10.2307/2290393

Herrmann E (1997). Local bandwidth choice in kernel regression estimation. Journal of Computational and Graphical Statistics, 6, 35-54.

Herrmann E, Maechler M (2013). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-5, URL http://CRAN.R-project.org/package=lokern.

Gasser, T, Muller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.

Gomes M (1993). On the estimation of parameters of rare events in environmental time series. In Statistics for the Environment, volume 2 of Water Related Issues, pp. 225-241. Wiley.

Ferro, CAT, Segers, J (2003). Inference for Clusters of Extreme Values. Journal of the Royal Statistical Society, Series B, 65(2), 545-556.

Holesovsky, J, Fusek, M (2020). Estimation of the Extremal Index Using Censored Distributions. Extremes, In Press.

Suveges, M, Davison, AC (2010). Model Misspecification in Peaks Over Threshold Analysis. The Annals of Applied Statistics, 4(1), 203-221.

Northrop, PJ (2015). An Efficient Semiparametric Maxima Estimator of the Extremal Index. Extremes, 18, 585-603.

Smith, RL, Weissman, I (1994). Estimating the Extremal Index. Journal of the Royal Statistical Society, Series B, 56, 515-529.

Heffernan JE, Stephenson AG (2016). ismev: An Introduction to Statistical Modeling of Extreme Values. R package version 1.41, URL http://CRAN.R-project.org/package=ismev.

Coles S (2001). An Introduction to Statistical Modeling of Extreme Values. 3 edition. London: Springer. ISBN 1-85233-459-2.

de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.

Pickands J (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), 119-131.

Examples

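library(envoutliers)

# Ozone (o3) values for December 2002 from the openair example data
data("mydata", package = "openair")
x <- mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]

# Detect outliers with the default settings
result <- KRDetect.outliers.EV(x)
summary(result)

# Plot the result, then the "min" and "max" plot types separately
plot(result)
plot(result, plot.type = "min")
plot(result, plot.type = "max")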

[Package envoutliers version 1.1.0 Index]