KRDetect.outliers.EV {envoutliers} | R Documentation |
Identification of outliers using extreme value theory
Description
Identification of outliers in environmental data using a semiparametric method based on kernel smoothing and extreme value theory (Holesovsky et al., 2018). The outliers are identified as observations whose values are exceeded on average once in a given period that is specified by the user.
Usage
KRDetect.outliers.EV(x, perform.smoothing = TRUE,
    bandwidth.type = "local", bandwidth.value = NULL, kernel.order = 2,
    gpd.fit.method = "mle", threshold.min = NULL, threshold.max = NULL,
    k.min = round(length(na.omit(x)) * 0.1),
    k.max = round(length(na.omit(x)) * 0.1), extremal.index.min = NULL,
    extremal.index.max = NULL, extremal.index.type = "block.maxima",
    block.length.min = round(sqrt(length(na.omit(x)))),
    block.length.max = round(sqrt(length(na.omit(x)))), D.min = NULL,
    D.max = NULL, K.min = NULL, K.max = NULL, r.min = NULL,
    r.max = NULL, return.period = 120)
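The defaults for k.min, k.max, block.length.min and block.length.max in the signature above depend on the number of non-missing observations. A minimal sketch of how they evaluate, using a hypothetical input vector (the data and the variable names k_default and block_default are illustrative, not part of the package):

```r
# Hypothetical input: 740 observed values plus 4 missing ones.
x <- c(rnorm(740), rep(NA, 4))

# Number of non-missing observations, as used in the default expressions.
n <- length(na.omit(x))

# Default for k.min and k.max: 10 % of the non-missing observations.
k_default <- round(n * 0.1)

# Default for block.length.min and block.length.max: square root of n.
block_default <- round(sqrt(n))

print(c(n = n, k = k_default, block.length = block_default))
```

Because the defaults are computed from na.omit(x), missing values do not inflate the number of order statistics or the block length.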
Arguments
x |
data values. Supported data types
|
perform.smoothing |
a logical value specifying if data smoothing is performed. If |
bandwidth.type |
a character string specifying the type of bandwidth. Possible options are
|
bandwidth.value |
a local bandwidth array (for |
kernel.order |
a nonnegative integer giving the order of the optimal kernel (Gasser et al., 1985) used for smoothing. Possible options are
|
gpd.fit.method |
a character string specifying the method used for the estimate of the scale and shape parameters of GP distribution. Possible options are
|
threshold.min |
a threshold value for residuals with low values, used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If |
threshold.max |
a threshold value for residuals with high values, used to find the maximum likelihood estimates of the shape and scale parameters of the GP distribution and selected types of extremal index estimates (specifically: intervals estimator (Ferro and Segers, 2003), censored estimator (Holesovsky and Fusek, 2020), K-gaps estimator (Suveges and Davison, 2010), runs estimator (Smith and Weissman, 1994)). If |
k.min |
a positive integer for residuals with low values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of shape and scale parameters of GP distribution. Default is |
k.max |
a positive integer for residuals with high values giving the number of largest order statistics used to find the moment estimates (de Haan and Ferreira, 2006) of shape and scale parameters of GP distribution. Default is |
extremal.index.min |
a numeric value giving the extremal index for identification of outliers with extremely low value. If |
extremal.index.max |
a numeric value giving the extremal index for identification of outliers with extremely high value. If |
extremal.index.type |
a character string specifying the type of extremal index estimate. Possible options are
|
block.length.min |
a numeric value for residuals with low values giving the length of blocks for estimation of extremal index. Only required for |
block.length.max |
a numeric value for residuals with high values giving the length of blocks for estimation of extremal index. Only required for |
D.min |
a nonnegative integer for residuals with low values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for |
D.max |
a nonnegative integer for residuals with high values giving the value of D parameter used for censored extremal index estimate (Holesovsky and Fusek, 2020). Only required for |
K.min |
a nonnegative integer for residuals with low values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for |
K.max |
a nonnegative integer for residuals with high values giving the value of K parameter used for K-gaps extremal index estimate (Suveges and Davison, 2010). Only required for |
r.min |
a positive integer for residuals with low values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for |
r.max |
a positive integer for residuals with high values giving the value of runs parameter of runs extremal index estimate (Smith and Weissman, 1994). Only required for |
return.period |
a positive numeric value giving return period. Default is |
Details
This function identifies outliers in time series using a two-step procedure (Holesovsky et al., 2018). The procedure consists of kernel smoothing followed by extreme value estimation of high-threshold exceedances of the smoothing residuals. Outliers with both extremely high and extremely low values are identified. The choice of return period, the parameter defining the criterion for outlier detection, is crucial for the method. Outliers with extremely high values are detected as observations whose values are exceeded on average once in a given return.period of observations. Outliers with extremely low values are identified analogously.
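The two steps can be illustrated with a rough base R sketch. This is not the package's implementation: it assumes stats::ksmooth with a fixed bandwidth as a stand-in smoother, a threshold at the 90 % residual quantile, a hand-written GP likelihood, an extremal index of 1, and the standard peaks-over-threshold return level from Coles (2001). The synthetic data and all variable names are hypothetical.

```r
set.seed(1)

# Synthetic series: smooth trend plus noise, with two injected high spikes.
n   <- 500
idx <- seq_len(n)
x   <- 10 + 3 * sin(2 * pi * idx / 100) + rnorm(n)
x[c(120, 380)] <- x[c(120, 380)] + 12

# Step 1: kernel smoothing (stand-in smoother) and smoothing residuals.
fit <- ksmooth(idx, x, kernel = "normal", bandwidth = 30, x.points = idx)
res <- x - fit$y

# Step 2: GP distribution fitted to exceedances of a high threshold.
u        <- unname(quantile(res, 0.9))  # assumed threshold choice
exc      <- res[res > u] - u            # threshold exceedances
lambda_u <- length(exc) / n             # relative frequency of exceedances

# Negative log-likelihood of the GP distribution (scale sigma, shape xi).
gpd_nll <- function(par, y) {
  sigma <- par[1]
  xi    <- par[2]
  if (sigma <= 0) return(1e10)
  if (abs(xi) < 1e-8) return(length(y) * log(sigma) + sum(y) / sigma)
  z <- 1 + xi * y / sigma
  if (any(z <= 0)) return(1e10)
  length(y) * log(sigma) + (1 / xi + 1) * sum(log(z))
}
est   <- optim(c(mean(exc), 0.1), gpd_nll, y = exc)$par
sigma <- est[1]
xi    <- est[2]

# Return level exceeded on average once per m observations.
theta <- 1    # extremal index; 1 assumes no clustering of extremes
m     <- 120  # return period, matching the default return.period
return_level <- u + sigma / xi * ((m * lambda_u * theta)^xi - 1)

# Residuals above the return level are flagged as high-value outliers.
outlier_max <- res > return_level
print(which(outlier_max))
```

Low-value outliers could be handled the same way by applying the second step to the negated residuals. A larger return period raises the return level and therefore flags fewer observations.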
Value
A "KRDetect" object which contains a list with elements:
method.type |
a character string giving the type of method used for outlier identification |
x |
a numeric vector of observations |
index |
a numeric vector of index design points assigned to individual observations |
smoothed |
a numeric vector of estimates of the kernel regression function (smoothed data) |
GPD.fit.method |
the method used for the estimate of the scale and shape parameters of GP distribution |
extremal.index.type |
the type of extremal index estimate used for the identification of outliers |
sigma.min |
a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely low value |
sigma.max |
a numeric value giving scale parameter of Generalised Pareto distribution used for identification of outliers with extremely high value |
xi.min |
a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely low value |
xi.max |
a numeric value giving shape parameter of Generalised Pareto distribution used for identification of outliers with extremely high value |
lambda_u.min |
a numeric value giving the relative frequency of threshold exceedances, used for identification of outliers with extremely low value. The value of the parameter is returned only for |
lambda_u.max |
a numeric value giving the relative frequency of threshold exceedances, used for identification of outliers with extremely high value. The value of the parameter is returned only for |
extremal.index.min |
a numeric value giving extremal index used for identification of outliers with extremely low value |
extremal.index.max |
a numeric value giving extremal index used for identification of outliers with extremely high value |
threshold.min |
a numeric value giving threshold value used for identification of outliers with extremely low value. |
threshold.max |
a numeric value giving threshold value used for identification of outliers with extremely high value. |
return.level.min |
a numeric value giving return level used for identification of outliers with extremely low value |
return.level.max |
a numeric value giving return level used for identification of outliers with extremely high value |
outlier.min |
a logical vector specifying the identified outliers with extremely low value. |
outlier.max |
a logical vector specifying the identified outliers with extremely high value. |
outlier |
a logical vector specifying the identified outliers with both extremely low and extremely high value. |
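The returned quantities can be related through the usual peaks-over-threshold return level (Coles, 2001), adjusted for the extremal index. The formula below is a sketch under the assumption that the function follows this standard form, which this page does not state explicitly. For the high-value side, u stands for threshold.max, \sigma for sigma.max, \xi for xi.max, \lambda_u for lambda_u.max, \theta for extremal.index.max, m for return.period, and z_m for return.level.max:

```latex
z_m =
\begin{cases}
  u + \dfrac{\sigma}{\xi}\left[(m\,\lambda_u\,\theta)^{\xi} - 1\right], & \xi \neq 0, \\[2ex]
  u + \sigma \log(m\,\lambda_u\,\theta), & \xi = 0.
\end{cases}
```

Under the same assumption, the low-value side is obtained by applying the formula to the negated residuals with the corresponding .min quantities.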
References
Holesovsky J, Campulova M, Michalek J (2018). Semiparametric Outlier Detection in Nonstationary Times Series: Case Study for Atmospheric Pollution in Brno, Czech Republic. Atmospheric Pollution Research, 9(1).
Gasser T, Kneip A, Koehler W (1991). A Flexible and Fast Method for Automatic Smoothing. Journal of the American Statistical Association, 86, 643-652. https://doi.org/10.2307/2290393
Herrmann E (1997). Local Bandwidth Choice in Kernel Regression Estimation. Journal of Computational and Graphical Statistics, 6, 35-54.
Herrmann E, Maechler M (2013). lokern: Kernel Regression Smoothing with Local or Global Plug-in Bandwidth. R package version 1.1-5, URL http://CRAN.R-project.org/package=lokern.
Gasser, T, Muller, H-G, Mammitzsch, V (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, B Met., 47(2), 238-252.
Gomes M (1993). On the estimation of parameter of rare events in environmental time series. In Statistics for the Environment, volume 2 of Water Related Issues, pp. 225-241. Wiley.
Ferro, CAT, Segers, J (2003). Inference for Clusters of Extreme Values. Journal of the Royal Statistical Society, Series B, 65(2), 545-556.
Holesovsky, J, Fusek, M (2020). Estimation of the Extremal Index Using Censored Distributions. Extremes, In Press.
Suveges, M, Davison, AC (2010). Model Misspecification in Peaks Over Threshold Analysis. The Annals of Applied Statistics, 4(1), 203-221.
Northrop, PJ (2015). An Efficient Semiparametric Maxima Estimator of the Extremal Index. Extremes, 18, 585-603.
Smith, RL, Weissman, I (1994). Estimating the Extremal Index. Journal of the Royal Statistical Society, Series B, 56, 515-529.
Heffernan JE, Stephenson AG (2016). ismev: An Introduction to Statistical Modeling of Extreme Values. R package version 1.41, URL http://CRAN.R-project.org/package=ismev.
Coles S (2001). An Introduction to Statistical Modeling of Extreme Values. 3 edition. London: Springer. ISBN 1-85233-459-2.
de Haan, L, Ferreira, A (2006). Extreme Value Theory: An Introduction. Springer.
Pickands J (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3(1), 119-131.
Examples
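# Load the example air-quality data set from the openair package
data("mydata", package = "openair")
# Select ozone concentrations for December 2002
x = mydata$o3[format(mydata$date, "%m %Y") == "12 2002"]
# Identify outliers using the default settings
result = KRDetect.outliers.EV(x)
summary(result)
# Plot with the default plot type, then for low-value ("min") and high-value ("max") outliers
plot(result)
plot(result, plot.type = "min")
plot(result, plot.type = "max")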