detect_outliers {tsrobprep} | R Documentation |
Detects unreliable outliers in univariate time series data based on model-based clustering
Description
This function applies finite mixture modelling to compute the probability that each observation in a univariate time series is an outlier. Using the mclust package, the data are assigned to G clusters, one of which is modelled as an outlier cluster. The clustering is based on features designed to distinguish normal from outlying observations. Besides computing the probability of each observation being outlying, the function can also report the specific cause in terms of the responsible feature or feature combination.
Usage
detect_outliers(
  data,
  S,
  proba = 0.5,
  share = NULL,
  repetitions = 10,
  decomp = T,
  PComp = F,
  detection.parameter = 1,
  out.par = 2,
  max.cluster = 9,
  G = NULL,
  modelName = "VVV",
  feat.inf = F,
  ext.val = 1,
  ...
)
Arguments
data |
a one-dimensional matrix or data frame without missing data; each row is an observation. |
S |
numeric vector with one value for each seasonality present in data. |
proba |
denotes the probability threshold (between 0 and 1) from which an observation is considered outlying. By default set to 0.5. Lowering proba increases the number of detected outliers. |
share |
controls the size of the subsample used for estimation, as a fraction between 0 and 1. By default set to pmin(2*round(length(data)^(sqrt(2)/2)), length(data))/length(data). Together with the repetitions parameter, it controls the robustness and computational time of the method. |
repetitions |
denotes the number of times the clustering is repeated. By default set to 10. Controls the robustness and computational time of the method. |
decomp |
logical value indicating whether seasonal decomposition is applied to the original time series as a pre-processing step before feature modelling. By default set to TRUE. |
PComp |
logical value indicating whether the principal components of the modelled feature matrix are used instead of the features themselves. By default set to FALSE. |
detection.parameter |
denotes a parameter that regulates the detection sensitivity. By default set to 1. The outlier cluster is assumed to follow a (multivariate) Gaussian distribution parameterized by the sample mean and an inflated sample covariance matrix of the feature space. The covariance matrix is inflated by the factor detection.parameter * (2 * log(length(data)))^2. Larger values restrict detection to more extreme outliers. |
out.par |
controls the number of artificially generated outliers that allow the outlier cluster to form. By default out.par is set to 2. Increasing it corresponds to assuming a larger share of outliers in the data. A priori, out.par * ceiling(sqrt(nrow(data.original))) observations are assumed to be outlying. |
max.cluster |
a single numeric value controlling the maximum number of allowed clusters. By default set to 9. |
G |
denotes the number of clusters, limited by the max.cluster parameter. By default G is set to NULL, in which case it is selected automatically based on the BIC. |
modelName |
denotes the geometric features of the covariance matrix, e.g. "EII", "VII", "EEI", "EVI", "VEI", "VVI". By default modelName is set to "VVV". The help file for mclustModelNames describes the available models. The choice of modelName influences both the fit to the data and the computational time. |
feat.inf |
logical value indicating whether influential features or feature combinations should be computed. By default set to FALSE. |
ext.val |
denotes the number of observations on each side of an identified outlier that should also be treated as outlying data. By default set to 1. |
... |
additional arguments for the Mclust function. |
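The formulas quoted in the argument descriptions above can be evaluated directly. The following is only a sketch of that arithmetic for a hypothetical series length n; the variable names are illustrative, and nrow(data.original) is assumed to equal n.

```r
# Hypothetical series length (e.g. the 3001 rows selected by id <- 14000:17000)
n <- 3001

# Default subsample share, as quoted for the 'share' argument
default.share <- pmin(2 * round(n^(sqrt(2) / 2)), n) / n

# Factor by which the outlier-cluster covariance matrix is inflated
detection.parameter <- 1
cov.inflation <- detection.parameter * (2 * log(n))^2

# A priori number of outlying observations
# (assumption: nrow(data.original) equals n)
out.par <- 2
n.prior.outliers <- out.par * ceiling(sqrt(n))

c(share = default.share, inflation = cov.inflation, prior = n.prior.outliers)
```

Because the default share grows more slowly than n, longer series are estimated on a proportionally smaller subsample.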
Details
Outliers are detected by model-based clustering using parameterized
finite Gaussian mixture models.
The clusters are estimated with the Mclust function.
Models are estimated by the EM algorithm, initialized by hierarchical
model-based agglomerative clustering. The optimal model is selected
according to the BIC.
The following features, derived from the input data, are used in the clustering process:
- org.series
denotes the scaled and potentially decomposed original time series.
- seasonality
denotes deterministic seasonalities based on S.
- gradient
denotes the sum of the two-sided gradient of org.series.
- abs.gradient
denotes the sum of the absolute two-sided gradient of org.series.
- rel.gradient
denotes the sum of the two-sided absolute gradient of org.series, signed by the left-sided gradient and expressed relative to the rolling mean absolute deviation based on the most relevant seasonality in S.
- abs.seas.grad
denotes the sum of the absolute two-sided seasonal gradient of org.series based on the seasonalities S.
If PComp = TRUE, the features are replaced by the principal components of this feature space.
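To make the gradient features more concrete, here is a minimal sketch of one plausible reading of gradient and abs.gradient on a toy series. It is not the package's implementation: the definition of the one-sided differences, the zero padding at the boundaries, and the omission of any scaling are all assumptions made for illustration.

```r
# Toy series with a single spike at position 5
x <- c(1, 2, 3, 4, 20, 6, 7, 8)

# One-sided differences; the series ends are padded with 0 (an assumption)
left  <- c(0, diff(x))    # x[t] - x[t - 1]
right <- c(-diff(x), 0)   # x[t] - x[t + 1]

# Sum of the signed and of the absolute two-sided gradients
gradient     <- left + right
abs.gradient <- abs(left) + abs(right)

# Position at which the absolute two-sided gradient is largest
which.max(abs.gradient)
```

Under this reading, an isolated spike produces a large value in both features at its own position, while a smooth trend largely cancels in the signed gradient.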
Value
a list containing the following elements:
data |
numeric vector containing the original data. |
outlier.pos |
a vector indicating the position of each outlier together with its neighbourhood, as controlled by ext.val. |
outlier.pos.raw |
a vector indicating the position of each outlier. |
outlier.probs |
a vector containing, for each observation, the probability of being outlying data. |
Repetitions |
provides a list for each repetition containing the estimated model, the outlier cluster, the probabilities of each observation belonging to the estimated clusters, the outlier positions, the influence of each feature or feature combination on the identified outlying data, the corresponding probabilities after shifting each considered outlier to the feature mean, and the subset of the extended feature matrix used for estimation (including artificially introduced outliers). |
features |
the feature matrix. Each column is a feature. |
inf.feature.combinations |
a list containing the features or feature combinations that caused assignment to the outlier cluster. |
feature.inf.tab |
a matrix containing all possible feature combinations. |
PC |
an object of class "princomp" containing the principal component analysis of the feature matrix. |
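The relation between outlier.probs, outlier.pos.raw and outlier.pos can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the probabilities are made up, the threshold is applied as a strict inequality, and positions are clipped to the range of the series.

```r
# Hypothetical outlier probabilities for 10 observations
outlier.probs <- c(0.01, 0.02, 0.90, 0.03, 0.01, 0.02, 0.04, 0.75, 0.02, 0.01)
proba   <- 0.5
ext.val <- 1
n <- length(outlier.probs)

# Raw outlier positions: probability above the threshold
# (strict inequality is an assumption)
outlier.pos.raw <- which(outlier.probs > proba)

# Add ext.val neighbours on each side of every raw position
offsets <- seq(-ext.val, ext.val)
outlier.pos <- sort(unique(as.vector(outer(outlier.pos.raw, offsets, "+"))))

# Keep only positions that exist in the series
outlier.pos <- outlier.pos[outlier.pos >= 1 & outlier.pos <= n]

outlier.pos.raw
outlier.pos
```

With ext.val = 0 the two vectors coincide, so the neighbourhood extension can be switched off without changing the raw detection.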
References
Narajewski M, Kley-Holsteg J, Ziel F (2021). “tsrobprep — an R package for robust preprocessing of time series data.” SoftwareX, 16, 100809. doi: 10.1016/j.softx.2021.100809.
See Also
model_missing_data, impute_modelled_data, auto_data_cleaning
Examples
## Not run:
set.seed(1)
id <- 14000:17000
# Replace missing values
modelmd <- model_missing_data(data = GBload[id, -1], tau = 0.5,
                              S = c(48, 336), indices.to.fix = seq_len(nrow(GBload[id, ])),
                              consider.as.missing = 0, min.val = 0)
# Impute missing values
data.imputed <- impute_modelled_data(modelmd)
# Detect outliers
system.time(
  o.ident <- detect_outliers(data = data.imputed, S = c(48, 336))
)
# Plot of identified outliers in time series
outlier.vector <- rep(FALSE, length(data.imputed))
outlier.vector[o.ident$outlier.pos] <- TRUE
plot(data.imputed, type = "o", col = 1 + 1 * outlier.vector,
     pch = 1 + 18 * outlier.vector)
# Table of identified raw outliers and their probabilities of being outlying data
df <- data.frame(o.ident$outlier.pos.raw,
                 unlist(o.ident$outlier.probs)[o.ident$outlier.pos.raw])
colnames(df) <- c("Outlier position", "Probability of being outlying data")
df
# Plot of feature matrix
plot.ts(o.ident$features, type = "o",
        col = 1 + outlier.vector,
        pch = 1 + 1 * outlier.vector)
# Table of outliers and the corresponding features/feature combinations
# that caused assignment to the outlier cluster
# Detect outliers with feat.inf = TRUE
set.seed(1)
system.time(
  o.ident <- detect_outliers(data = data.imputed, S = c(48, 336), feat.inf = TRUE)
)
feature.imp <- unlist(lapply(o.ident$inf.feature.combinations,
                             function(x) paste(o.ident$feature.inf.tab[x], collapse = " | ")))
df <- data.frame(o.ident$outlier.pos.raw,
                 o.ident$outlier.probs[o.ident$outlier.pos.raw],
                 feature.imp[as.numeric(names(feature.imp)) %in% o.ident$outlier.pos.raw])
colnames(df) <- c("Outlier position", "Probability of being outlying data", "Responsible features")
View(df)
## End(Not run)