HBmethod {univOutl} | R Documentation |
Hidiroglou-Berthelot procedure for detecting outliers with periodic data
Description
This function implements the method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data, i.e. when the same variable is measured at two time points.
Usage
HBmethod(yt1, yt2, U=0.5, A=0.05, C=4, pct=0.25,
id=NULL, std.score=FALSE, return.dataframe=FALSE, adjboxE=FALSE)
Arguments
yt1 |
Numeric vector providing the values observed at time t1. |
yt2 |
Numeric vector providing the values observed at time t2 (t2 > t1). |
U |
Numeric, parameter needed to determine the ‘importance’ of a ratio. The value should lie in [0, 1] interval; commonly used values are 0.3, 0.4, or 0.5 (default) (see Details for further information). |
A |
Numeric, parameter needed when computing the scale measure used to derive the bounds. Hidiroglou and Berthelot (1986) suggest setting |
C |
Numeric, parameter determining the extension of the interval; greater values will provide larger intervals, i.e. fewer expected outliers. Values commonly used are 4 (default) or 7, but also values close or grater than 40 can be used in some particular cases. Note that two C values can be provided instead of one, the first one will be used to determine the left tail bound, while the second determines the right tail bound; this setting can help in improving outlier detection in skewed distributions (see Details for further information). |
pct |
Numeric, the percentage point of the scores that will be used to calculate the lower and upper bounds. By default, |
id |
Optional numeric or character vector, with identifiers of units. If |
std.score |
Logical, if |
return.dataframe |
Logical, if |
adjboxE |
Logical (default |
Details
The method proposed by Hidiroglou and Berthelot (1986) to identify outliers in periodic data consists in deriving a score variable based on the ratios r_i = y_{i,t2}/y_{i,t1}
(yt2/yt1
) with i=1,2,\ldots, n
being n
the number of observations after discarding NAs and 0s in both yt1
and yt2
.
At first the ratios are centered around their median r_M
:
s_i = 1 - r_M/r_i \qquad \mbox{if} \qquad 0 < r_i < r_M
s_i = r_i/r_M - 1 \qquad \mbox{if} \qquad r_i \geq r_M
Then, in order to account for the magnitude of data, the following score is derived:
E_i = s_i \times [ max(y_{i,t1},y_{i,t2} ) ]^U
Finally, the interval is calculated as:
(E_M-C \times d_{Q1}, E_M+C\times d_{Q3} )
where
d_{Q1} = max( E_M - E_{Q1}, |A \times E_M| )
and d_{Q3} = max( E_{Q3} - E_M, |A \times E_M| )
being E_{Q1}
, E_M
and E_{Q3}
the quartiles of the E scores (when pct = 0.25
, default)).
Recently Hidiroglou and Emond (2018) suggest using percentiles P10 and P90 of the E scores in replacement of respectively Q1 and Q3 to avoid the drawback of many units identified as outliers; this is likely to occur when a large proportion of units (>1/4) has the same ratio. P10 and P90 are achieved by setting pct = 0.10
when running the function.
In practice, all the units with an E score outside the interval are considered as outliers. Notice that when two C
values are provided, then the first is used to derive the left bound while the second determines the right bound.
When std.score=TRUE
a standardized score is derived in the following manner:
z_{E,i} = g \times \frac{E_i - E_M}{d_{Q1}} \qquad \mbox{if} \qquad E_i < E_M
z_{E,i} = g \times \frac{E_i - E_M}{d_{Q3}} \qquad \mbox{if} \qquad E_i \geq E_M
The constant g is set equal to qnorm(1-pct)
and makes d_{Q1}
and d_{Q3}
approximately unbiased estimators when the E scores follow the normal distribution.
When adjboxE = TRUE
outliers on the E scores will all be searched using the boxplot adjusted for skewness as implemented in the function boxB
when run with with argument method = "adjbox"
.
Value
A list whose components depend on the return.dataframe
argument. When return.dataframe=FALSE
just the following components are provided:
median.r |
the median of the ratios |
quartiles.E |
Quartiles of the E score |
bounds.E |
Bounds of the interval of the E score, values outside are considered outliers. |
excluded |
The identifiers or positions (when |
outliers |
The identifiers or positions (when |
outliersBB |
The identifiers or positions (when |
When return.dataframe=TRUE
, the first three components remain the same with, in addition, two dataframes:
excluded |
A dataframe with the subset of observations excluded. The data frame has the following columns: 'id' (units' identifiers), 'yt1' columns 'yt2' |
data |
A dataframe with the the not excluded observations and the following columns: ‘id’ (units' identifiers), ‘yt1’, ‘yt2’, ‘ratio’ (= yt1/yt2), ‘sizeU’ (=max(yt1, yt2)^U),‘Escore’ (the E scores, see Details), ‘std.Escore’ (the standardized E scores when |
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical editing and Imputation for Periodic Business Surveys’. Survey Methodology, Vol 12, pp. 73-83.
Hidiroglou, M.A. and Emond, N. (2018) ‘Modifying the Hidiroglou-Berthelot (HB) method’. Unpublished note, Business Survey Methods Division, Statistics Canada, May 18 2018.
See Also
Examples
set.seed(222)
x0 <- rnorm(30, 50, 5)
x0[1] <- NA
set.seed(333)
rr <- runif(30, 0.9, 1.2)
rr[10] <- 2
x1 <- x0 * rr
x1[20] <- 0
out <- HBmethod(yt1 = x0, yt2 = x1)
out$excluded
out$median.r
out$bounds.E
out$outliers
cbind(x0[out$outliers], x1[out$outliers])
out <- HBmethod(yt1 = x0, yt2 = x1,
return.dataframe = TRUE)
out$excluded
head(out$data)