ratioSize {univOutl}    R Documentation

Identifies outliers on ratios and filters them by a size measure

Description

Identifies outliers on transformed ratios (centered with respect to their median) using the adjusted boxplot for skewed distributions. Outliers can be sorted and filtered according to a size measure.

Usage

ratioSize(numerator, denominator, id=NULL,
          size=NULL, U=1, size.th=NULL, return.dataframe=FALSE)

Arguments

numerator

Numeric vector with the values used as the numerator of the ratio

denominator

Numeric vector with the values used as the denominator of the ratio

id

Optional numeric or character vector with the identifiers of the units. If id=NULL, the unit identifiers are set equal to their positions in the input vectors.

size

Optional numeric vector providing a measure of the importance of each ratio. If size=NULL, the size measure is the larger of the numerator and the denominator of each ratio (which makes sense if both variables are measured in the same unit). The importance of the observations is also controlled by the argument U.

U

Numeric constant with 0 < U <= 1, controlling the importance of each unit. In practice the final size measure is derived as (size^U). Commonly used values are 0.4, 0.5 or 1 (default).

size.th

Numeric, size threshold. Can be specified when a size measure is used. In that case only outliers with a size greater than the threshold are returned. Note that when the argument U is not equal to 1, the final threshold is size.th^U.

return.dataframe

Logical. If TRUE, the output stores all the information relevant for outlier detection in a data frame with the following columns: ‘id’ (units' identifiers), ‘numerator’, ‘denominator’, ‘ratio’ (= numerator/denominator), ‘c.ratio’ (centered ratios, see Details), ‘sizeU’ (size^U values) and finally ‘outliers’, where the value 1 marks an observation detected as an outlier and 0 any other observation.
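
The interplay of size, U and size.th can be illustrated with a minimal base R sketch. It does not call ratioSize; the input values are hypothetical and the variable names sizeU, thU and big are chosen only for illustration.

```r
# Hypothetical inputs, for illustration only
numerator   <- c(120, 45, 300, 80)
denominator <- c(100, 50, 150, 75)
U       <- 0.5
size.th <- 100

# Default size measure described for size = NULL:
# the larger of numerator and denominator, unit by unit
size <- pmax(numerator, denominator)

# Final size measure and final threshold, both raised to the power U
sizeU <- size^U
thU   <- size.th^U

# Units whose final size exceeds the final threshold
big <- sizeU > thU
data.frame(size, sizeU, big)
```

Because U is positive and the sizes here are non-negative, comparing size^U with size.th^U selects the same units as comparing size with size.th; U changes the scale of the reported size measure, not which units pass the threshold.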

Details

This function searches for outliers starting from the ratios r=numerator/denominator. First the ratios are centered around their median, as in the Hidiroglou and Berthelot (1986) procedure (see HBmethod); then outlier identification is based on the adjusted boxplot for skewed distributions (Hubert and Vandervieren 2008) (see adjboxStats). The subset of outliers is sorted in decreasing order according to the size measure. If a size threshold is provided, only outliers with (size^U) > (size.th^U) are returned.
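
The two steps described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the ratios are hypothetical, naive_mc is a helper written here (a brute-force medcouple that assumes no observation ties with the median), and the quartiles come from quantile(), so the fences may differ slightly from those computed by adjboxStats.

```r
# Hypothetical ratios: 29 regular values plus one extreme ratio in position 10
r <- (90:119) / 100
r[10] <- 2

# Step 1: center the ratios on their median (Hidiroglou-Berthelot transformation)
med_r <- median(r)
s <- ifelse(r < med_r, 1 - med_r / r, r / med_r - 1)

# Step 2: naive O(n^2) medcouple (hypothetical helper; assumes no ties with the median)
naive_mc <- function(x) {
  m  <- median(x)
  lo <- x[x < m]
  hi <- x[x > m]
  h  <- outer(lo, hi, function(a, b) ((b - m) - (m - a)) / (b - a))
  median(h)
}
mc <- naive_mc(s)

# Step 3: adjusted boxplot fences (Hubert and Vandervieren 2008)
q   <- quantile(s, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
if (mc >= 0) {
  bounds <- c(q[1] - 1.5 * exp(-4 * mc) * iqr, q[2] + 1.5 * exp(3 * mc) * iqr)
} else {
  bounds <- c(q[1] - 1.5 * exp(-3 * mc) * iqr, q[2] + 1.5 * exp(4 * mc) * iqr)
}

# Positions of the centered ratios falling outside the fences
outl <- which(s < bounds[1] | s > bounds[2])
outl
```

The centering maps ratios below the median to negative scores and ratios above it to positive scores, so that unusually small and unusually large ratios are treated on a comparable scale before the fences are applied.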

Value

A list whose components depend on the return.dataframe argument. When return.dataframe=FALSE only the following components are returned:

median.r

The median of the ratios

bounds

The bounds of the interval for centered ratios

excluded

The positions or the identifiers of the units excluded from the computations because of 0s or NAs.

outliers

The positions or the identifiers of the units detected as outliers. Remember that when size.th is set, only outliers with (size^U) > (size.th^U) are returned.

When return.dataframe=TRUE the latter two components are replaced by two data frames:

excluded

A data frame with the subset of excluded observations

data

A data frame with the observations that were not excluded, with the following columns: ‘id’ (units' identifiers), ‘numerator’, ‘denominator’, ‘ratio’ (= numerator/denominator), ‘c.ratio’ (centered ratios, see Details), ‘sizeU’ (size^U values) and finally ‘outliers’, where the value 1 marks an observation detected as an outlier and 0 any other observation. The data frame is sorted in decreasing order of size^U. Note that when a size threshold is provided, ONLY outliers with (size^U) > (size.th^U) are returned.
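
As a rough illustration of how a data frame with these columns might be used downstream, the sketch below builds a hypothetical stand-in (dd) by hand rather than calling ratioSize. The values and the outlier flags are invented for the example; only the column names follow the description above.

```r
# Hypothetical stand-in for the 'data' component, with the documented columns
dd <- data.frame(
  id          = c("a", "b", "c", "d"),
  numerator   = c(120, 45, 300, 80),
  denominator = c(100, 50, 150, 75)
)
dd$ratio <- dd$numerator / dd$denominator

# Centered ratios, following the transformation referred to in Details
med <- median(dd$ratio)
dd$c.ratio <- ifelse(dd$ratio < med, 1 - med / dd$ratio, dd$ratio / med - 1)

# Size measure with U = 1; outlier flags set by hand, for illustration only
dd$sizeU    <- pmax(dd$numerator, dd$denominator)^1
dd$outliers <- c(0, 1, 1, 0)

# Keep the flagged units and list them from the largest to the smallest size
flagged <- dd[dd$outliers == 1, ]
flagged <- flagged[order(flagged$sizeU, decreasing = TRUE), ]
flagged[, c("id", "ratio", "sizeU")]
```

Sorting by sizeU puts the flagged units with the largest size measure first, which is the order in which they would typically be reviewed.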

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

Hidiroglou, M.A. and Berthelot, J.-M. (1986) ‘Statistical Editing and Imputation for Periodic Business Surveys’, Survey Methodology, 12, pp. 73-83.

Hubert, M., and Vandervieren, E. (2008) ‘An Adjusted Boxplot for Skewed Distributions’, Computational Statistics and Data Analysis, 52, pp. 5186-5201.

See Also

HBmethod, plot4ratios, boxB, adjboxStats

Examples


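library(univOutl)

# simulate a denominator and a numerator whose ratio lies mostly in [0.9, 1.2]
set.seed(444)
x1 <- rnorm(30, 50, 5)
set.seed(555)
rr <- runif(30, 0.9, 1.2)
rr[10] <- 2   # make unit 10 an extreme ratio
x2 <- x1 * rr

# default output: list with median.r, bounds, excluded and outliers
out <- ratioSize(numerator = x2, denominator = x1)
out

# same analysis, with the detailed data frame output
out <- ratioSize(numerator = x2, denominator = x1,
                 return.dataframe = TRUE)
head(out$data)

# apply a size threshold (default size measure, U = 1)
out <- ratioSize(numerator = x2, denominator = x1,
                 size.th = 65, return.dataframe = TRUE)
head(out$data)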


	

[Package univOutl version 0.4 Index]