acc_univariate_outlier {dataquieR} R Documentation

## Function to identify univariate outliers by four different approaches

### Description

A classical but still popular approach to detect univariate outliers is the boxplot method introduced by Tukey (1977). The boxplot is a simple graphical tool to display information about continuous univariate data (e.g., median, lower and upper quartile). Outliers are defined as values lying more than 1.5 × IQR below the 1st quartile (Q25) or above the 3rd quartile (Q75). The strength of Tukey's method is that it makes no distributional assumptions and is thus also applicable to skewed or non-mound-shaped data (Marsh and Seo 2006). Nevertheless, this method frequently flags measurements that are then falsely interpreted as true outliers.
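As an illustration of the Tukey rule described above, the following base R sketch computes the two fences and returns the values outside them. The helper name `tukey_outliers` and the toy data are assumptions made for this example; they are not part of dataquieR, and the package's internal quantile handling may differ.

```r
# Sketch of the Tukey boxplot rule (hypothetical helper, not dataquieR code).
# Flags values outside [Q25 - k * IQR, Q75 + k * IQR], with k = 1.5 by default.
tukey_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(lower = lower, upper = upper, flagged = x[x < lower | x > upper])
}

# Toy data: nine regular values and one extreme value
x_tukey <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
res_tukey <- tukey_outliers(x_tukey)
res_tukey$flagged
```

Note that the fences depend on how the quartiles are estimated; `quantile()` is used here with its default settings.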

A somewhat more conservative approach for symmetric and/or normal distributions is the 6 * σ approach, i.e., any measurement outside the interval mean(x) +/- 3 * σ is considered an outlier.
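The 6 * σ rule can be sketched in a few lines of base R. The helper name `sigma_outliers` and the toy data are assumptions for illustration only; σ is estimated here with the sample standard deviation `sd()`, which may or may not match the estimator used inside dataquieR.

```r
# Sketch of the 6 * sigma rule (hypothetical helper, not dataquieR code).
# Flags values outside mean(x) +/- k * sd(x), with k = 3 by default.
sigma_outliers <- function(x, k = 3) {
  x <- x[!is.na(x)]
  m <- mean(x)
  s <- sd(x)
  lower <- m - k * s
  upper <- m + k * s
  list(lower = lower, upper = upper, flagged = x[x < lower | x > upper])
}

# Toy data: twenty regular values and one extreme value
x_sigma <- c(1:20, 200)
res_sigma <- sigma_outliers(x_sigma)
res_sigma$flagged
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss outliers in small samples, which is one reason it behaves more conservatively than the boxplot rule.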

Neither of the methods above is ideally suited to skewed distributions. As many biomarkers, such as laboratory measurements, follow skewed distributions, these methods may be insufficient. The approach of Hubert and Vandervieren (2008) adjusts the boxplot for the skewness of the distribution. It is implemented in several R packages; this implementation in dataquieR uses robustbase::mc.
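The skewness adjustment can be sketched as follows. The fences use the exponential factors published by Hubert and Vandervieren (2008), with the medcouple (MC) as the skewness measure. The helper name `adjusted_fences` and the toy data are assumptions for this example; the medcouple itself is taken from `robustbase::mc`, and that step is skipped if robustbase is not installed.

```r
# Sketch of the skewness-adjusted boxplot fences (hypothetical helper).
# mc_value is the medcouple of x; mc_value = 0 reproduces the Tukey fences.
adjusted_fences <- function(x, mc_value, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[2] - q[1]
  if (mc_value >= 0) {
    # right-skewed data: shrink the lower whisker, stretch the upper one
    c(lower = q[1] - k * exp(-4 * mc_value) * iqr,
      upper = q[2] + k * exp(3 * mc_value) * iqr)
  } else {
    # left-skewed data: stretch the lower whisker, shrink the upper one
    c(lower = q[1] - k * exp(-3 * mc_value) * iqr,
      upper = q[2] + k * exp(4 * mc_value) * iqr)
  }
}

# Toy right-skewed data
x_skew <- c(1, 2, 2, 3, 3, 3, 4, 5, 7, 12, 40)
if (requireNamespace("robustbase", quietly = TRUE)) {
  fences <- adjusted_fences(x_skew, robustbase::mc(x_skew))
  print(x_skew[x_skew < fences["lower"] | x_skew > fences["upper"]])
}
```

For right-skewed data the upper fence moves outward, so fewer values in the long tail are flagged than with the unadjusted boxplot.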

Another, completely heuristic approach is also included to identify outliers. It is based on the assumption that the distances between measurements from the same underlying distribution should be homogeneous. To understand this approach:

• consider an ordered sequence of all measurements.

• between these measurements all distances are calculated.

• the occurrence of larger distances between two neighboring measurements may then indicate a distortion of the data. For the heuristic definition of a large distance, 1 * σ has been chosen.
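The steps above can be sketched in base R as shown below. This is only one possible reading of the heuristic: the helper name `sigma_gap_outliers` is hypothetical, and the choice to flag every value on the far side of a large gap (relative to the median) is an assumption of this example, not a documented detail of dataquieR.

```r
# Sketch of the sigma-gap heuristic (hypothetical helper, not dataquieR code).
# A gap between neighboring sorted values counts as large if it exceeds k * sd(x).
sigma_gap_outliers <- function(x, k = 1) {
  xs <- sort(x)                      # sort() drops missing values
  gaps <- diff(xs)                   # distances between neighbors
  big <- which(gaps > k * sd(xs))    # positions of large gaps
  med <- median(xs)
  flagged <- rep(FALSE, length(xs))
  for (i in big) {
    if (xs[i] >= med) {
      # assumption: gap above the median, flag everything beyond it
      flagged[(i + 1):length(xs)] <- TRUE
    } else {
      # assumption: gap below the median, flag everything before it
      flagged[1:i] <- TRUE
    }
  }
  xs[flagged]
}

# Toy data: values close to 5 and one distant value
x_gap <- c(4.8, 5.1, 5.0, 4.9, 5.2, 5.3, 4.7, 5.0, 9.5)
sigma_gap_outliers(x_gap)
```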

### Usage

acc_univariate_outlier(
resp_vars = NULL,
label_col,
study_data,
meta_data,
exclude_roles,
n_rules = 4
)


### Arguments

| Argument | Description |
|---|---|
| resp_vars | variable list: the name of the continuous measurement variable |
| label_col | variable attribute: the name of the column in the metadata with labels of variables |
| study_data | data.frame: the data frame that contains the measurements |
| meta_data | data.frame: the data frame that contains metadata attributes of study data |
| exclude_roles | variable roles: a character (vector) of variable roles not included |
| n_rules | integer (from = 1, to = 4): the number of rules that must be violated to flag a variable as containing outliers. The default is 4, i.e., all. |

### Value

a list with:

• SummaryTable: data.frame with the columns Variables, Mean, SD, Median, Skewness, Tukey (N), 6-Sigma (N), Hubert (N), Sigma-gap (N), Most likely (N), To low (N), To high (N), Grading

• SummaryPlotList: ggplot2 univariate outlier plots

### Algorithm of this implementation

• Select all variables of type float in the study data

• Remove missing codes from the study data (if defined in the metadata)

• Remove measurements deviating from limits defined in the metadata

• Identify outliers according to the approaches of Tukey (Tukey 1977), SixSigma (Bakar et al. 2006), Hubert (Hubert and Vandervieren 2008), and SigmaGap (heuristic)

• An output data frame is generated which indicates the number of possible outliers, the direction of deviations (too low, too high) for all methods, and a summary score which sums up the deviations of the different rules

• A scatter plot is generated for all examined variables, flagging observations according to the number of violated rules (see the previous step).
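The way the four rules are combined can be illustrated with a small base R sketch. The flag table below is made-up toy data, the column names and the helper `flag_by_rules` are hypothetical, and applying `n_rules` per observation is an assumption of this example based on the description of the `n_rules` argument and of the scatter plot.

```r
# Toy flags for four observations (rows) under the four rules (columns).
# All values are invented for illustration.
flags <- data.frame(
  tukey     = c(FALSE, TRUE,  TRUE, FALSE),
  six_sigma = c(FALSE, FALSE, TRUE, FALSE),
  hubert    = c(FALSE, TRUE,  TRUE, FALSE),
  sigma_gap = c(FALSE, FALSE, TRUE, FALSE)
)

# Number of rules violated by each observation
n_violated <- rowSums(flags)

# Hypothetical helper: an observation is flagged if at least n_rules rules are violated
flag_by_rules <- function(n_violated, n_rules = 4) {
  n_violated >= n_rules
}

flag_by_rules(n_violated)               # default: all four rules must agree
flag_by_rules(n_violated, n_rules = 2)  # more permissive threshold
```

Lowering `n_rules` trades specificity for sensitivity: more observations are flagged, including those on which only some of the rules agree.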