check_outliers {performance}    R Documentation

Outliers detection (check for influential observations)

Description

Checks for and locates influential observations (i.e., "outliers") via several distance and/or clustering methods. If several methods are selected, the returned "Outlier" vector will be a composite outlier score, computed as the average of the binary (0 or 1) results of each method. It therefore represents the proportion of the selected methods that classified each observation as an outlier. By default, observations whose composite outlier score is greater than or equal to 0.5 (i.e., that were classified as outliers by at least half of the methods) are classified as outliers. See the Details section below for a description of the methods.
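As a minimal sketch of how such a composite score can be formed, the following base R snippet averages made-up binary results. The matrix `flags` and its values are purely illustrative and do not reflect the package's internal code.

```r
# Hypothetical binary results (1 = flagged) of three methods for five observations
flags <- cbind(
  zscore      = c(0, 1, 0, 1, 0),
  iqr         = c(0, 1, 0, 0, 0),
  mahalanobis = c(0, 1, 1, 0, 0)
)

# Composite score: row-wise average of the binary results
composite <- rowMeans(flags)

# Default decision rule: flagged by at least half of the methods
is_outlier <- composite >= 0.5
is_outlier
```

In this illustration only the second observation reaches the 0.5 cutoff, because it is the only one flagged by at least two of the three methods.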

Usage

check_outliers(x, ...)

## Default S3 method:
check_outliers(
  x,
  method = c("cook", "pareto"),
  threshold = NULL,
  ID = NULL,
  verbose = TRUE,
  ...
)

## S3 method for class 'numeric'
check_outliers(x, method = "zscore_robust", threshold = NULL, ...)

## S3 method for class 'data.frame'
check_outliers(x, method = "mahalanobis", threshold = NULL, ID = NULL, ...)

## S3 method for class 'performance_simres'
check_outliers(
  x,
  type = "default",
  iterations = 100,
  alternative = "two.sided",
  ...
)

Arguments

x

A model, a data.frame, a numeric vector, a performance_simres object (as returned by simulate_residuals()) or a DHARMa object.

...

When method = "ics", further arguments in ... are passed down to ICSOutlier::ics.outlier(). When method = "mahalanobis", they are passed down to stats::mahalanobis(). percentage_central can be specified when method = "mcd". For objects of class performance_simres or DHARMa, further arguments are passed down to DHARMa::testOutliers().

method

The outlier detection method(s). Can be "all" or some of "cook", "pareto", "zscore", "zscore_robust", "iqr", "ci", "eti", "hdi", "bci", "mahalanobis", "mahalanobis_robust", "mcd", "ics", "optics" or "lof".

threshold

A list containing the threshold values for each method (e.g. list('mahalanobis' = 7, 'cook' = 1)), above which an observation is considered an outlier. If NULL, default values will be used (see 'Details'). If a single numeric value is given, it will be used as the threshold for every method that is run.

ID

Optional, to report an ID column along with the row number.

verbose

Toggle warnings.

type

Type of method to test for outliers. Can be one of "default", "binomial" or "bootstrap". Only applies when x is an object returned by simulate_residuals() or of class DHARMa. See 'Details' in ?DHARMa::testOutliers for a detailed description of the types.

iterations

Number of simulations to run.

alternative

A character string specifying the alternative hypothesis.

Details

Outliers can be defined as particularly influential observations. Most methods rely on the computation of some distance metric, and observations whose distance exceeds a certain threshold are considered outliers. Importantly, outlier detection methods are meant to provide information for the researcher to consider, rather than to be an automated procedure whose mindless application is a substitute for thinking.
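The "distance plus threshold" logic can be illustrated with base R alone. This is a sketch using the Mahalanobis distance and a chi-squared cutoff on an arbitrary subset of mtcars; it is not the package's own implementation, and the choice of columns is an assumption made for the example.

```r
# Numeric columns only (illustrative subset of a built-in dataset)
x <- mtcars[, c("mpg", "disp", "hp")]

# Squared Mahalanobis distance of each row from the column means
d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))

# Chi-squared cutoff with df = number of variables
cutoff <- stats::qchisq(p = 1 - 0.001, df = ncol(x))

# Rows whose distance exceeds the cutoff are candidate outliers
flagged <- d2 > cutoff
which(flagged)
```

Whatever rows are flagged should then be inspected by the researcher rather than dropped automatically.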

An example sentence for reporting the usage of the composite method could be:

"Based on a composite outlier score (see the 'check_outliers' function in the 'performance' R package; Lüdecke et al., 2021) obtained via the joint application of multiple outliers detection algorithms (Z-scores, Iglewicz, 1993; Interquartile range (IQR); Mahalanobis distance, Cabana, 2019; Robust Mahalanobis distance, Gnanadesikan and Kettenring, 1972; Minimum Covariance Determinant, Leys et al., 2018; Invariant Coordinate Selection, Archimbaud et al., 2018; OPTICS, Ankerst et al., 1999; Isolation Forest, Liu et al. 2008; and Local Outlier Factor, Breunig et al., 2000), we excluded n participants that were classified as outliers by at least half of the methods used."

Value

A logical vector of the detected outliers with a nice printing method: a check (message) on whether outliers were detected or not. The information on the distance measure and whether or not an observation is considered an outlier can be recovered with the as.data.frame function. Note that the function will (silently) return a vector of FALSE for non-supported data types such as character strings.

Model-specific methods

Univariate methods

Multivariate methods

Methods for simulated residuals

The approach for detecting outliers based on simulated residuals differs from the traditional methods and may not detect outliers as expected. Literally, this approach compares observed to simulated values. However, we do not know the deviation of the observed data from the model expectation, and thus the term "outlier" should be taken with a grain of salt. It refers to "simulation outliers". Basically, the comparison tests whether an observed data point is outside the simulated range. It is strongly recommended to read the related documentation in the DHARMa package, e.g. ?DHARMa::testOutliers.
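The idea of a "simulation outlier" can be sketched in base R as follows. This is only an illustration of comparing observed values to a simulated range, under the assumption of a simple Poisson model with made-up data; it is not how DHARMa or simulate_residuals() are implemented.

```r
set.seed(123)

# Made-up data and a simple Poisson model (illustrative only)
x <- runif(50)
y <- rpois(50, exp(0.5 + x))
fit <- glm(y ~ x, family = poisson())

# Simulate 100 new response vectors from the fitted model
sims <- simulate(fit, nsim = 100)

# Range of simulated values for each observation
sim_min <- apply(sims, 1, min)
sim_max <- apply(sims, 1, max)

# An observation is a "simulation outlier" if it falls outside that range
outside <- y < sim_min | y > sim_max
sum(outside)
```

With more simulations the simulated range widens, so fewer observations fall outside it; this is one reason the result depends on the number of iterations.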

Threshold specification

Default thresholds are currently specified as follows:

list(
  zscore = stats::qnorm(p = 1 - 0.001 / 2),
  zscore_robust = stats::qnorm(p = 1 - 0.001 / 2),
  iqr = 1.7,
  ci = 1 - 0.001,
  eti = 1 - 0.001,
  hdi = 1 - 0.001,
  bci = 1 - 0.001,
  cook = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  pareto = 0.7,
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mahalanobis_robust = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  mcd = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  ics = 0.001,
  optics = 2 * ncol(x),
  lof = 0.001
)
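To get a feel for the magnitude of these defaults, some of the expressions above can be evaluated in base R for a concrete data set. The snippet below assumes that `x` is the mtcars data frame (32 rows, 11 columns) and only evaluates the expressions; it does not call the package.

```r
x <- mtcars # illustrative data set: 32 rows, 11 columns

thresholds <- list(
  zscore      = stats::qnorm(p = 1 - 0.001 / 2),
  mahalanobis = stats::qchisq(p = 1 - 0.001, df = ncol(x)),
  cook        = stats::qf(0.5, ncol(x), nrow(x) - ncol(x)),
  optics      = 2 * ncol(x)
)

# Show the numeric values side by side
round(unlist(thresholds), 3)
```

The z-score default corresponds to roughly 3.29 standard deviations, and thresholds that depend on ncol(x) grow with the number of variables.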

Meta-analysis models

For meta-analysis models (e.g. objects of class rma from the metafor package or metagen from package meta), studies are defined as outliers when their confidence interval lies outside the confidence interval of the pooled effect.
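The rule can be sketched in base R with made-up numbers. The data frame `studies` and the pooled interval are hypothetical, and "lies outside" is read here as "does not overlap with the pooled interval", which is an assumption about the exact rule.

```r
# Hypothetical study-level confidence intervals (made-up numbers)
studies <- data.frame(
  study   = c("A", "B", "C", "D"),
  ci_low  = c(0.10, 0.25, 0.90, -0.60),
  ci_high = c(0.50, 0.70, 1.40, -0.10)
)

# Hypothetical confidence interval of the pooled effect
pooled_low  <- 0.20
pooled_high <- 0.60

# A study is flagged when its interval lies entirely outside the pooled interval
studies$outlier <- studies$ci_high < pooled_low | studies$ci_low > pooled_high
studies
```

Here studies C and D are flagged, since their intervals lie entirely above and entirely below the pooled interval, respectively.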

Note

There is also a plot()-method implemented in the see-package. Please note that the range of the distance-values along the y-axis is re-scaled to range from 0 to 1.

References

See Also

Other functions to check model assumptions and assess model quality: check_autocorrelation(), check_collinearity(), check_convergence(), check_heteroscedasticity(), check_homogeneity(), check_model(), check_overdispersion(), check_predictions(), check_singularity(), check_zeroinflation()

Examples

data <- mtcars # Size nrow(data) = 32

# For single variables ------------------------------------------------------
outliers_list <- check_outliers(data$mpg) # Find outliers
outliers_list # Show the row index of the outliers
as.numeric(outliers_list) # The object is a binary vector...
filtered_data <- data[!outliers_list, ] # And can be used to filter a dataframe
nrow(filtered_data) # New size after dropping any flagged rows

# Find all observations beyond +/- 2 SD
check_outliers(data$mpg, method = "zscore", threshold = 2)

# For dataframes ------------------------------------------------------
check_outliers(data) # It works the same way on dataframes

# You can also use multiple methods at once
outliers_list <- check_outliers(data, method = c(
  "mahalanobis",
  "iqr",
  "zscore"
))
outliers_list

# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
outliers_info$Outlier # Including the composite outlier score

# And we can be more stringent in our outliers removal process
filtered_data <- data[outliers_info$Outlier < 0.1, ]

# We can run the function stratified by groups using `{datawizard}` package:
group_iris <- datawizard::data_group(iris, "Species")
check_outliers(group_iris)



# You can also run all the methods
check_outliers(data, method = "all", verbose = FALSE)

# For statistical models ---------------------------------------------
# select only mpg, disp and hp (continuous)
mt1 <- mtcars[, c(1, 3, 4)]
# create some fake outliers and attach outliers to main df
mt2 <- rbind(mt1, data.frame(
  mpg = c(37, 40), disp = c(300, 400),
  hp = c(110, 120)
))
# fit model with outliers
model <- lm(disp ~ mpg + hp, data = mt2)

outliers_list <- check_outliers(model)
plot(outliers_list)

insight::get_data(model)[outliers_list, ] # Show outliers data



[Package performance version 0.11.0 Index]