ind_multi {bulkQC}    R Documentation

Identifies individual multivariate outliers

Description

Identifies potential individual multivariate outliers by returning those observations whose outlier score exceeds a threshold. The outlier score is calculated using single isolation forests.

Usage

ind_multi(d0, exclude = c("pid", "site"), thresh = 0.7, n_uniq = 10)

Arguments

d0

A data frame with columns as variables and rows as observations

exclude

A vector of names of variables to exclude in outlier identification

thresh

Threshold (0-1) that an outlier score must exceed to be flagged for further investigation

n_uniq

Number of unique values a variable must have for outlier identification to be performed on it
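
The way 'exclude' and 'n_uniq' together narrow down the evaluated variables can be sketched in base R. This is only an illustration: the helper name eligible_vars is hypothetical and not part of bulkQC, and the use of "at least n_uniq unique values" (rather than "more than") is an assumption about how the cutoff is applied.

```r
# Hypothetical helper, NOT part of bulkQC: which variables would be evaluated?
# Assumption: a variable qualifies when it has at least `n_uniq` unique values.
eligible_vars <- function(d0, exclude = c("pid", "site"), n_uniq = 10) {
  # Drop the variables named in `exclude`
  candidates <- setdiff(names(d0), exclude)
  # Count the unique values in each remaining variable
  n_distinct <- vapply(d0[candidates], function(v) length(unique(v)), integer(1))
  # Keep only variables with enough unique values
  candidates[n_distinct >= n_uniq]
}

eligible_vars(iris, exclude = "Species", n_uniq = 10)
```

Raising n_uniq in this sketch removes low-cardinality variables from the returned vector.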

Details

The function treats each row as one multivariate observation made up of the variables not excluded by the 'exclude' argument above. For each multivariate observation, an outlier score is calculated using single isolation forests. Observations that are isolated early in a decision tree sit at a lower tree depth, receive higher outlier scores, and are considered more likely to be outliers.
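
The link between tree depth and outlier score can be illustrated with a small base R sketch. It is not the bulkQC implementation: the function names isolation_depth and mean_depth are hypothetical, each row gets its own random split paths instead of sharing fitted trees, and the depth-to-score conversion uses the standard isolation-forest normalisation, which is assumed here rather than taken from the package.

```r
# Illustrative sketch only; NOT the bulkQC implementation.
# All function names here are hypothetical.

# Number of random axis-parallel splits needed to isolate row `i` of matrix `x`.
isolation_depth <- function(x, i, max_depth = 50) {
  rows <- seq_len(nrow(x))   # rows currently sharing a node with row i
  depth <- 0
  while (length(rows) > 1 && depth < max_depth) {
    # Only columns that still vary within the node can be split
    ranges <- apply(x[rows, , drop = FALSE], 2, range)
    splittable <- which(ranges[2, ] > ranges[1, ])
    if (length(splittable) == 0) break
    j <- splittable[sample.int(length(splittable), 1)]
    cut <- runif(1, ranges[1, j], ranges[2, j])
    # Follow the side of the split that contains row i
    if (x[i, j] < cut) {
      rows <- rows[x[rows, j] < cut]
    } else {
      rows <- rows[x[rows, j] >= cut]
    }
    depth <- depth + 1
  }
  depth
}

# Average isolation depth of every row over `n_trees` random split paths.
mean_depth <- function(x, n_trees = 100) {
  sapply(seq_len(nrow(x)), function(i) {
    mean(replicate(n_trees, isolation_depth(x, i)))
  })
}

set.seed(1)
# 100 bivariate normal points plus one planted outlier in row 101
x <- rbind(matrix(rnorm(200), ncol = 2), c(8, 8))
depth <- mean_depth(x, n_trees = 50)

# Standard normalisation: score = 2^(-mean depth / c(n)),
# where c(n) approximates the average path length for n points.
n <- nrow(x)
c_n <- 2 * (log(n - 1) + 0.5772156649) - 2 * (n - 1) / n
score <- 2^(-depth / c_n)

which.max(score)      # index of the row with the highest outlier score
which(score > 0.7)    # rows that a threshold of 0.7 would flag
```

Rows that are separated after only a few splits have a small mean depth and therefore a score closer to 1, which is the behaviour the 'thresh' argument relies on.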

Value

nID

The number of observations evaluated

nVar

The number of variables evaluated

data

A data frame containing the observations flagged as potential outliers, with the excluded variables appended to aid interpretation and an outlier score for each row

References

Cortes D. Explainable outlier detection through decision tree conditioning. arXiv:2001.00636 [cs, stat] [Internet]. 2020 Jan 2 [cited 2021 Nov 12]; Available from: http://arxiv.org/abs/2001.00636

Examples

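library(bulkQC)
# Add a row identifier so flagged observations can be traced back
data(iris)
iris2 <- iris
iris2$pid <- seq_len(nrow(iris2))
# Exclude the identifier and the categorical Species column from scoring
ind_multi(iris2, exclude = c("pid", "Species"), thresh = 0.7, n_uniq = 10)
# A lower threshold can flag more observations
ind_multi(iris2, exclude = c("pid", "Species"), thresh = 0.6, n_uniq = 10)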

[Package bulkQC version 1.1 Index]