R: Identifies individual multivariate outliers

ind_multi {bulkQC}

R Documentation

Identifies individual multivariate outliers

Description

Discovers potential individual multivariate outliers by identifying and returning those observations whose outlier score exceeds a threshold. The outlier score is calculated using single isolation forests.

Usage

ind_multi(d0, exclude = c("pid", "site"), thresh = 0.7, n_uniq = 10)

Arguments

`d0`	A data frame with columns as variables and rows as observations
`exclude`	A vector of names of variables to exclude from outlier identification
`thresh`	Threshold (0-1) that an outlier score must exceed to be flagged for further investigation
`n_uniq`	Number of unique values a variable must have for outlier identification to be performed on it
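
As a rough illustration of how `exclude` and `n_uniq` might jointly determine which variables are evaluated, the base R sketch below selects the non-excluded columns that have enough distinct values. The helper `candidate_vars` is hypothetical and not part of bulkQC; whether the package compares with `>=` or `>`, and how it treats missing values, are assumptions here.

```r
# Hypothetical helper, not part of bulkQC: keep variables that are not
# excluded and have at least n_uniq distinct non-missing values.
# (Assumption: the comparison is >= and NAs are ignored.)
candidate_vars <- function(d0, exclude = c("pid", "site"), n_uniq = 10) {
  vars <- setdiff(names(d0), exclude)
  n_distinct <- vapply(
    d0[vars],
    function(x) length(unique(x[!is.na(x)])),
    integer(1)
  )
  vars[n_distinct >= n_uniq]
}

# All four numeric iris columns have at least 10 distinct values
candidate_vars(iris, exclude = "Species", n_uniq = 10)
```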

Details

The function evaluates the multivariate observation in each row, made up of the variables not excluded by the 'exclude' argument. For each multivariate observation, an outlier score is calculated using single isolation forests. Observations that are isolated earliest in a decision tree have a lower tree depth and therefore a higher outlier score, and are considered more likely to be outliers.
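
To make the link between tree depth and outlier score concrete, here is a minimal base R sketch of the standard isolation forest score, which maps an observation's average isolation depth to a value between 0 and 1. This page does not state the exact formula bulkQC uses, so treat this as an assumption about the usual formulation; the function names `c_norm` and `iso_score` are hypothetical.

```r
# Expected path length of an unsuccessful search in a binary search tree
# built on n points; used to normalize isolation depths.
c_norm <- function(n) {
  if (n <= 1) return(0)
  if (n == 2) return(1)
  harmonic <- log(n - 1) + 0.5772156649  # approximation of H(n - 1)
  2 * harmonic - 2 * (n - 1) / n
}

# Score in (0, 1): a shallow average depth gives a score near 1.
iso_score <- function(mean_depth, n) {
  2^(-mean_depth / c_norm(n))
}

n <- 256
iso_score(mean_depth = 2, n = n)          # isolated early: high score
iso_score(mean_depth = c_norm(n), n = n)  # typical depth: score of 0.5
iso_score(mean_depth = 20, n = n)         # isolated late: low score
```

Under this formulation, a `thresh` of 0.7 flags only observations that are isolated well before the typical depth.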

Value

`nID`	The number of observations evaluated
`nVar`	The number of variables evaluated
`data`	A data frame of the observations flagged as potential outliers, with the excluded variables appended to aid interpretation and an outlier score for each row
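
The shape of the returned list can be pictured with the base R sketch below, which builds a list with the same three components from a toy data frame and made-up scores. It does not call bulkQC: the helper `flag_rows`, the column name `score`, and the score values are all assumptions for illustration only.

```r
# Hypothetical helper, not part of bulkQC: assemble a result list with the
# components described above from precomputed outlier scores.
flag_rows <- function(d0, scores, exclude, thresh = 0.7) {
  keep <- scores > thresh
  out <- d0[keep, , drop = FALSE]  # flagged rows, excluded columns retained
  out$score <- scores[keep]        # assumed column name for the score
  list(
    nID  = nrow(d0),
    nVar = length(setdiff(names(d0), exclude)),
    data = out
  )
}

toy <- data.frame(
  pid = 1:5,
  x   = c(1.0, 1.1, 0.9, 9.0, 1.2),
  y   = c(2.0, 2.1, 1.9, -5.0, 2.2)
)
toy_scores <- c(0.41, 0.40, 0.42, 0.83, 0.43)  # made-up values, not real output

res <- flag_rows(toy, toy_scores, exclude = "pid", thresh = 0.7)
res$nID   # number of observations evaluated
res$nVar  # number of variables evaluated
res$data  # flagged rows with their scores
```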

References

Cortes D. Explainable outlier detection through decision tree conditioning. arXiv:2001.00636 [cs, stat] [Internet]. 2020 Jan 2 [cited 2021 Nov 12]; Available from: http://arxiv.org/abs/2001.00636

Examples

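library(bulkQC)

# Add a participant ID column to the built-in iris data
data(iris)
iris2 <- iris
iris2$pid <- seq_len(nrow(iris2))

# Flag rows whose outlier score exceeds 0.7, then repeat with a looser threshold
ind_multi(iris2, exclude = c("pid", "Species"), thresh = 0.7, n_uniq = 10)
ind_multi(iris2, exclude = c("pid", "Species"), thresh = 0.6, n_uniq = 10)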

[Package bulkQC version 1.1 Index]