acc_multivariate_outlier {dataquieR} | R Documentation |
Calculate and plot Mahalanobis distances
Description
A standard tool to detect multivariate outliers is the Mahalanobis distance. This approach is very helpful for judging the plausibility of a measurement given the value of another. In this approach the Mahalanobis distance is itself treated as a univariate measure. We apply the same rules for the identification of outliers as for univariate outliers:
- the classical approach from Tukey: 1.5 * IQR from the 1st (Q_{25}) or 3rd (Q_{75}) quartile.
- the 3SD approach, i.e. any measurement of the Mahalanobis distance not in the interval of \bar{x} \pm 3*\sigma is considered an outlier.
- the approach from Hubert for skewed distributions, which is embedded in the R package robustbase.
- a completely heuristic approach named \sigma-gap.
For further details, please see the vignette for univariate outliers.
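The Tukey and 3SD rules listed above can be sketched in base R as follows. This is only an illustration on an invented vector of distances; the helper names `flag_tukey` and `flag_3sd` are hypothetical and not part of dataquieR, and the package's own implementation may differ in detail. The Hubert and \sigma-gap rules are not shown.

```r
# Toy vector standing in for Mahalanobis distances (made-up numbers),
# with one obviously large value at the end.
d <- c(rep(c(0.8, 1.0, 1.2, 1.4), 5), 9.5)

# Tukey rule: flag values outside [Q25 - 1.5 * IQR, Q75 + 1.5 * IQR].
flag_tukey <- function(x) {
  q <- stats::quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  as.integer(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# 3SD rule: flag values outside mean +/- 3 standard deviations.
flag_3sd <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- stats::sd(x, na.rm = TRUE)
  as.integer(x < m - 3 * s | x > m + 3 * s)
}

tukey <- flag_tukey(d)     # 0/1 coding, 1 = outlier under the Tukey rule
three_sd <- flag_3sd(d)    # 0/1 coding, 1 = outlier under the 3SD rule
```

Both helpers return the same 0/1 coding that the Value section describes for the flag columns.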
Usage
acc_multivariate_outlier(
  variable_group = NULL,
  id_vars = NULL,
  label_col,
  n_rules = 4,
  max_non_outliers_plot = 10000,
  criteria = c("tukey", "3sd", "hubert", "sigmagap"),
  study_data,
  meta_data
)
Arguments
variable_group |
variable list the names of the continuous measurement variables forming a group for which multivariate outlier detection makes sense. |
id_vars |
variable optional, an ID variable of the study data. If not specified, row numbers are used. |
label_col |
variable attribute the name of the column in the metadata with labels of variables |
n_rules |
numeric from=1 to=4. The number of rules that must be violated for an observation to be classified as an outlier. |
max_non_outliers_plot |
integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample will be plotted. Note that sampling is not deterministic. |
criteria |
set tukey | 3SD | hubert | sigmagap. A vector with the methods to be used for detecting outliers. |
study_data |
data.frame the data frame that contains the measurements |
meta_data |
data.frame the data frame that contains metadata attributes of study data |
Value
a list with:
- SummaryTable: data.frame underlying the plot
- SummaryPlot: ggplot2 outlier plot
- FlaggedStudyData: data.frame containing the original data frame with the additional columns tukey, 3SD, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
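As an illustration of how the 0/1 flag columns might be used to exclude observations, the following base R sketch works on a small mock data frame. The data, the `id` and `SBP` columns, and the object names are invented for the example; only the four flag column names come from the description above. The threshold of four violated rules mirrors the default `n_rules = 4` and is an assumption about how one might want to filter.

```r
# Mock stand-in for the FlaggedStudyData element; all values are invented.
# check.names = FALSE keeps the non-syntactic column name "3SD" as is.
flagged <- data.frame(
  id       = 1:5,
  SBP      = c(120, 135, 118, 210, 125),  # hypothetical measurement
  tukey    = c(0, 0, 0, 1, 0),
  `3SD`    = c(0, 0, 0, 1, 0),
  hubert   = c(0, 1, 0, 1, 0),
  sigmagap = c(0, 0, 0, 1, 0),
  check.names = FALSE
)

rule_cols <- c("tukey", "3SD", "hubert", "sigmagap")

# Number of rules each observation violates.
n_violated <- rowSums(flagged[, rule_cols])

# Keep observations that violate fewer than four rules.
cleaned <- flagged[n_violated < 4, ]
```

Lowering the threshold (for example `n_violated < 1`) would drop every observation flagged by any rule, which is a stricter choice.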
ALGORITHM OF THIS IMPLEMENTATION:
Implementation is restricted to variables of type float
Remove missing codes from the study data (if defined in the metadata)
The covariance matrix is estimated for all variables from variable_group
The Mahalanobis distance of each observation is calculated
MD^2_i = (x_i - \mu)^T \Sigma^{-1} (x_i - \mu)
The four rules mentioned above are applied to this distance for each observation in the study data
An output data frame is generated that flags each outlier
A parallel coordinate plot indicates respective outliers
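The distance step above can be sketched in base R. The example below uses simulated data with hypothetical variables `a` and `b`, computes the squared distance directly from the formula, and compares it with `stats::mahalanobis()`, which also returns squared distances. It is a minimal sketch under these assumptions, not the package's internal code; in particular, whether the package applies the rules to the squared distance or its square root is not stated here.

```r
set.seed(1)

# Simulated stand-in for the variables in variable_group (hypothetical names).
x <- cbind(a = rnorm(50), b = rnorm(50))
x <- rbind(x, c(6, -6))  # one planted extreme observation (row 51)

# Mean vector and covariance matrix estimated from the data.
mu <- colMeans(x)
sigma <- stats::cov(x)

# Squared distance from the formula: (x_i - mu)^T Sigma^{-1} (x_i - mu).
centered <- sweep(x, 2, mu)
md2_manual <- rowSums((centered %*% solve(sigma)) * centered)

# Same quantity via the base R helper.
md2 <- stats::mahalanobis(x, center = mu, cov = sigma)

# Index of the observation with the largest distance.
most_extreme <- which.max(md2)
```

The resulting vector `md2` is a univariate measure, so rules such as Tukey or 3SD can then be applied to it like to any other numeric variable.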
List function.