acc_multivariate_outlier {dataquieR} | R Documentation |
Calculate and plot Mahalanobis distances
Description
A standard tool for detecting multivariate outliers is the Mahalanobis distance. It helps to judge the plausibility of a measurement given the value of another. In this approach the Mahalanobis distance is itself treated as a univariate measure, and the same rules are applied to identify outliers as for univariate outliers:
the classical approach from Tukey: 1.5 * IQR from the 1st (Q25) or 3rd (Q75) quartile.
the 3SD approach, i.e. any measurement of the Mahalanobis distance not in the interval of mean ± 3 * SD is considered an outlier.
the approach from Hubert for skewed distributions, which is embedded in the R package robustbase.
a completely heuristic approach named σ-gap.
For further details, please see the vignette on univariate outliers.
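The first two rules can be illustrated in base R. This is a minimal sketch on a made-up vector of distances (`d` is hypothetical), not the package's own code. The Hubert rule and the σ-gap rule are left out because their exact definitions are not given on this page.

```r
# Hypothetical vector standing in for Mahalanobis distances;
# the last value is deliberately extreme.
d <- c(0.8, 1.1, 1.3, 0.9, 1.6, 1.2, 1.0, 1.4, 0.7, 1.5, 1.1, 9.0)

# Tukey: flag values more than 1.5 * IQR below Q25 or above Q75.
q <- quantile(d, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
tukey_flag <- as.integer(d < q[1] - 1.5 * iqr | d > q[2] + 1.5 * iqr)

# 3SD: flag values outside mean +/- 3 standard deviations.
m <- mean(d)
s <- sd(d)
sd3_flag <- as.integer(d < m - 3 * s | d > m + 3 * s)

# One row per value, coded 0 (no outlier) or 1 (outlier) per rule.
rule_flags <- data.frame(d = d, tukey = tukey_flag, sd3 = sd3_flag)
```

Note that `quantile()` uses its default quantile type here. Whether the package uses the same quantile definition is an assumption.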
Usage
acc_multivariate_outlier(
  variable_group = NULL,
  id_vars = NULL,
  label_col,
  n_rules = 4,
  max_non_outliers_plot = 10000,
  criteria = c("tukey", "3sd", "hubert", "sigmagap"),
  study_data,
  meta_data
)
Arguments
variable_group | variable list. The names of the continuous measurement variables that form a group for which multivariate outlier detection makes sense.
id_vars | variable. Optional: an ID variable of the study data. If not specified, row numbers are used.
label_col | variable attribute. The name of the column in the metadata that holds the variable labels.
n_rules | numeric from=1 to=4. The number of rules that must be violated for an observation to be classified as an outlier.
max_non_outliers_plot | integer from=0. Maximum number of non-outlier points to be plotted. If more points exist, only a subsample is plotted. Note that the sampling is not deterministic.
criteria | set of tukey, 3SD, hubert, sigmagap. A vector of the methods to be used for detecting outliers.
study_data | data.frame. The data frame that contains the measurements.
meta_data | data.frame. The data frame that contains the metadata attributes of the study data.
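The effect of max_non_outliers_plot can be pictured with a small base R sketch. The inputs below (`is_outlier`, the limit of 10) are hypothetical, and keeping every outlier while subsampling only the non-outliers is an assumption about the behaviour, not a description of the package internals.

```r
# Hypothetical inputs: an outlier indicator and a plotting limit.
is_outlier <- c(rep(FALSE, 50), rep(TRUE, 3))
max_non_outliers_plot <- 10

non_outlier_rows <- which(!is_outlier)
if (length(non_outlier_rows) > max_non_outliers_plot) {
  # Random subsample; the selected rows differ between runs.
  non_outlier_rows <- sample(non_outlier_rows, max_non_outliers_plot)
}

# All outliers plus at most max_non_outliers_plot other rows.
rows_to_plot <- sort(c(which(is_outlier), non_outlier_rows))
```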
Value
a list with:
- SummaryTable: data.frame underlying the plot
- SummaryPlot: ggplot2 outlier plot
- FlaggedStudyData: data.frame containing the original data frame with the additional columns tukey, 3SD, hubert, and sigmagap. Every observation is coded 0 if no outlier was detected in the respective column and 1 if an outlier was detected. This can be used to exclude observations with outliers.
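As an illustration of how the 0/1 columns might be used to exclude observations, here is a sketch on a hand-made data frame. The object `flagged` is a hypothetical stand-in for FlaggedStudyData, and the cut-off of two violated rules is chosen only for the example.

```r
# Hypothetical stand-in for the FlaggedStudyData element of the result.
# check.names = FALSE keeps the non-syntactic column name "3SD".
flagged <- data.frame(
  id = 1:5,
  tukey = c(0, 1, 0, 1, 0),
  `3SD` = c(0, 1, 0, 0, 0),
  hubert = c(0, 1, 0, 1, 0),
  sigmagap = c(0, 1, 0, 0, 0),
  check.names = FALSE
)

# Count how many rules each observation violates.
rule_cols <- c("tukey", "3SD", "hubert", "sigmagap")
n_violated <- rowSums(flagged[, rule_cols])

# Keep observations that violate fewer than two rules.
cleaned <- flagged[n_violated < 2, ]
```

Because `3SD` does not start with a letter, it has to be addressed as a string or in backticks, for example ``flagged$`3SD` ``.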
ALGORITHM OF THIS IMPLEMENTATION:
Implementation is restricted to variables of type float
Remove missing codes from the study data (if defined in the metadata)
The covariance matrix is estimated for all variables from variable_group
The Mahalanobis distance of each observation is calculated
The four rules mentioned above are applied to this distance for each observation in the study data
An output data frame is generated that flags each outlier
A parallel coordinate plot indicates respective outliers
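The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the data are simulated, complete cases stand in for the removal of missing codes, the classical covariance estimate from `cov()` is used, the distance is taken as the square root of what `stats::mahalanobis()` returns, and only the upper Tukey fence is applied as a single example rule.

```r
set.seed(1)  # only to make this illustration reproducible

# Hypothetical study data: two correlated float variables.
x <- rnorm(200)
dat <- data.frame(v1 = x, v2 = x + rnorm(200, sd = 0.3))
# Each value is plausible on its own, the combination is not.
dat[201, ] <- c(2, -2)

# Keep complete cases only (stands in for removing missing codes).
vars <- dat[complete.cases(dat), c("v1", "v2")]

# Covariance matrix and Mahalanobis distance per observation.
# stats::mahalanobis() returns the squared distance, hence sqrt().
md <- sqrt(mahalanobis(vars, center = colMeans(vars), cov = cov(vars)))

# Apply one rule (upper Tukey fence) to the distances and flag observations.
q <- quantile(md, probs = c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])
flags <- data.frame(vars, MD = md, tukey = as.integer(md > upper_fence))
```

Neither coordinate of the added observation is extreme by itself, yet its Mahalanobis distance is large because it contradicts the strong positive correlation between `v1` and `v2`.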
List function.