O3prep {OutliersO3} | R Documentation |
Identify outliers for different combinations of variables
Description
Check the dataset and parameters prior to analysis. Identify outliers for the variable combinations and methods/tolerance levels specified. Prepare input for the two plotting functions O3plotT and O3plotM.
Usage
O3prep(data, k1=1, K=ncol(data), method="HDo", tols=0.05, boxplotLimits=c(6, 10, 12),
tolHDo=0.05, tolPCS=0.01, tolBAC=0.001, toladj=0.05, tolDDC=0.01, tolMCD=0.000001)
Arguments
data |
dataset to be checked for outliers |
k1 |
lowest number of variables in a combination |
K |
highest number of variables in a combination |
method |
method(s) used for identifying outliers (up to six can be used) |
tols |
outlier tolerance level(s) when only one method is specified. Up to three can be used. For consistent use of the argument, it is transformed for some of the methods. See details below of how the argument is applied for each approach. |
boxplotLimits |
up to three boxplot limits (matching the number of tolerance levels), used for single variables when a method cannot identify outliers for a single variable. |
tolHDo |
an individual outlier tolerance level for the HDoutliers method. The default in HDoutliers, alpha, is 0.05. |
tolPCS |
an individual outlier tolerance level for the FastPCS method. This equals (1-alpha) for the argument in FastPCS, where the default is 0.5. |
tolBAC |
an individual outlier tolerance level for the mvBACON method. The default for alpha in robustX is 0.95. This seems high, but it is divided by n, the dataset size. |
toladj |
an individual outlier tolerance level for the adjOutlyingness method. This equals (1-alpha.cutoff) for the argument in robustbase, where the default is 0.75. |
tolDDC |
an individual outlier tolerance level for the DDC method. This equals (1-tolProb) for the argument in cellWise, where the default is 0.99. |
tolMCD |
an individual outlier tolerance level for the covMcd method. The default is 0.025 (based on the help page for plot.mcd in robustbase). This is NOT the alpha argument in |
Details
To check outliers for all possible combinations of variables choose k1=1 and K=number of variables in the dataset (the default).
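As a rough illustration of how many combinations the defaults imply, the count for combination sizes k1 to K can be worked out in base R. This is a minimal sketch: `count_combinations` is a hypothetical helper, not part of OutliersO3, and it only assumes that every subset of each size is analysed.

```r
# Count the variable combinations examined for sizes k1..K.
# count_combinations is a hypothetical helper, not an OutliersO3 function.
count_combinations <- function(p, k1 = 1, K = p) {
  sum(choose(p, k1:K))
}

p <- ncol(stackloss)                       # stackloss has 4 variables
n_all   <- count_combinations(p)           # all sizes: 2^p - 1 = 15
n_from2 <- count_combinations(p, k1 = 2)   # pairs and larger: 11
```

The count grows as 2^p, so restricting k1 and K may be worthwhile for datasets with many variables.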
The available methods are "HDo" HDoutliers (from HDoutliers), "PCS" FastPCS (FastPCS), "BAC" mvBACON (robustX), "adjOut" adjOutlyingness (robustbase), "DDC" DDC (cellWise), and "MCD" covMcd (robustbase). References for all these methods can be found on their help pages, listed under See Also. (Note that cellWise has renamed its function DetectDeviatingCells. Since version 2.1.0 DDC is used instead.)
If only one method is specified, then up to three tolerance levels (tols) and three boxplot limits (boxplotLimits) can be specified. If more than one method is specified, then the individual tol* parameters are used.
tol is the argument determining outlyingness and should be set low, as in HDoutliers and mvBACON, where it is called alpha, and in covMcd. For the other methods (1-tol) is used. In DDC the argument is called tolProb.
Using the same tolerance level for all methods does not make them directly comparable, which is why it is recommended to set them individually when drawing a comparative O3 plot. The defaults suggested on the methods' help pages mostly found too many outliers, so other defaults have been set. Users need to decide for themselves, possibly depending on the dataset they are analysing.
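The transformation described above can be sketched as a small lookup. `to_method_arg` is a hypothetical helper written only to illustrate the rule in this section; it is not how O3prep is implemented, and the grouping of methods is taken directly from the text.

```r
# Hypothetical helper: convert an O3prep-style tolerance into the value
# each underlying method is described as using in this section.
to_method_arg <- function(method, tol) {
  unchanged  <- c("HDo", "BAC", "MCD")     # tolerance used as given
  complement <- c("PCS", "adjOut", "DDC")  # (1 - tol) used instead
  if (method %in% unchanged) {
    tol
  } else if (method %in% complement) {
    1 - tol
  } else {
    stop("unknown method: ", method)
  }
}

to_method_arg("HDo", 0.05)  # 0.05, the alpha of HDoutliers
to_method_arg("DDC", 0.01)  # 0.99, the tolProb of DDC
```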
Methods "HDo", "BAC", "adjOut", and "MCD" can analyse single variables. For the other methods, boxplot limits are used for single variables: any case > (Q3 + boxplotLimit*IQR) or < (Q1 - boxplotLimit*IQR) is classed as an outlier, where boxplotLimit is the limit specified.
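The boxplot rule for a single variable can be sketched in base R as follows. `boxplot_outliers` is a hypothetical helper, and it assumes R's default quartile definition in `quantile()`, which may differ from the one the package uses.

```r
# Flag cases outside Q1 - limit*IQR or Q3 + limit*IQR for one variable.
# boxplot_outliers is a hypothetical helper; quantile() uses R's default
# quartile definition, which may not match the package exactly.
boxplot_outliers <- function(x, limit) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  which(x > q[2] + limit * iqr | x < q[1] - limit * iqr)
}

x <- c(stackloss$stack.loss, 200)  # append an artificial extreme value
boxplot_outliers(x, limit = 6)     # 22, the index of the appended value
```

With a limit of 6, none of the original stack.loss values is flagged; only the appended value lies beyond the upper fence.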
Value
data |
the dataset analysed |
nw |
the number of variable combinations analysed |
mm |
the outlier methods used |
tols |
the individual tolerance levels for the outlier methods used (if more than one), otherwise up to 3 tolerance levels used for one method |
outList |
a list for each method/tolerance level, and within that for each variable combination, of the variables used, the indices of cases identified as outliers, and the outlier distances for all cases in the dataset. |
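One way to get a feel for the returned object is to inspect its top-level components. The sketch below assumes the OutliersO3 package (and its dependencies) is installed and only relies on the component names listed above; the internal layout of outList is printed with `str()` rather than assumed.

```r
# Guarded so the sketch does nothing if OutliersO3 is not installed.
if (requireNamespace("OutliersO3", quietly = TRUE)) {
  res <- OutliersO3::O3prep(stackloss, method = "HDo", tols = 0.05,
                            boxplotLimits = 6)
  print(names(res))                # component names as listed above
  print(res$nw)                    # number of variable combinations analysed
  print(res$mm)                    # outlier method(s) used
  str(res$outList, max.level = 2)  # nested list; layout not assumed here
}
```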
Author(s)
Antony Unwin unwin@math.uni-augsburg.de
See Also
HDoutliers in HDoutliers, FastPCS in FastPCS, mvBACON in robustX, adjOutlyingness in robustbase, DDC in cellWise, covMcd in robustbase
Examples
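# One method with a single tolerance level and boxplot limit
a0 <- O3prep(stackloss, method="PCS", tols=0.05, boxplotLimits=3)
# Two methods, for combinations of at least two variables
b0 <- O3prep(stackloss, method=c("BAC", "adjOut"), k1=2, tols=0.01, boxplotLimits=6)
## Not run:
# One method with three tolerance levels and matching boxplot limits
a1 <- O3prep(stackloss, method="PCS", tols=c(0.1, 0.05, 0.01), boxplotLimits=c(3, 6, 10))
# Three methods, each with its own tolerance level
b1 <- O3prep(stackloss, method=c("HDo", "BAC", "DDC"), tolHDo=0.025, tolBAC=0.01, tolDDC=0.05)
## End(Not run)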