OutlierPCDist {rrcovHD} | R Documentation |
Outlier identification in high dimensions using the PCDIST algorithm
Description
The function implements a simple, automatic outlier detection method suitable for high-dimensional data. It treats each class independently and uses a statistically principled threshold for outliers, so it can detect both mislabeled and abnormal samples without reference to other classes.
Usage
OutlierPCDist(x, ...)
## Default S3 method:
OutlierPCDist(x, grouping, control, k, explvar, trace=FALSE, ...)
## S3 method for class 'formula'
OutlierPCDist(formula, data, ..., subset, na.action)
Arguments
formula |
a formula with no response variable, referring only to numeric variables. |
data |
an optional data frame (or similar: see
|
subset |
an optional vector used to select rows (observations) of the data matrix. |
na.action |
a function which indicates what should happen when the data contain missing values (NAs). |
... |
arguments passed to or from other methods. |
x |
a matrix or data frame. |
grouping |
grouping variable: a factor specifying the class for each observation. |
control |
a control object (S4) for one of the available control classes,
e.g. |
k |
Number of components to select for PCA. If missing, the number of components will be calculated automatically. |
explvar |
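Minimal explained variance to be used for calculation of the number of components in PCA. If it is not provided (and k is missing), the number of components is chosen by automatic dimensionality selection using profile likelihood. |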
trace |
whether to print intermediate results. Default is FALSE. |
Details
If the data set consists of two or more classes (specified by the grouping variable grouping), the proposed method iterates through the classes present in the data, separates each class from the rest, and identifies the outliers relative to this class. Both types of outliers, the mislabeled and the abnormal samples, are thus treated in a homogeneous way.
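The per-class iteration described above can be sketched in base R. This is only an illustration of the control flow, not the code used by OutlierPCDist: the helper names flag_one_class and flag_by_class are hypothetical, and the detector is a stand-in that uses classical Mahalanobis distances with an assumed 0.975 chi-square quantile as cutoff.

```r
# Hypothetical stand-in detector: classical squared Mahalanobis distances
# within one class, compared against a chi-square quantile (assumed cutoff).
flag_one_class <- function(xc, prob = 0.975) {
  d2 <- mahalanobis(xc, center = colMeans(xc), cov = cov(xc))
  d2 > qchisq(prob, df = ncol(xc))   # TRUE = flagged as outlier
}

# Iterate over the classes and judge each observation only against
# the other members of its own class.
flag_by_class <- function(x, grouping, detect = flag_one_class) {
  x <- as.matrix(x)
  grouping <- as.factor(grouping)
  outlier <- logical(nrow(x))
  for (cl in levels(grouping)) {
    idx <- which(grouping == cl)     # rows belonging to this class
    if (length(idx) < 2) next        # nothing to estimate from
    outlier[idx] <- detect(x[idx, , drop = FALSE])
  }
  outlier
}

outlier <- flag_by_class(iris[, 1:4], iris$Species)
table(iris$Species, outlier)
```

Passing the detector as an argument keeps the class loop separate from the choice of distance and cutoff, so a different per-class rule can be swapped in without touching the iteration.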
The first step of the algorithm is dimensionality reduction using (classical) PCA. The number of components can be provided by the user. If it is missing, it is determined either from the provided minimal explained variance or by automatic dimensionality selection using profile likelihood, as proposed by Zhu and Ghodsi (2006).
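The two ways of choosing the number of components can be illustrated with a minimal base R sketch. The function names select_k_explvar and select_k_profile are hypothetical, and the exact rules used inside OutlierPCDist may differ. The sketch assumes that the explained-variance rule takes the smallest k whose cumulative proportion of variance reaches explvar, and that the profile likelihood follows the two-group Gaussian model with a common variance from Zhu and Ghodsi (2006), restricted to splits that leave both groups non-empty.

```r
# Assumed rule: smallest k whose cumulative explained variance >= explvar.
select_k_explvar <- function(sdev, explvar = 0.95) {
  cum <- cumsum(sdev^2) / sum(sdev^2)
  which(cum >= explvar)[1]
}

# Profile likelihood over the split point q of the ordered eigenvalues d:
# d[1:q] and d[(q+1):p] are modeled as two Gaussian groups with separate
# means and a pooled variance. Requires at least three eigenvalues.
select_k_profile <- function(d) {
  d <- sort(d, decreasing = TRUE)
  p <- length(d)
  stopifnot(p >= 3)
  loglik <- sapply(seq_len(p - 1), function(q) {
    g1 <- d[1:q]
    g2 <- d[(q + 1):p]
    ss <- sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2)
    sigma <- sqrt(ss / (p - 2))      # pooled standard deviation
    sum(dnorm(g1, mean(g1), sigma, log = TRUE)) +
      sum(dnorm(g2, mean(g2), sigma, log = TRUE))
  })
  which.max(loglik)
}

pc <- prcomp(iris[, 1:4])            # classical PCA
k_var <- select_k_explvar(pc$sdev, explvar = 0.95)
k_pl  <- select_k_profile(pc$sdev^2)
c(explained_variance = k_var, profile_likelihood = k_pl)
```

The explained-variance rule needs a user-chosen threshold, whereas the profile likelihood looks for the largest "gap" in the scree plot and needs no tuning, which is presumably why it serves as the fully automatic fallback.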
Value
An S4 object of class OutlierPCDist, which is a subclass of the virtual class Outlier.
Author(s)
Valentin Todorov valentin.todorov@chello.at
References
A.D. Shieh and Y.S. Hung (2009). Detecting outlier samples in microarray data. Statistical Applications in Genetics and Molecular Biology, 8.
M. Zhu and A. Ghodsi (2006). Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics & Data Analysis, 51, pp. 918–930.
P. Filzmoser and V. Todorov (2013). Robust tools for the imperfect world. Information Sciences, 245, pp. 4–20. doi:10.1016/j.ins.2012.10.017.
See Also
Examples
library(rrcovHD)            # attaches rrcov as well, which ships the hemophilia data
data(hemophilia)
obj <- OutlierPCDist(gr ~ ., data = hemophilia)
obj
getDistance(obj)            # returns an array of distances
getClassLabels(obj, 1)      # returns an array of indices for a given class
getCutoff(obj)              # returns an array of cutoff values (for each class, usually equal)
getFlag(obj)                # returns a 0/1 array of flags
plot(obj, class = 2)        # standard plot function