ClusterFinder1 {compositions} | R Documentation |
Heuristics to find subpopulations of outliers
Description
The ClusterFinder is a heuristic to find subpopulations of outliers essentially by looking for secondary modes in a density estimate.
Usage
ClusterFinder1(X,...)
## S3 method for class 'acomp'
ClusterFinder1(X,...,sigma=0.3,radius=1,asig=1,minGrp=3,
robust=TRUE)
Arguments
X |
the dataset to be clustered |
... |
Further arguments to |
sigma |
numeric: the bandwidth of the density estimation kernel in a robustly Mahalanobis-transformed space (i.e. in the transformed space where the main group has unit variance) |
radius |
the minimum size of a cluster in a robustly Mahalanobis-transformed space (i.e. in the transformed space where the main group has unit variance) |
asig |
a scaling factor for the geometry of the robustly Mahalanobis-transformed space when computing the likelihood that an observation belongs to a group (under a Gaussian assumption). Higher values |
minGrp |
the minimum size of a group to be used. Smaller groups are treated as single outliers |
robust |
A robustness description for estimating the variance of
the main group. FALSE is probably not a useful value. However,
robustness techniques other than mcd might become useful later. |
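The "robustly Mahalanobis transformed space" used by sigma, radius and asig can be pictured with a minimal sketch. It assumes an MCD estimate from MASS::cov.rob and simulated, hypothetical data; it illustrates the idea of the transform only and is not taken from the internals of ClusterFinder1.

```r
library(MASS)  # recommended package shipped with R; provides cov.rob() and mvrnorm()

set.seed(1)
# Hypothetical data: a dominant group plus a small shifted subgroup
S    <- matrix(c(4, 1.5, 1.5, 1), 2)
main <- mvrnorm(200, mu = c(0, 0),  Sigma = S)
sub  <- mvrnorm(10,  mu = c(8, -3), Sigma = S)
Y    <- rbind(main, sub)

# Robust location and scatter of the main group (MCD)
est <- cov.rob(Y, method = "mcd")

# Whitening: with est$cov = t(R) %*% R, the rows of Z have (robust) unit variance
R <- chol(est$cov)
Z <- sweep(Y, 2, est$center) %*% solve(R)

# Squared Euclidean length in Z is the robust squared Mahalanobis distance
d2 <- rowSums(Z^2)
```

In such coordinates a bandwidth of sigma=0.3 corresponds to 0.3 standard deviations of the main group in every direction, which is why the defaults can be stated without reference to the scale of the data.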
Details
See outliersInCompositions for a comprehensive introduction to the
treatment of outliers in compositions.
The ClusterFinder is labeled with a number to make clear that this is
just an implementation of one heuristic and is not based on some eternal
truth. Other heuristics might give better cluster finders.
Unlike other clustering algorithms, the basic model of this
algorithm assumes that there is one dominating subpopulation and an
unknown number of smaller subpopulations with a similar covariance
structure but a different mean. The algorithm thus first estimates the
covariance structure of the main population with a robust location-scale
estimator. Then it uses a simplified (Gaussian) kernel density
estimator to find nonrandom secondary modes. Then it tries to assign
the observations to the different modes according to a discriminant
analysis model. Groups below a given size are considered single
outliers and together form a separate group. In this way the number of
clusters is kept low even if there are many erratic measurements in the
dataset.
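The steps described above can be sketched in base R. This is a simplified reading of the description, not the package's implementation: the coordinates Z are hypothetical and assumed to be already in the robustly transformed space, the mode rule ("no denser observation within radius") is an assumption, and no test for whether a mode is nonrandom is attempted.

```r
set.seed(42)
# Hypothetical coordinates in the transformed space (main group ~ unit variance)
Z <- rbind(matrix(rnorm(400), ncol = 2),               # main group, 200 points
           cbind(rnorm(15, mean = 6), rnorm(15)))      # secondary group, 15 points

sigma <- 0.3; radius <- 1; asig <- 1; minGrp <- 3      # defaults from Usage

D <- as.matrix(dist(Z))                                # pairwise Euclidean distances

# Step 1: Gaussian kernel density at every observation (up to a constant)
dens <- rowMeans(exp(-D^2 / (2 * sigma^2)))

# Step 2 (assumed rule): an observation is a mode if nothing within
# `radius` of it has a higher density
isMax <- sapply(seq_len(nrow(Z)),
                function(i) all(dens[D[i, ] <= radius] <= dens[i]))
modes <- which(isMax)

# Step 3: Gaussian likelihood under each mode, normalised to probabilities
likeli <- exp(-D[, modes, drop = FALSE]^2 / (2 * asig^2))
prob   <- likeli / rowSums(likeli)
groups <- factor(max.col(prob, ties.method = "first"))

# Step 4: groups smaller than minGrp collapse into one outlier level
small <- names(which(table(groups) < minGrp))
types <- as.character(groups)
types[types %in% small] <- "outlier"
types <- factor(types)
table(types)
```

Without a check for nonrandom modes, this sketch can report spurious modes in the tails of the main group; most of them end up in the collapsed outlier level through minGrp.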
The main use of the clusters is descriptive plotting. The advantage of
these clusters over other clustering techniques such as kmeans or hclust
is that the central mass of the data is not torn apart, as those methods
do in order to make the clusters as compact as possible.
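The tearing effect can be reproduced with a small simulation. The data below are hypothetical: one large standard normal group and a tight group of ten points far away. Even when kmeans is asked for the two groups that were simulated, minimising the within-cluster sum of squares favours cutting the large group in half.

```r
set.seed(7)
# Hypothetical data: dominant group of 2000 points, small group of 10 points
main  <- matrix(rnorm(4000), ncol = 2)
sub   <- cbind(rnorm(10, mean = 8, sd = 0.3), rnorm(10, sd = 0.3))
Y     <- rbind(main, sub)
truth <- rep(c("main", "sub"), c(2000, 10))

# k-means with the number of simulated groups
km <- kmeans(Y, centers = 2, nstart = 20)

# Cross-tabulate the k-means clusters against the simulated membership
table(cluster = km$cluster, truth = truth)
```

Splitting the main group lowers the total within-cluster sum of squares far more than isolating the ten distant points, so the small group is lumped together with one half of the main group.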
Value
A list
types |
a factor representing the group assignments, when the small groups are ignored |
typesTbl |
a table giving the number of members in each of these groups |
groups |
a factor representing the found group assignments |
isMax |
a logical vector indicating, for each observation, whether it represents a local maximum in the density estimate |
prob |
the inferred probability of belonging to the different groups, given as an acomp composition |
nmembers |
a table giving the number of members of each group |
density |
the density estimated in each observation location |
likeli |
the inferred likelihood of seeing this observation, for each of the groups |
Author(s)
K.Gerald v.d. Boogaart http://www.stat.boogaart.de
See Also
Examples
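library(compositions)
data(SimulatedAmounts)
# Cluster the simulated outlier dataset with a slightly wider kernel
cl <- ClusterFinder1(sa.outliers5, sigma=0.4, radius=1)
# Colour and plotting symbol both encode the group assignment in cl$types
plot(sa.outliers5, col=as.numeric(cl$types), pch=as.numeric(cl$types))
legend(1, 1, legend=levels(cl$types), xjust=1,
       col=seq_along(levels(cl$types)), pch=seq_along(levels(cl$types)))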