R: Heuristics to find subpopulations of outliers

ClusterFinder1 {compositions}

R Documentation

Heuristics to find subpopulations of outliers

Description

The ClusterFinder is a heuristic to find subpopulations of outliers essentially by looking for secondary modes in a density estimate.

Usage

ClusterFinder1(X,...)
## S3 method for class 'acomp'
ClusterFinder1(X,...,sigma=0.3,radius=1,asig=1,minGrp=3,
                                 robust=TRUE)

Arguments

`X`	the dataset to be clustered
`...`	Further arguments to `MahalanobisDist(X,...,robust=robust,pairwise=TRUE)`
`sigma`	numeric: the bandwidth of the density estimation kernel in the robustly Mahalanobis-transformed space (i.e. in the transformed space where the main group has unit variance)
`radius`	the minimum size of a cluster in the robustly Mahalanobis-transformed space (i.e. in the transformed space where the main group has unit variance)
`asig`	a scaling factor for the geometry of the robustly Mahalanobis-transformed space when computing the likelihood that an observation belongs to a group (under a Gaussian assumption). Higher values
`minGrp`	the minimum size of a group to be used. Smaller groups are treated as single outliers.
`robust`	a robustness description for estimating the variance of the main group. `FALSE` is probably not a useful value. However, robustness techniques other than mcd might become useful later. `TRUE` just picks the default method of the package.

Details

See outliersInCompositions for a comprehensive introduction to outlier treatment in compositions.
The ClusterFinder is labeled with a number to make clear that this is just an implementation of some heuristic and not based on some eternal truth. Others might give better cluster finders.
Unlike other clustering algorithms, the basic model of this algorithm assumes that there is one dominating subpopulation and an unknown number of smaller subpopulations with a similar covariance structure but a different mean. The algorithm thus first estimates the covariance structure of the main population with a robust location-scale estimator. Then it uses a simplified (Gaussian) kernel density estimator to find nonrandom secondary modes. Then it tries to assign the observations to the different modes according to a discriminant analysis model. Groups below a given size are considered single outliers and are collected in a separate group. In this way the number of clusters is kept low even if there are many erratic measurements in the dataset.
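The steps described above can be illustrated with a minimal sketch in plain R. This is not the code used by `ClusterFinder1`: it works on ordinary real-valued data rather than compositions, uses `MASS::cov.rob` as a stand-in robust estimator, and assigns each observation to its nearest mode as a simplified substitute for the discriminant step. The simulated data and all variable names (`main`, `minor`, `is_max`, `min_grp`, ...) are hypothetical.

```r
library(MASS)  # cov.rob: robust covariance, used here as an assumed stand-in

set.seed(1)
# Hypothetical data: one dominating group plus a small shifted subpopulation
main  <- matrix(rnorm(200 * 2), ncol = 2)
minor <- matrix(rnorm(15 * 2, mean = 6), ncol = 2)
x <- rbind(main, minor)

# 1. Robust location and covariance of the main group
rob <- cov.rob(x, method = "mcd")

# 2. Transform so that the main group has (robustly) unit variance,
#    then take pairwise squared distances in that space
z  <- sweep(x, 2, rob$center) %*% solve(chol(rob$cov))
d2 <- as.matrix(dist(z))^2

# 3. Simplified Gaussian kernel density, evaluated at each observation
sigma <- 0.3
dens  <- rowMeans(exp(-d2 / (2 * sigma^2)))

# 4. An observation is a local mode if no observation within `radius`
#    has a higher density
radius <- 1
is_max <- sapply(seq_len(nrow(z)), function(i) {
  dens[i] >= max(dens[d2[i, ] <= radius^2])
})

# 5. Assign every observation to the nearest mode (equal-covariance,
#    equal-prior simplification of a discriminant rule)
modes <- which(is_max)
group <- modes[apply(d2[, modes, drop = FALSE], 1, which.min)]

# 6. Groups smaller than min_grp are pooled as single outliers
min_grp  <- 3
grp_size <- ave(rep(1, length(group)), group, FUN = sum)
type     <- ifelse(grp_size < min_grp, "single", as.character(group))

table(type)
```

With a small bandwidth such as `sigma = 0.3` the density estimate tends to be bumpy, so step 4 may flag many spurious modes in the tails; pooling small groups in step 6 is what keeps the number of reported clusters manageable.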
The main use of the clusters is descriptive plotting. The advantage of these clusters over other clustering techniques such as k-means or hclust is that they do not tear apart the central mass of the data, as those methods do in order to make the clusters as compact as possible.

Value

A list

`types`	a factor representing the group assignments, when the small groups are ignored
`typesTbl`	a table giving the number of members in each of these groups
`groups`	a factor representing the found group assignments
`isMax`	a logical vector indicating, for each observation, whether it represents a local maximum in the density estimate
`prob`	the inferred probability of belonging to the different groups, given as an acomp composition
`nmembers`	a table giving the number of members of each group
`density`	the density estimated at each observation's location
`likeli`	the inferred likelihood of seeing this observation, for each of the groups
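As a brief illustration of how these components might be inspected, the following sketch uses only the component names listed above and the dataset from the Examples section. It assumes the compositions package is installed; no particular output is implied.

```r
library(compositions)

data(SimulatedAmounts)
cl <- ClusterFinder1(sa.outliers5, sigma = 0.4, radius = 1)

# Group sizes when small groups are ignored
print(cl$typesTbl)

# Indices of observations that are local maxima of the density estimate
which(cl$isMax)

# Compare the raw group assignments with the simplified types
table(groups = cl$groups, types = cl$types)
```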

Author(s)

K.Gerald v.d. Boogaart http://www.stat.boogaart.de

Examples

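library(compositions)
data(SimulatedAmounts)
# Find clusters; sigma is the kernel bandwidth, radius the minimum cluster size
cl <- ClusterFinder1(sa.outliers5, sigma=0.4, radius=1)
# Colour and plotting symbol both encode the cluster type of each observation
plot(sa.outliers5, col=as.numeric(cl$types), pch=as.numeric(cl$types))
legend(1, 1, legend=levels(cl$types), xjust=1,
       col=seq_along(levels(cl$types)), pch=seq_along(levels(cl$types)))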

[Package compositions version 2.0-8 Index]