InitClust {otrimle}    R Documentation

Robust Initialization for Model-based Clustering Methods

Description

Computes the initial cluster assignment by combining nearest-neighbor-based noise detection with agglomerative hierarchical clustering based on maximum likelihood criteria for Gaussian mixture models.

Usage

 InitClust(data, G, k = 3, knnd.trim = 0.5, modelName = 'VVV')
 

Arguments

data

A numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables. Categorical variables and NA values are not allowed.

G

An integer specifying the number of clusters.

k

An integer specifying the number of considered nearest neighbors per point used for the denoising step (see Details).

knnd.trim

A number in [0,1) which defines the proportion of points initialized as noise. Typically knnd.trim <= 0.5 (see Details).

modelName

A character string indicating the covariance model to be used. Possible models are:
"E": equal variance (one-dimensional)
"V": spherical, variable variance (one-dimensional)
"EII": spherical, equal volume
"VII": spherical, unequal volume
"EEE": ellipsoidal, equal volume, shape, and orientation
"VVV": ellipsoidal, varying volume, shape, and orientation (default).
See Details.

Details

The initialization is based on Coretto and Hennig (2017). First, two steps are performed:

Step 1 (denoising step): for each data point, compute the distance to its kth nearest neighbor (k-NND). All points with k-NND larger than the (1-knnd.trim)-quantile of the k-NND values are initialized as noise. The interpretation of k is that (k-1), but not k, points close together may still be interpreted as noise or outliers.

Step 2 (clustering step): perform the model-based hierarchical clustering (MBHC) proposed in Fraley (1998). This step is performed using hc, and the input argument modelName is passed to hc. See the Details section of hc for more information.
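The two steps above can be sketched in a few lines of R. This is an illustration only, not the code used inside InitClust: the toy data, the variable names (knnd, cutoff, is_noise), the choice to exclude each point from its own neighbor list, and the choice to run hc only on the points not flagged as noise are all assumptions made for the example. Step 2 is attempted only if the mclust package, which provides hc and hclass, is installed.

```r
## Sketch only: not the otrimle source code
set.seed(1)

## Toy data: two Gaussian clusters (50 points each) plus 10 scattered points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2),
           matrix(runif(20, min = -10, max = 16), ncol = 2))
k <- 3
knnd.trim <- 0.5

## Step 1: distance from each point to its k-th nearest neighbor.
## Position 1 of each sorted row is the point itself (distance 0),
## so the k-th nearest other point sits at position k + 1 (assumption).
D <- as.matrix(dist(x))
knnd <- apply(D, 1, function(d) sort(d)[k + 1])

## Points whose k-NND exceeds the (1 - knnd.trim)-quantile start as noise
cutoff <- quantile(knnd, probs = 1 - knnd.trim)
is_noise <- knnd > cutoff

## Step 2: model-based hierarchical clustering on the remaining points
## (assumption: only non-noise points are clustered); 0 denotes noise
init <- integer(nrow(x))
if (requireNamespace("mclust", quietly = TRUE)) {
  tree <- mclust::hc(x[!is_noise, , drop = FALSE], modelName = "VVV")
  init[!is_noise] <- as.integer(mclust::hclass(tree, G = 2))
}
```

With knnd.trim = 0.5 the cutoff is the median k-NND, so about half of the points start as noise; a smaller value keeps more points for the clustering step.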

If Step 2 fails to provide G clusters, each containing at least 2 distinct data points, it is replaced with classical hierarchical clustering as implemented in hclust. Finally, if hclust fails to provide a valid partition, up to ten random partitions are tried.
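The fallback chain described above might look roughly like the following sketch, which uses base R only. The helper is_valid_partition is hypothetical and not part of otrimle; the linkage method (the hclust default) and the way random partitions are drawn are likewise assumptions, since neither is specified here.

```r
## Hypothetical helper (not part of otrimle): TRUE if `cl` splits the rows
## of `x` into exactly G clusters, each with at least 2 distinct points
is_valid_partition <- function(x, cl, G) {
  labels <- unique(cl)
  if (length(labels) != G) return(FALSE)
  all(vapply(labels, function(g) {
    nrow(unique(x[cl == g, , drop = FALSE])) >= 2
  }, logical(1)))
}

set.seed(2)
## Toy data: two Gaussian clusters of 20 points each
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
G <- 2

## Fallback 1: classical hierarchical clustering, cut into G groups
cl <- cutree(hclust(dist(x)), k = G)

## Fallback 2: if that partition is not valid, try up to ten random ones
tries <- 0
while (!is_valid_partition(x, cl, G) && tries < 10) {
  cl <- sample(rep_len(seq_len(G), nrow(x)))
  tries <- tries + 1
}
```

Counting distinct rows rather than rows guards against a cluster made up of repeated copies of a single observation.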

Value

An integer vector specifying the initial cluster assignment with 0 denoting noise/outliers.
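As a small illustration of how a vector of this form can be inspected, the sketch below uses a made-up assignment vector rather than actual InitClust output; the variable names are chosen for the example only.

```r
## Made-up assignment vector standing in for InitClust output
init <- c(1L, 1L, 0L, 2L, 2L, 0L, 1L, 2L, 0L, 1L)

## Logical mask of the points initialized as noise (label 0)
noise <- init == 0

## Cluster sizes, with label 0 counting the noise points
sizes <- table(init)

## Realized proportion of points initialized as noise
noise_prop <- mean(noise)

## Row indices grouped by label
members <- split(seq_along(init), init)
```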

References

Fraley, C. (1998). Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal on Scientific Computing 20:270-281.

Coretto, P. and Hennig, C. (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research 18(142):1-39. https://jmlr.org/papers/v18/16-382.html

Author(s)

Pietro Coretto pcoretto@unisa.it https://pietro-coretto.github.io

See Also

hc

Examples

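 ## Load the package and the Swiss banknotes data
 library(otrimle)
 data(banknote)
 ## Drop the first column so that only the numeric measurements remain
 x <- banknote[,-1]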

 ## Initial clusters with default arguments
 init <- InitClust(data = x, G = 2)
 print(init)

 ## Perform otrimle
 a <- otrimle(data = x, G = 2, initial = init,
              logicd = c(-Inf, -50, -10), ncores = 1)
 plot(a, what="clustering", data=x)
 

[Package otrimle version 2.0 Index]