R: Hard C-Means Clustering

hcm {ppclust}

R Documentation

Hard C-Means Clustering

Description

Partitions a numeric data set using the Hard C-Means (HCM) clustering algorithm (also known as K-Means), proposed by MacQueen (1967). The function hcm extends the basic kmeans with additional input arguments and output values so that its clustering results are comparable with those of fuzzy and possibilistic algorithms. For instance, in addition to the Euclidean distance, other distance metrics such as the squared Euclidean distance and the squared Chord distance can be used with hcm.

Usage

hcm(x, centers, dmetric="euclidean", pw=2, alginitv="kmpp",  
   nstart=1, iter.max=1000, con.val=1e-9, stand=FALSE, numseed)

Arguments

`x`	a numeric vector, data frame or matrix.
`centers`	an integer specifying the number of clusters or a numeric matrix containing the initial cluster centers.
`dmetric`	a string for the distance metric. The default is euclidean for the Euclidean distance. See `get.dmetrics` for the alternative options.
`pw`	a number for the power of the Minkowski distance. The default is 2. It is used only if `dmetric` is minkowski.
`alginitv`	a string for the initialization of cluster prototypes matrix. The default is kmpp for K-means++ initialization method (Arthur & Vassilvitskii, 2007). For the list of alternative options see `get.algorithms`.
`nstart`	an integer for the number of starts for clustering. The default is 1.
`iter.max`	an integer for the maximum number of iterations allowed. The default is 1000.
`con.val`	a number for the convergence value between the iterations. The default is 1e-09.
`stand`	a logical flag to standardize data. Its default value is `FALSE`. If its value is `TRUE`, the data matrix `x` is standardized.
`numseed`	a seeding number to set the seed of R's random number generator.
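
The K-means++ seeding named as the default for `alginitv` can be illustrated with a minimal base R sketch. It is not the code used by ppclust or inaparc; the function name `kmpp_sketch` and its `seed` argument are hypothetical, and the squared Euclidean distance is assumed.

```r
# Hypothetical helper (not part of ppclust or inaparc): K-means++ style seeding
kmpp_sketch <- function(x, k, seed = NULL) {
  x <- as.matrix(x)
  if (!is.null(seed)) set.seed(seed)
  n <- nrow(x)
  idx <- integer(k)
  # First prototype: one data row chosen uniformly at random
  idx[1] <- sample.int(n, 1)
  # d2[i] = squared Euclidean distance from row i to its nearest chosen prototype
  d2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next prototype: a row drawn with probability proportional to d2
    idx[j] <- sample.int(n, 1, prob = d2 / sum(d2))
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  x[idx, , drop = FALSE]
}

v0 <- kmpp_sketch(iris[, -5], k = 3, seed = 42)
dim(v0)   # one row per cluster, one column per feature
```

A matrix such as `v0` has the shape that `centers` expects when initial prototypes are supplied instead of a cluster count.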

Details

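The Hard C-Means (HCM) clustering algorithm (or K-means) partitions a data set of n objects into k groups, called clusters. With u_{ij} denoting the hard membership degree (0 or 1) of object i in cluster j, the objective function of HCM is:

J_{HCM}(\mathbf{X}; \mathbf{V})=\sum\limits_{j=1}^k \sum\limits_{i=1}^n u_{ij} \, d^2(\vec{x}_i, \vec{v}_j)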

See ppclust-package for the details about the terms in the above equation of J_{HCM}.

The update equation for membership degrees is:

u_{ij} = \left\{ \begin{array}{ll} 1 & \text{if } d^2(\vec{x}_i, \vec{v}_j) = \min\limits_{1\leq l\leq k} d^2(\vec{x}_i, \vec{v}_l) \\ 0 & \text{otherwise} \end{array} \right.

The update equation for cluster prototypes is:

\vec{v}_{j} =\frac{\sum\limits_{i=1}^n u_{ij} \vec{x}_i}{\sum\limits_{i=1}^n u_{ij}}, \quad 1\leq j\leq k
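
The two update equations above alternate until the objective stops changing. The following base R code is a minimal sketch of that loop, not the implementation in ppclust. The function name `hcm_sketch` is hypothetical, the squared Euclidean distance is assumed, and the arguments `iter.max` and `con.val` only mimic the roles described under Arguments.

```r
# Hypothetical helper (not ppclust code): alternating HCM updates
hcm_sketch <- function(x, v, iter.max = 100, con.val = 1e-9) {
  x <- as.matrix(x)
  v <- as.matrix(v)
  k <- nrow(v)
  j.old <- Inf
  for (iter in seq_len(iter.max)) {
    # d2[i, j] = squared Euclidean distance between object i and prototype j
    d2 <- vapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, v[j, ])^2),
                 numeric(nrow(x)))
    # Membership update: u[i, j] = 1 for the nearest prototype, 0 otherwise
    cl <- max.col(-d2, ties.method = "first")
    u <- diag(k)[cl, , drop = FALSE]
    # Objective evaluated with the new memberships and the current prototypes
    j.new <- sum(u * d2)
    # Prototype update: mean of the objects assigned to each cluster
    for (j in seq_len(k)) {
      if (any(cl == j)) v[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
    # Stop when the objective no longer changes by at least con.val
    if (abs(j.old - j.new) < con.val) break
    j.old <- j.new
  }
  list(v = v, u = u, cluster = cl, func.val = j.new, iter = iter)
}

x <- iris[, -5]
res <- hcm_sketch(x, v = x[c(1, 51, 101), ])   # three data rows as initial prototypes
table(res$cluster)
```

Empty clusters keep their previous prototype in this sketch; how ppclust handles that case is not stated here.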

Value

an object of class ‘ppclust’, which is a list consisting of the following items:

`x`	a numeric matrix containing the processed data set.
`v`	a numeric matrix containing the final cluster prototypes (centers of clusters).
`u`	a numeric matrix containing the hard membership degrees of the data objects.
`d`	a numeric matrix containing the distances of objects to the final cluster prototypes.
`k`	an integer for the number of clusters.
`cluster`	a numeric vector containing the cluster labels of the data objects.
`csize`	a numeric vector containing the number of objects in the clusters.
`best.start`	an integer for the index of the start with the minimum objective function value.
`iter`	an integer vector for the number of iterations in each start of the algorithm.
`func.val`	a numeric vector for the objective function values of each start of the algorithm.
`comp.time`	a numeric vector for the execution time of each start of the algorithm.
`wss`	a numeric vector containing the within-cluster sum of squares for each cluster.
`bwss`	a number for the between-cluster sum of squares.
`tss`	a number for the total sum of squares.
`twss`	a number for the total within-cluster sum of squares.
`stand`	a logical value. `TRUE` indicates that `x` contains the standardized values of the raw data.
`algorithm`	a string for the name of partitioning algorithm. It is ‘HCM’ with this function.
`call`	a string for the matched function call generating this ‘ppclust’ object.
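
The sum-of-squares items in the list above are linked by the usual decomposition for a hard partition: the total sum of squares equals the total within-cluster sum of squares plus the between-cluster sum of squares. The sketch below checks this in base R, using `stats::kmeans` as a stand-in partitioner rather than hcm. The variable names are illustrative only and are not the item names of a ‘ppclust’ object.

```r
x <- as.matrix(iris[, -5])
set.seed(1)
km <- stats::kmeans(x, centers = 3, nstart = 5)   # stand-in for a hard partition
cl <- km$cluster

# Within-cluster sum of squares, one value per cluster
within_each <- sapply(split(as.data.frame(x), cl),
                      function(g) sum(scale(g, scale = FALSE)^2))
within_total <- sum(within_each)                  # total within-cluster sum of squares
total <- sum(scale(x, scale = FALSE)^2)           # total sum of squares
between <- total - within_total                   # between-cluster sum of squares

# Compare with the values reported by kmeans
all.equal(unname(within_each), km$withinss)
all.equal(c(within_total, total, between),
          c(km$tot.withinss, km$totss, km$betweenss))
```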

Author(s)

Zeynel Cebeci & Figen Yildiz

References

Arthur, D. & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proc. of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035. <http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf>

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proc. of 5th Berkeley Symp. on Mathematical Statistics and Probability, Berkeley, Univ. of California Press, 1: 281-297. <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.308.8619&rep=rep1&type=pdf>

Examples

## Not run: 
# Load dataset iris 
data(iris)
x <- iris[,-5]

# Initialize the prototype matrix using K-means++
v <- inaparc::kmpp(x, k=3)$v

# Run HCM with the initial prototypes
res.hcm <- hcm(x, centers=v)

# Print the cluster labels, tabulate the cluster sizes and plot the clustering result
res.hcm$cluster
table(res.hcm$cluster)
plot(x, col=res.hcm$cluster, pch=16)

## End(Not run)

[Package ppclust version 1.1.0.1 Index]