R: k-means Clustering Using Pivotal Algorithms For Seeding

piv_KMeans {pivmet}

R Documentation

k-means Clustering Using Pivotal Algorithms For Seeding

Description

Perform classical k-means clustering on a data matrix using pivots as initial centers.

Usage

piv_KMeans(
  x,
  centers,
  alg.type = c("kmeans", "hclust"),
  method = "average",
  piv.criterion = c("MUS", "maxsumint", "minsumnoint", "maxsumdiff"),
  H = 1000,
  iter.max = 10,
  nstart = 10,
  prec_par = 10
)

Arguments

`x`	An `N \times D` data matrix, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
`centers`	The number of groups for the `k`-means solution.
`alg.type`	The clustering algorithm for the initial partition of the `N` units into the desired number of clusters. Possible choices are `"kmeans"` (default) and `"hclust"`.
`method`	If `alg.type` is `"hclust"`, the character string defining the clustering method. The methods implemented are `"single"`, `"complete"`, `"average"`, `"ward.D"`, `"ward.D2"`, `"mcquitty"`, `"median"`, `"centroid"`. The default is `"average"`.
`piv.criterion`	The pivotal criterion used for identifying one pivot for each group. Possible choices are: `"MUS", "maxsumint", "minsumnoint", "maxsumdiff"`. If `centers <= 4`, the default method is `"MUS"`; otherwise, the default method is `"maxsumint"` (see the details and the vignette).
`H`	The number of distinct `k`-means runs used for building the `N \times N` co-association matrix. Default is 10^3.
`iter.max`	If `alg.type` is `"kmeans"`, the maximum number of iterations to be passed to `kmeans()`. Default is 10.
`nstart`	If `alg.type` is `"kmeans"`, the number of different starting random seeds to be passed to `kmeans()`. Default is 10.
`prec_par`	If `piv.criterion` is `"MUS"`, the maximum number of competing pivots in each group. Default is 10. If any group in the initial partition contains fewer units than this value, `prec_par` is set equal to the cardinality of the smallest group.

Details

The function implements a modified version of k-means that aims to improve the clustering solution through careful seeding. It performs a pivot-based initialization step: pivotal methods identify one unit (pivot) per group, and these pivots are used as the initial centers for the clustering procedure. The starting point consists of multiple runs of classical k-means (selecting `nstart > 1` in the `kmeans()` function) with a fixed number of clusters, which are used to build the co-association matrix of the data units.
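
The steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the code used by `piv_KMeans()`: the toy data, the reduced number of runs, and in particular the pivot rule (the unit with the largest summed co-association within its own group) are choices made here for the example and may differ from the criteria implemented in the package.

```r
# Minimal sketch in base R; NOT the pivmet implementation.
set.seed(123)

# Toy data: three well-separated bivariate Gaussian groups (illustrative).
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  cbind(rnorm(20, mean = 0), rnorm(20, mean = 8))
)
N <- nrow(x)
k <- 3
H <- 50  # number of k-means runs in the ensemble (kept small for speed)

# Step 1: co-association matrix, i.e. the fraction of runs in which
# units i and j are assigned to the same cluster.
coass <- matrix(0, N, N)
for (h in seq_len(H)) {
  cl <- kmeans(x, centers = k)$cluster
  coass <- coass + outer(cl, cl, "==")
}
coass <- coass / H

# Step 2: initial partition of the N units into k groups.
init <- kmeans(x, centers = k, nstart = 10)$cluster

# Step 3: one pivot per group. Assumption for this sketch: take the unit
# with the largest summed co-association with the members of its own group.
pivots <- vapply(seq_len(k), function(g) {
  members <- which(init == g)
  members[which.max(rowSums(coass[members, members, drop = FALSE]))]
}, integer(1))

# Step 4: k-means seeded with the pivots as initial centers.
fit <- kmeans(x, centers = x[pivots, , drop = FALSE], iter.max = 10)
```

Because the pivots are rows of `x` taken from different groups of the initial partition, the final `kmeans()` call starts from one observed unit per group rather than from randomly drawn centers.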

Value

A list with components

`cluster`	A vector of integers indicating the cluster to which each point is allocated.
`centers`	A matrix of cluster centers (centroids).
`coass`	The co-association matrix built from ensemble clustering.
`pivots`	The pivotal units identified by the selected pivotal criterion.
`totss`	The total sum of squares.
`withinss`	The within-cluster sum of squares for each cluster.
`tot.withinss`	The within-cluster sum of squares summed across clusters.
`betweenss`	The between-cluster sum of squared distances.
`size`	The number of points in each cluster.
`iter`	The number of (outer) iterations.
`ifault`	An integer indicating a possible algorithm problem (for experts).
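
The sum-of-squares components can be checked by hand. The sketch below uses `stats::kmeans()` on toy data rather than `piv_KMeans()` itself, on the assumption that the components of the same name carry the same meaning in both functions.

```r
# Illustration with stats::kmeans() on toy data (not piv_KMeans() itself).
set.seed(1)
x <- rbind(matrix(rnorm(60), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10)

# Total sum of squares: squared distances from the overall mean.
totss_manual <- sum(scale(x, scale = FALSE)^2)

# Within-cluster sum of squares, one value per cluster.
withinss_manual <- vapply(seq_len(2), function(g) {
  xg <- x[fit$cluster == g, , drop = FALSE]
  sum(scale(xg, scale = FALSE)^2)
}, numeric(1))

# Each comparison should return TRUE.
all.equal(fit$totss, totss_manual)
all.equal(fit$withinss, withinss_manual)
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)
```

The last comparison shows the decomposition of the total sum of squares into its within-cluster and between-cluster parts.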

Author(s)

Leonardo Egidi <legidi@units.it>, Roberta Pappadà

References

Egidi, L., Pappadà, R., Pauli, F., Torelli, N. (2018). K-means seeding via MUS algorithm. Conference Paper, Book of Short Papers, SIS2018, ISBN: 9788891910233.

Examples


# Data generated from a mixture of three bivariate Gaussian distributions

## Not run: 
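library(pivmet)
library(mvtnorm)  # provides rmvnorm()

N  <- 620
k  <- 3
n1 <- 20
n2 <- 100
n3 <- 500
x  <- matrix(NA, N, 2)
truegroup <- c(rep(1, n1), rep(2, n2), rep(3, n3))

# Fill each block of rows with draws from one mixture component
x[1:n1, ] <- rmvnorm(n1, c(1, 5), sigma = diag(2))
x[(n1 + 1):(n1 + n2), ] <- rmvnorm(n2, c(4, 0), sigma = diag(2))
x[(n1 + n2 + 1):(n1 + n2 + n3), ] <- rmvnorm(n3, c(6, 6), sigma = diag(2))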

# Apply piv_KMeans with MUS as pivotal criterion

res <- piv_KMeans(x, k)

# Apply piv_KMeans with maxsumdiff as pivotal criterion

res2 <- piv_KMeans(x, k, piv.criterion ="maxsumdiff")

# Plot the data and the clustering solution

par(mfrow=c(1,2), pty="s")
colors_cluster <- c("grey", "darkolivegreen3", "coral")
colors_centers <- c("black", "darkgreen", "firebrick")
graphics::plot(x, col = colors_cluster[truegroup],
   bg= colors_cluster[truegroup], pch=21, xlab="x[,1]",
   ylab="x[,2]", cex.lab=1.5,
   main="True data", cex.main=1.5)

graphics::plot(x, col = colors_cluster[res$cluster],
   bg=colors_cluster[res$cluster], pch=21, xlab="x[,1]",
   ylab="x[,2]", cex.lab=1.5,
   main="piv_KMeans", cex.main=1.5)
points(x[res$pivots, 1], x[res$pivots, 2],
      pch=24, col=colors_centers,bg=colors_centers,
      cex=1.5)
points(res$centers, col = colors_centers[1:k],
   pch = 8, cex = 2)

## End(Not run)

[Package pivmet version 0.6.0 Index]