
Zsimilarity {IMIFA}

R Documentation

Summarise MCMC samples of clustering labels with a similarity matrix and find the 'average' clustering

Description

This function takes a Monte Carlo sample of cluster labels, computes an average similarity matrix and returns the clustering with minimum mean squared error to this average. The mcclust package must be loaded.

Usage

Zsimilarity(zs)

Arguments

`zs`	A matrix containing samples of clustering labels, with one column per observation (N columns) and one row per iteration (M rows).

Details

This function takes a Monte Carlo sample of cluster labels, converts them to adjacency matrices, and computes a similarity matrix as an average of the adjacency matrices. The similarity matrix is invariant to label switching, and its dimension does not depend on the number of clusters in each sample; both are desirable features when summarising partitions of Bayesian nonparametric models such as IMIFA. As a summary of the posterior clustering, the sampled clustering with minimum mean squared error to this 'average' clustering is reported.
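
The computation described above can be sketched in base R on a toy sample. This is only an illustration of the idea, not the package's implementation: the helper `adjacency`, the toy matrix `zs`, and the names `z_sim`, `mse` and `z_avg` are assumptions made for this example, and the MSE here is a plain mean over all N x N entries, which may differ from the package's normalisation.

```r
# Toy sample of labels: M = 4 iterations (rows), N = 5 observations (columns).
# Rows 1 and 2 describe the same partition under switched labels.
zs <- rbind(c(1, 1, 2, 2, 2),
            c(2, 2, 1, 1, 1),
            c(1, 1, 1, 2, 2),
            c(1, 1, 2, 2, 3))

# Hypothetical helper: adjacency (co-clustering) matrix for one label vector.
# Entry (i, j) is 1 when observations i and j share a label, and 0 otherwise.
adjacency <- function(z) outer(z, z, "==") * 1

# Average the M adjacency matrices to obtain the N x N similarity matrix.
adj_list <- lapply(seq_len(nrow(zs)), function(m) adjacency(zs[m, ]))
z_sim    <- Reduce(`+`, adj_list) / nrow(zs)

# Mean squared error between each sampled adjacency matrix and the average
# (assumption: averaged over all N x N entries).
mse <- vapply(adj_list, function(a) mean((a - z_sim)^2), numeric(1))

# The 'average' clustering is the sampled partition with the smallest MSE.
z_avg <- zs[which.min(mse), ]
```

Because rows 1 and 2 of the toy `zs` differ only by a relabelling, they yield identical adjacency matrices and hence identical MSEs, which illustrates the invariance to label switching.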

A heatmap of z.sim may provide a useful visualisation, if appropriately ordered. The user is also invited to perform hierarchical clustering using hclust after first converting this similarity matrix to a distance matrix - "complete" linkage is recommended. Alternatively, hc could be used.
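
One possible way to obtain an 'appropriate' ordering for the heatmap is to reuse the leaf order of the hierarchical clustering itself. The sketch below uses base R only and a small hypothetical similarity matrix standing in for a dense copy of z.sim; the object names (`z_sim`, `z_dist`, `hcl`, `ord`, `z_ordered`, `groups`) and the choice of k = 2 are assumptions for illustration.

```r
# Toy 5 x 5 similarity matrix (hypothetical values, standing in for z.sim
# after conversion to a dense matrix with as.matrix()).
z_sim <- matrix(c(1.00, 1.00, 0.25, 0.00, 0.00,
                  1.00, 1.00, 0.25, 0.00, 0.00,
                  0.25, 0.25, 1.00, 0.75, 0.50,
                  0.00, 0.00, 0.75, 1.00, 0.75,
                  0.00, 0.00, 0.50, 0.75, 1.00), nrow = 5, byrow = TRUE)

# Similarities lie in [0, 1], so 1 - similarity serves as a dissimilarity.
z_dist <- as.dist(1 - z_sim)

# Complete-linkage hierarchical clustering on the dissimilarities.
hcl <- hclust(z_dist, method = "complete")

# Reorder rows and columns by the dendrogram's leaf order so that
# observations which often co-cluster sit next to each other.
ord       <- hcl$order
z_ordered <- z_sim[ord, ord]

# Base-graphics heatmap of the reordered matrix.
image(z_ordered, col = heat.colors(30, rev = TRUE), axes = FALSE)

# Cut the tree into a chosen number of groups (k = 2 is arbitrary here).
groups <- cutree(hcl, k = 2)
```

Applying the same permutation to both rows and columns keeps the matrix symmetric, so blocks of frequently co-clustered observations appear along the diagonal.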

Value

A list containing three elements:

`z.avg`	The 'average' clustering, with minimum squared distance to `z.sim`.
`z.sim`	The N x N similarity matrix, in a sparse format (see `simple_triplet_matrix`).
`MSE.z`	A vector of length M recording the MSEs between each clustering and the 'average' clustering.

Note

The mcclust package must be loaded.

This is liable to take quite some time to run, especially if the number of observations and/or the number of iterations is large. Depending on how distinct the clusters are, z.sim may be better stored in a non-sparse format. This function can optionally be called inside get_IMIFA_results.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Carmona, C., Nieto-Barajas, L. and Canale, A. (2018) Model based approach for household clustering with mixed scale variables. Advances in Data Analysis and Classification, 13(2): 559-583.

Examples

# Run an IMIFA model and extract the sampled cluster labels
# data(olive)
# sim    <- mcmc_IMIFA(olive, method="IMIFA", n.iters=5000)
# zs     <- sim[[1]][[1]]$z.store

# Get the similarity matrix and visualise it
# zsimil <- Zsimilarity(zs)
# z.sim  <- as.matrix(zsimil$z.sim)
# z.col  <- mat2cols(z.sim, cols=heat.colors(30, rev=TRUE))
# z.col[z.sim == 0] <- NA
# plot_cols(z.col, na.col=par()$bg); box(lwd=2)

# Extract the clustering with minimum squared distance to this
# 'average' and evaluate its performance against the true labels
# table(zsimil$z.avg, olive$area)

# Perform hierarchical clustering on the distance matrix
# Hcl    <- hclust(as.dist(1 - z.sim), method="complete")
# plot(Hcl)
# table(cutree(Hcl, k=3), olive$area)

[Package IMIFA version 2.2.0 Index]