caCluster {CAinterprTools}

R Documentation

Clustering row/column categories on the basis of Correspondence Analysis coordinates from a space of user-defined dimensionality.

Description

This function plots the result of cluster analysis performed on the results of Correspondence Analysis, providing the facility to produce a dendrogram, a silhouette plot depicting the "quality" of the clustering solution, and a scatterplot with points coded according to the cluster membership.

Usage

caCluster(
  data,
  which = "both",
  dim = NULL,
  dist.meth = "euclidean",
  aggl.meth = "ward.D2",
  opt.part = FALSE,
  opt.part.meth = "mean",
  part = NULL,
  cex.dndr.lab = 0.85,
  cex.sil.lab = 0.75,
  cex.sctpl.lab = 3.5
)

Arguments

`data`	Contingency table (dataframe format).
`which`	Takes "both" to cluster both row and column categories; "rows" or "columns" to cluster only row or column categories respectively.
`dim`	Sets the dimensionality of the space whose coordinates are used to cluster the CA categories; it can be an integer or a vector (e.g., c(2,3)) specifying the first and second selected dimension. NULL is the default, in which case the clustering is based on the maximum dimensionality of the dataset.
`dist.meth`	Sets the distance method used for the calculation of the distance between categories; "euclidean" is the default (see the help of the dist() function for more info and other methods available).
`aggl.meth`	Sets the agglomerative method to be used in the dendrogram construction; "ward.D2" is the default (see the help of the hclust() function for more info and for other methods available).
`opt.part`	Takes TRUE or FALSE (default) if the user wants or doesn't want an optimal partition to be suggested; the optimal partition is found by an iterative process that seeks to maximize the average silhouette width.
`opt.part.meth`	Sets whether the optimal partition method will try to maximize the average ("mean") or median ("median") silhouette width. The former is the default.
`part`	Integer which sets the number of desired clusters (NULL is default); this will override the optimal cluster solution.
`cex.dndr.lab`	Sets the size of the dendrogram's labels. 0.85 is the default.
`cex.sil.lab`	Sets the size of the silhouette plot's labels. 0.75 is the default.
`cex.sctpl.lab`	Sets the size of the Correspondence Analysis scatterplot's labels. 3.5 is the default.

Details

The function provides the facility to perform hierarchical cluster analysis of row and/or column categories on the basis of Correspondence Analysis results. The clustering is based on the row and/or column categories' coordinates from:
(1) a high-dimensional space corresponding to the whole dimensionality of the input contingency table;
(2) a high-dimensional space of dimensionality smaller than the full dimensionality of the input dataset;
(3) a bi-dimensional space defined by a pair of user-defined dimensions.
To obtain (1), the 'dim' parameter must be left at its default value (NULL);
To obtain (2), the 'dim' parameter must be given an integer (needless to say, smaller than the full dimensionality of the input data);
To obtain (3), the 'dim' parameter must be given a vector (e.g., c(1,3)) specifying the dimensions the user is interested in.
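
The three cases above can be illustrated with a minimal base-R sketch. It is not the internal code of caCluster(): the contingency table is made up, the helper select_coords() is hypothetical, and the use of row principal coordinates obtained from an SVD of the standardized residuals is an assumption about which coordinates are clustered.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(20,  5,  3,  8,
                 6, 18,  4,  7,
                 4,  6, 22,  9,
                10,  7,  8,  5,
                 3,  9,  6, 15),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("r", 1:5), paste0("c", 1:4)))

# Correspondence Analysis via SVD of the standardized residuals
P <- tab / sum(tab)
r.mass <- rowSums(P)
c.mass <- colSums(P)
S <- diag(1 / sqrt(r.mass)) %*% (P - outer(r.mass, c.mass)) %*% diag(1 / sqrt(c.mass))
dec <- svd(S)

# Full dimensionality of the table: min(rows, columns) - 1
max.dim <- min(dim(tab)) - 1

# Row principal coordinates on all the dimensions (assumed coordinate type)
row.coords <- diag(1 / sqrt(r.mass)) %*% dec$u[, 1:max.dim] %*% diag(dec$d[1:max.dim])
rownames(row.coords) <- rownames(tab)

# Hypothetical helper mimicking the three meanings of 'dim'
select_coords <- function(coords, dim = NULL) {
  if (is.null(dim)) return(coords)                                   # case (1)
  if (length(dim) == 1) return(coords[, seq_len(dim), drop = FALSE]) # case (2)
  coords[, dim, drop = FALSE]                                        # case (3)
}

full.space <- select_coords(row.coords)            # all 3 dimensions
sub.space  <- select_coords(row.coords, 2)         # first 2 dimensions
pair.space <- select_coords(row.coords, c(1, 3))   # dimensions 1 and 3 only
```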

The method by which the distance is calculated is specified using the 'dist.meth' parameter, while the agglomerative method is specified using the 'aggl.meth' parameter. By default, they are set to "euclidean" and "ward.D2" respectively.
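
As a rough illustration of what the two defaults correspond to in base R, the sketch below runs dist() and hclust() on a made-up matrix of coordinates. The coordinates and object names are assumptions for the example; in caCluster() the input would be the CA coordinates of the selected dimensions.

```r
# Made-up coordinates of six categories on two dimensions (illustration only)
coords <- matrix(c(0.0, 0.0,
                   0.2, 0.1,
                   0.1, 0.3,
                   5.0, 5.0,
                   5.2, 5.1,
                   5.1, 4.8),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(letters[1:6], c("Dim1", "Dim2")))

# Distance between categories (counterpart of dist.meth = "euclidean")
d <- dist(coords, method = "euclidean")

# Agglomerative clustering (counterpart of aggl.meth = "ward.D2")
hc <- hclust(d, method = "ward.D2")

# Cut the tree into two clusters to inspect the membership
membership <- cutree(hc, k = 2)
```

Any other method accepted by dist() or hclust() (for instance "manhattan" or "average") can be passed in the same way.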

The user may want to specify beforehand the desired number of clusters (i.e., the cluster solution). This is accomplished by feeding an integer into the 'part' parameter. A dendrogram (with rectangles indicating the clustering solution), a silhouette plot (indicating the "quality" of the cluster solution), and a CA scatterplot (with points given colours on the basis of their cluster membership) are returned. Please note that, when a high-dimensional space is selected, the scatterplot will use the first 2 CA dimensions; the user must keep in mind that the clustering based on a higher-dimensional space may not be well reflected in the subspace defined by the first two dimensions only.
Also note:
-if both row and column categories are subject to the clustering, the column categories will be flagged by an asterisk (*) in the dendrogram (and in the silhouette plot) just to make it easier to identify rows and columns;
-the silhouette plot displays the average silhouette width as a dashed vertical line; the dimensionality of the CA space used is reported in the plot's title; if a pair of dimensions has been used, the individual dimensions are reported in the plot's title;
-the silhouette plot's labels end with a number indicating the cluster to which each category is closer.
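
The quantities shown in the silhouette plot can be reproduced, in outline, with the silhouette() function of the 'cluster' package. The sketch below uses made-up coordinates and a user-defined number of clusters (the counterpart of the 'part' parameter); it is an illustration under these assumptions, not the function's own code.

```r
library(cluster)

# Made-up coordinates of six categories on two dimensions (illustration only)
coords <- matrix(c(0.0, 0.0,
                   0.2, 0.1,
                   0.1, 0.3,
                   5.0, 5.0,
                   5.2, 5.1,
                   5.1, 4.8),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(letters[1:6], c("Dim1", "Dim2")))

d  <- dist(coords, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# User-defined cluster solution
part <- 2
membership <- cutree(hc, k = part)

# Silhouette information: one row per category, with the assigned cluster,
# the nearest other cluster ("neighbor"), and the silhouette width
sil <- silhouette(membership, d)

# Average silhouette width (the dashed vertical line in the silhouette plot)
avg.width <- mean(sil[, "sil_width"])

# Nearest other cluster for each category
nearest <- sil[, "neighbor"]
```

Calling plot(sil) draws the silhouette plot, and rect.hclust(hc, k = part) adds the rectangles to a dendrogram previously drawn with plot(hc).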

An optimal clustering solution can be obtained by setting the 'opt.part' parameter to TRUE. The optimal partition is selected by means of an iterative routine which locates the cluster solution at which the highest average (or median, depending on 'opt.part.meth') silhouette width is achieved. If the 'opt.part' parameter is set to TRUE, an additional plot is returned along with the silhouette plot. It displays a scatterplot in which the cluster solution (x-axis) is plotted against the average silhouette width (y-axis). A vertical reference line indicates the cluster solution which maximizes the silhouette width, corresponding to the suggested optimal partition.
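
The iterative search can be sketched as follows. The helper best_partition() is hypothetical, and the range of cluster solutions it scans (from 2 to the number of categories minus 1) is an assumption, since the range actually used by caCluster() is not stated here.

```r
library(cluster)

# Made-up coordinates of six categories on two dimensions (illustration only)
coords <- matrix(c(0.0, 0.0,
                   0.2, 0.1,
                   0.1, 0.3,
                   5.0, 5.0,
                   5.2, 5.1,
                   5.1, 4.8),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(letters[1:6], c("Dim1", "Dim2")))

d  <- dist(coords, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Hypothetical helper: scan the cluster solutions and keep the one with the
# largest mean (or median) silhouette width
best_partition <- function(hc, d, meth = c("mean", "median")) {
  meth <- match.arg(meth)
  stat <- switch(meth, mean = mean, median = median)
  n  <- length(hc$order)   # number of clustered categories
  ks <- 2:(n - 1)          # assumed search range
  widths <- sapply(ks, function(k) {
    sil <- silhouette(cutree(hc, k = k), d)
    stat(sil[, "sil_width"])
  })
  list(k = ks[which.max(widths)], widths = setNames(widths, ks))
}

opt.mean   <- best_partition(hc, d, meth = "mean")
opt.median <- best_partition(hc, d, meth = "median")
```

Plotting names(opt.mean$widths) against opt.mean$widths gives the kind of scatterplot described above, with the maximum marking the suggested partition.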

The function returns a list storing information about the cluster membership (i.e., which categories belong to which cluster).

Further info and Disclaimer:
The silhouette plot is obtained from the silhouette() function of the 'cluster' package (https://cran.r-project.org/web/packages/cluster/index.html). For a detailed description of the silhouette plot, its rationale, and its interpretation, see:
-Rousseeuw P J. 1987. "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics 20, 53-65 (http://www.sciencedirect.com/science/article/pii/0377042787901257)

For the idea of clustering categories on the basis of the CA coordinates from a full high-dimensional space (or from a subset thereof), see:
-Ciampi et al. 2005. "Correspondence analysis and two-way clustering", SORT 29 (1), 27-4
-Beh et al. 2011. "A European perception of food using two methods of correspondence analysis", Food Quality and Preference 22(2), 226-231

Please note that the interpretation of the clustering when both row AND column categories are used must proceed with caution due to the issue of inter-class points' distance interpretation. For a full description of the issue (also with further references), see:
-Greenacre M. 2007. "Correspondence Analysis in Practice", Boca Raton-London-New York, Chapman&Hall/CRC, 267-268.

Examples

data(brand_coffee)

#displays a dendrogram of row AND column categories
res <- caCluster(brand_coffee, opt.part=FALSE)

#displays a dendrogram for row AND column categories, with rectangles indicating the clusters
#defined by the optimal partition method (see Details); the clustering is based on the CA
#coordinates from a full high-dimensional space. A silhouette plot, a scatterplot, and a CA
#scatterplot with indication of cluster membership are also produced (see Details).
#The cluster membership is stored in the object 'res'.

res <- caCluster(brand_coffee, opt.part=TRUE)

#displays a dendrogram for row categories, with rectangles indicating the clusters defined by the 
#optimal partition method (see Details). The clustering is based on a space of dimensionality 4. 
#A silhouette plot, a scatterplot, and a CA scatterplot with indication of cluster membership are 
#also produced (see Details). The cluster membership is stored in the object 'res'.

res <- caCluster(brand_coffee, which="rows", dim=4, opt.part=TRUE)

#like the above example, but the clustering is based on the coordinates on the sub-space defined 
#by a pair of dimensions (i.e., 1 and 4).

res <- caCluster(brand_coffee, which="rows", dim=c(1,4), opt.part=TRUE)