make_clusters {EvoPhylo} | R Documentation

## Estimate and plot character partitions

### Description

Determines cluster (partition) membership for phylogenetic morphological characters from the supplied Gower distance matrix and requested number of clusters using partitioning around medoids (PAM, or K-medoids). To independently check the quality of the chosen partitioning scheme, users may also produce a graphic clustering (tSNE) in which data points are colored according to their PAM clusters.

### Usage

```
make_clusters(dist_mat, k, tsne = FALSE,
              tsne_dim = 2, tsne_theta = 0,
              ...)

## S3 method for class 'cluster_df'
plot(x, seed = NA, nrow = 1,
     ...)
```

### Arguments

`dist_mat` |
A Gower distance matrix, the output of a call to |

`k` |
The desired number of clusters (or character partitions), the output from |

`tsne` |
Whether to perform Barnes-Hut t-distributed stochastic neighbor embedding (tSNE) to produce a multi-dimensional representation of the distance matrix using |

`tsne_dim` |
When |

`tsne_theta` |
When |

`...` |
For For |

`x` |
For |

`seed` |
For |

`nrow` |
For |

### Details

`make_clusters` calls `cluster::pam` on the supplied Gower distance matrix with the specified number of clusters to determine cluster membership for each character. PAM is analogous to K-means, but its clusters are centered on medoids rather than centroids; medoids are less affected by outliers and by heterogeneous cluster sizes. PAM also has the advantage over K-means of accepting Gower distance matrices rather than Euclidean distances only.
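As a rough illustration of the PAM step (a sketch, not the package's own code), the example below builds a Gower distance matrix for a small, made-up character table with `cluster::daisy()` and passes it to `cluster::pam()`. The data frame `toy_chars` and all of its values are invented for this example.

```r
library(cluster)

# Hypothetical toy data: 6 characters (rows) scored as discrete states
# across 4 taxa (columns). All values are invented for illustration.
toy_chars <- data.frame(
  taxon_a = factor(c("0", "0", "0", "1", "1", "1")),
  taxon_b = factor(c("0", "0", "0", "1", "1", "1")),
  taxon_c = factor(c("0", "0", "1", "1", "1", "0")),
  taxon_d = factor(c("0", "1", "0", "1", "0", "1"))
)

# Gower dissimilarities between characters (rows)
toy_dist <- daisy(toy_chars, metric = "gower")

# Partition around medoids with k = 2; the input is already a dissimilarity
toy_fit <- pam(toy_dist, k = 2, diss = TRUE)

toy_fit$clustering  # cluster assignment for each character
toy_fit$id.med      # row indices of the characters chosen as medoids
```

Because each medoid is one of the observed characters, `toy_fit$id.med` points back to rows of `toy_chars`, which is not the case for K-means centroids.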

When `tsne = TRUE`, a Barnes-Hut t-distributed stochastic neighbor embedding is used to compute a multi-dimensional embedding of the distance matrix, coloring data points according to the PAM-defined clusters, as estimated by the function `make_clusters`. This graphic clustering allows users to independently test the quality of the chosen partitioning scheme from PAM, and can help in visualizing the resulting clusters. `Rtsne::Rtsne` is used to do this. The resulting dimensions will be included in the output; see Value below.
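The kind of `Rtsne::Rtsne` call involved can be sketched as follows. This is an assumption about the general approach, not the package's actual implementation: the toy data, the object names (`toy_chars`, `toy_dist`, `tsne_df`), and the `perplexity` value are all chosen for illustration, and the tSNE step only runs if the `Rtsne` package is installed.

```r
library(cluster)

set.seed(42)  # tSNE is stochastic; fix the seed for reproducibility

# Hypothetical toy data: 40 characters scored for 10 taxa, 3 states each
toy_chars <- as.data.frame(
  matrix(sample(c("0", "1", "2"), 40 * 10, replace = TRUE), nrow = 40),
  stringsAsFactors = TRUE
)
toy_dist <- daisy(toy_chars, metric = "gower")
toy_fit  <- pam(toy_dist, k = 3, diss = TRUE)

if (requireNamespace("Rtsne", quietly = TRUE)) {
  # Embed the distance matrix directly; theta = 0 requests exact tSNE.
  # perplexity must satisfy 3 * perplexity < n - 1, so 5 is safe for n = 40.
  emb <- Rtsne::Rtsne(as.matrix(toy_dist), is_distance = TRUE,
                      dims = 2, theta = 0, perplexity = 5)

  # Combine embedding coordinates with the PAM cluster labels
  tsne_df <- data.frame(
    tSNE_Dim1 = emb$Y[, 1],
    tSNE_Dim2 = emb$Y[, 2],
    cluster   = factor(toy_fit$clustering)
  )
  head(tsne_df)
}
```

Since the toy states are drawn at random, no clear cluster structure should be expected here; with real data, well-separated groups of same-colored points would support the PAM partitioning.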

`plot()` plots all morphological characters in a scatterplot with points colored based on cluster membership. When `tsne = TRUE` in the call to `make_clusters()`, the x- and y-axes will correspond to the requested tSNE dimensions. With more than 2 dimensions, several plots will be produced, one for each pair of tSNE dimensions. These are displayed together using `patchwork::plot_layout`. When `tsne = FALSE`, the points will be arranged horizontally by cluster membership and placed randomly along the vertical axis.
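To get a feel for how many panels to expect with more than 2 tSNE dimensions, the pairs can be enumerated with base R. This is only a sketch: it assumes each unordered pair of dimensions gets one panel, and the ordering shown here need not match the order in which the package draws them.

```r
# Column names follow the tSNE_Dim1, tSNE_Dim2, ... pattern described under Value
n_dims    <- 3
dim_names <- paste0("tSNE_Dim", seq_len(n_dims))

# Each column of the result is one (x-axis, y-axis) pair of dimensions
dim_pairs <- combn(dim_names, 2)
dim_pairs

# Number of panels, equal to choose(n_dims, 2)
ncol(dim_pairs)
```

With 3 dimensions this gives 3 pairs, so setting `nrow = 2` in `plot()` would spread the panels over two rows.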

### Value

A data frame, inheriting from class `"cluster_df"`, with a row for each character with its number (`character_number`) and cluster membership (`cluster`). When `tsne = TRUE`, additional columns will be included, one for each requested tSNE dimension, labeled `tSNE_Dim1`, `tSNE_Dim2`, etc., containing the values on the dimensions computed using `Rtsne()`.

The `pam` fit resulting from `cluster::pam` is returned in the `"pam.fit"` attribute of the output object.
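The attribute can be retrieved with `attr()`. The sketch below uses a hand-built stand-in (`mock_df`) rather than a real `make_clusters()` result, so that it runs with only the `cluster` package; the stand-in's construction, including storing `cluster` as a factor, is an assumption made for illustration and is not taken from the package source.

```r
library(cluster)

# Hand-built stand-in for a make_clusters() result (illustration only):
# six points on a line forming two well-separated groups
toy_dist <- dist(matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1))
toy_fit  <- pam(toy_dist, k = 2, diss = TRUE)

mock_df <- data.frame(
  character_number = seq_along(toy_fit$clustering),
  cluster          = factor(toy_fit$clustering)
)
attr(mock_df, "pam.fit") <- toy_fit
class(mock_df) <- c("cluster_df", class(mock_df))

# Retrieve the stored PAM fit and inspect it
fit <- attr(mock_df, "pam.fit")
fit$id.med              # indices of the medoid observations
fit$silinfo$avg.width   # average silhouette width of the partitioning
```

With an actual `make_clusters()` result, the same `attr(x, "pam.fit")` call gives access to the standard components of a `pam` object, such as `silinfo`.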

### Note

When using `plot()` on a `cluster_df` object, warnings may appear from `ggrepel` saying something along the lines of "unlabeled data points (too many overlaps). Consider increasing max.overlaps". See `ggrepel::geom_text_repel` for details; the `max.overlaps` argument can be supplied to `plot()` to increase the maximum number of overlapping elements allowed in the plot. Alternatively, users can increase the size of the plot when exporting it, which enlarges the plot area and reduces the number of overlapping elements. This warning can generally be ignored, though.

### See Also

`vignette("char-part")` for the use of this function as part of an analysis pipeline.

`get_gower_dist`, `get_sil_widths`, `cluster_to_nexus`

### Examples

```
# See vignette("char-part") for how to use this
# function as part of an analysis pipeline
data("characters")
# Reading example file as categorical data
Dmatrix <- get_gower_dist(characters)
sil_widths <- get_sil_widths(Dmatrix, max.k = 7)
sil_widths
# 3 clusters yields the highest silhouette width
# Create clusters with PAM under k=3 partitions
cluster_df <- make_clusters(Dmatrix, k = 3)
# Simple plot of clusters
plot(cluster_df, seed = 12345)
# Create clusters with PAM under k=3 partitions and perform
# tSNE (2 dimensions, the default)
cluster_df_tsne <- make_clusters(Dmatrix, k = 3, tsne = TRUE,
                                 tsne_dim = 2)
# Plot clusters, with panels laid out in up to 2 rows, and raise
# the overlap limit for text labels (max.overlaps; default = 10)
plot(cluster_df_tsne, nrow = 2, max.overlaps = 20)
```

*EvoPhylo* version 0.3.2