hclu_hierarclust {bioregion}

R Documentation

Hierarchical clustering based on dissimilarity or beta-diversity

Description

This function generates a hierarchical tree from a dissimilarity (beta-diversity) data.frame, calculates the cophenetic correlation coefficient, and can extract clusters from the tree if requested by the user. The function randomizes the order of sites in the dissimilarity matrix before building the tree, and selects the final tree based on the optimal cophenetic correlation coefficient. Typically, the dissimilarity data.frame is a bioregion.pairwise.metric object obtained by running dissimilarity, or similarity followed by similarity_to_dissimilarity.

Usage

hclu_hierarclust(
  dissimilarity,
  index = names(dissimilarity)[3],
  method = "average",
  randomize = TRUE,
  n_runs = 30,
  keep_trials = FALSE,
  optimal_tree_method = "best",
  n_clust = NULL,
  cut_height = NULL,
  find_h = TRUE,
  h_max = 1,
  h_min = 0
)

Arguments

`dissimilarity`	the output object from `dissimilarity()` or `similarity_to_dissimilarity()`, or a `dist` object. If a `data.frame` is used, the first two columns represent pairs of sites (or any pair of nodes), and the next column(s) are the dissimilarity indices.
`index`	name or number of the dissimilarity column to use. By default, the third column name of `dissimilarity` is used.
`method`	name of the hierarchical classification method, as in hclust. Should be one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
`randomize`	a `boolean` indicating if the order of sites in the dissimilarity matrix should be randomized, to account for the effect of site order on the resulting tree.
`n_runs`	number of randomization trials of the dissimilarity matrix.
`keep_trials`	a `boolean` indicating if all random trial results should be stored in the output object (set to FALSE to save space if your `dissimilarity` object is large).
`optimal_tree_method`	a `character` indicating how the final tree should be obtained from all trials. The only option currently is "best", which means the tree with the best cophenetic correlation coefficient will be chosen.
`n_clust`	an `integer` or an `integer` vector indicating the number of clusters to be obtained from the hierarchical tree, or the output from partition_metrics. Should not be used at the same time as `cut_height`.
`cut_height`	a `numeric` vector indicating the height(s) at which the tree should be cut. Should not be used at the same time as `n_clust`.
`find_h`	a `boolean` indicating if the height of cut should be found for the requested `n_clust`.
`h_max`	a `numeric` indicating the maximum possible tree height for the chosen `index`.
`h_min`	a `numeric` indicating the minimum possible height in the tree for the chosen `index`.
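
The long-format input described for `dissimilarity` (two columns of site pairs followed by index columns) can be pictured with a small base R sketch. This is only an illustration under stated assumptions: the toy data frame `pairs`, its column names, and the conversion steps are hypothetical and are not taken from the package's internals.

```r
# Hypothetical toy input: three sites, one dissimilarity index in column 3
pairs <- data.frame(
  Site1 = c("A", "A", "B"),
  Site2 = c("B", "C", "C"),
  Simpson = c(0.2, 0.6, 0.5)
)

index <- names(pairs)[3]          # mirrors the default of the `index` argument
s1 <- as.character(pairs$Site1)
s2 <- as.character(pairs$Site2)
sites <- unique(c(s1, s2))

# Fill a symmetric square matrix from the pairwise rows
m <- matrix(0, nrow = length(sites), ncol = length(sites),
            dimnames = list(sites, sites))
m[cbind(s1, s2)] <- pairs[[index]]
m[cbind(s2, s1)] <- pairs[[index]]

# A dist object is what stats::hclust() ultimately works on
d <- as.dist(m)
```

Passing a `dist` object directly, as the argument description allows, skips this conversion step.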

Details

The function is based on hclust. The default method for the hierarchical tree is "average", i.e. UPGMA, as it has been recommended as the best method to generate a tree from beta-diversity dissimilarity (Kreft and Jetz 2010).
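
The randomization and tree selection idea can be sketched with base R alone. This is a minimal illustration, not the package's implementation: the simulated data, the object names (`trials`, `coph_cors`, `best`), and the exact selection step are assumptions; only the use of hclust, the "average" method, and a Spearman cophenetic correlation follow the text of this page.

```r
set.seed(1)  # only so that this sketch is reproducible

# Simulated stand-in for a beta-diversity dissimilarity between 12 sites
x <- matrix(runif(60), nrow = 12,
            dimnames = list(paste0("Site", 1:12), NULL))
d <- dist(x)
m <- as.matrix(d)

n_runs <- 30
trials <- lapply(seq_len(n_runs), function(i) {
  ord <- sample(rownames(m))            # random site order
  d_rand <- as.dist(m[ord, ord])        # same distances, shuffled order
  tree <- hclust(d_rand, method = "average")
  # cophenetic() returns distances in the same site order as d_rand
  coph <- cor(as.vector(d_rand), as.vector(cophenetic(tree)),
              method = "spearman")
  list(tree = tree, coph_cor = coph)
})

# Keep the trial with the highest cophenetic correlation
coph_cors <- vapply(trials, function(tr) tr$coph_cor, numeric(1))
best <- trials[[which.max(coph_cors)]]
```

With continuous simulated data like this, ties are absent and all trials typically give the same tree; site order mainly matters when the dissimilarity matrix contains tied values.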

Clusters can be obtained by two methods:

Specifying a desired number of clusters in n_clust
Specifying one or several heights of cut in cut_height

To find an optimal number of clusters, see partition_metrics().
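
The two ways of obtaining clusters correspond to the `k` and `h` arguments of stats::cutree, shown below on a built-in dataset. The last step illustrates how a cut height matching a requested number of clusters can be derived from the merge heights; it is a sketch under the assumption of a tree with distinct, increasing merge heights, and is not necessarily how `find_h` is implemented.

```r
# Small example tree built from a built-in dataset
tree <- hclust(dist(USArrests[1:15, ]), method = "average")

# 1. Request a number of clusters
by_k <- cutree(tree, k = 3)

# 2. Request one or several cut heights (one column per height)
by_h <- cutree(tree, h = c(100, 50))

# A height giving exactly k clusters lies between two consecutive merges:
# after n - k merges, k clusters remain
k <- 3
n <- length(tree$order)
h_k <- mean(tree$height[c(n - k, n - k + 1)])

# Cutting at that height reproduces the k-cluster partition
same_partition <- identical(cutree(tree, h = h_k), by_k)
```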

Value

A list of class bioregion.clusters with five slots:

name: character containing the name of the algorithm
args: list of input arguments as provided by the user
inputs: list of characteristics of the clustering process
algorithm: list of all objects associated with the clustering procedure, such as original cluster objects
clusters: data.frame containing the clustering results

In the algorithm slot, users can find the following elements:

trials: a list containing all randomization trials. Each trial contains the dissimilarity matrix, with site order randomized, the associated tree and the cophenetic correlation coefficient (Spearman) for that tree
final.tree: a hclust object containing the final hierarchical tree to be used
final.tree.coph.cor: the cophenetic correlation coefficient between the initial dissimilarity matrix and final.tree

Author(s)

Boris Leroy (leroy.boris@gmail.com), Pierre Denelle (pierre.denelle@gmail.com) and Maxime Lenormand (maxime.lenormand@inrae.fr)

References

Kreft H, Jetz W (2010). “A framework for delineating biogeographical regions based on species distributions.” Journal of Biogeography, 37, 2029–2053.

Examples

comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)

dissim <- dissimilarity(comat, metric = "all")

# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim, n_clust = 5)
tree1
plot(tree1)
str(tree1)
tree1$clusters

# User-defined height cut
# Only one height
tree2 <- hclu_hierarclust(dissim, cut_height = .05)
tree2
tree2$clusters

# Multiple heights
tree3 <- hclu_hierarclust(dissim, cut_height = c(.05, .15, .25))

tree3$clusters # Mind the order of height cuts: from deep to shallow cuts
# Info on each partition can be found in table cluster_info
tree3$cluster_info
plot(tree3)

# Recut the tree afterwards
tree3.1 <- cut_tree(tree3, n_clust = 5)

# Several numbers of clusters at once
tree4 <- hclu_hierarclust(dissim, n_clust = 1:19)

[Package bioregion version 1.1.1 Index]