find_optimal_n {bioregion}    R Documentation

Search for an optimal number of clusters in a list of partitions

Description

This function optimizes one or several criteria on a set of ordered partitions. It is typically used to find one (or several) optimal number(s) of clusters, for example when choosing where to cut a hierarchical tree, or among a range of partitions obtained from k-means or PAM. Use it with caution in other cases (e.g., partitions that are not ordered in an increasing or decreasing sequence, or partitions that are unrelated to each other).

Usage

find_optimal_n(
  partitions,
  metrics_to_use = "all",
  criterion = "elbow",
  step_quantile = 0.99,
  step_levels = NULL,
  step_round_above = TRUE,
  metric_cutoffs = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999),
  n_breakpoints = 1,
  plot = TRUE
)

Arguments

partitions

a bioregion.partition.metrics object (output from partition_metrics()) or a data.frame with the first two columns named "K" (partition name) and "n_clusters" (number of clusters), and the following columns containing evaluation metrics (numeric values)

metrics_to_use

character string or vector of character strings indicating upon which metric(s) in partitions the optimal number of clusters should be calculated. Defaults to "all" which means all metrics available in partitions will be used

criterion

character string indicating the criterion to be used to identify optimal number(s) of clusters. Available methods currently include "elbow", "increasing_step", "decreasing_step", "cutoff", "breakpoints", "min" or "max". Default is "elbow". See details.

step_quantile

if criterion = "increasing_step" or "decreasing_step", specify here the quantile of differences between two consecutive k to be used as the cutoff to identify the most important steps in eval_metric

step_levels

if criterion = "increasing_step" or "decreasing_step", specify here the number of largest steps to keep as cutoffs

step_round_above

a boolean indicating whether the optimal number of clusters should be picked above or below the identified steps. Each step corresponds to a sudden increase or decrease between partition X and partition X+1: should the optimal partition be X+1 (step_round_above = TRUE) or X (step_round_above = FALSE)? Defaults to TRUE

metric_cutoffs

if criterion = "cutoff", specify here the cutoffs of eval_metric at which the number of clusters should be extracted

n_breakpoints

specify here the number of breakpoints to look for in the curve. Defaults to 1

plot

a boolean indicating whether a plot of the first eval_metric should be drawn with the identified optimal numbers of clusters
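As a rough illustration of the data.frame form that partitions can take, the sketch below builds a minimal table by hand in base R. The metric column name pc_distance is borrowed from the Examples section; the object name partitions_df and all numeric values are invented for illustration only.

```r
# Hypothetical table; metric values are made up for illustration only
partitions_df <- data.frame(
  K = paste0("K_", 2:6),                         # partition names
  n_clusters = 2:6,                              # number of clusters per partition
  pc_distance = c(0.41, 0.58, 0.69, 0.74, 0.76)  # one evaluation metric
)

# The first two columns must be named "K" and "n_clusters"
names(partitions_df)[1:2]
```

Going by the argument description above, a table shaped like this could presumably be supplied as partitions; that call is not run here.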

Details

This function explores the relationship between the evaluation metric and the number of clusters (evaluation metric ~ number of clusters), and applies a criterion to search for an optimal number of clusters.
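To make the idea of applying a criterion to the metric ~ number of clusters curve more concrete, here is a minimal base R sketch of a step-type criterion. It is not the package's implementation: the helper find_steps() and the metric values are hypothetical, and the way the quantile cutoff is applied is an assumption based on the descriptions of step_quantile and step_round_above.

```r
# Hypothetical helper, not part of bioregion: locate the largest increases
# of a metric along ordered partitions.
find_steps <- function(n_clusters, metric, step_quantile = 0.99,
                       round_above = TRUE) {
  steps <- diff(metric)  # change between partition X and partition X+1
  cutoff <- quantile(steps, probs = step_quantile, names = FALSE)
  idx <- which(steps >= cutoff)  # positions X of the largest steps
  # Pick X+1 (above the step) or X (below the step)
  if (round_above) n_clusters[idx + 1] else n_clusters[idx]
}

n <- 2:8
metric <- c(0.10, 0.12, 0.45, 0.47, 0.48, 0.80, 0.81)  # invented values

find_steps(n, metric, step_quantile = 0.8)
find_steps(n, metric, step_quantile = 0.8, round_above = FALSE)
```

A decreasing-step variant would look at the most negative differences instead; the same rounding choice applies.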

Please read the Note section for caveats that apply to these criteria.

Foreword:

Here we implemented a set of criteria commonly found in the general literature or recommended in the bioregionalisation literature. Nevertheless, we also advocate moving beyond the "search for one optimal number of clusters" paradigm and investigating "multiple optimal numbers of clusters". Indeed, using only one optimal number of clusters may oversimplify the natural complexity of biological datasets and, for example, ignore the often hierarchical / nested nature of bioregionalisations. Using multiple partitions likely avoids this oversimplification bias and may convey more information. See, for example, the reanalysis of Holt et al. (2013) by Ficetola et al. (2017), who used deep, intermediate and shallow cuts.

Following this rationale, several of the criteria implemented here can/will return multiple "optimal" numbers of clusters, depending on user choices.
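One way to see how a single criterion can yield several "optimal" numbers of clusters is a cutoff-style rule with more than one cutoff. The base R sketch below is only an illustration: the helper n_at_cutoffs() is hypothetical, the metric values are invented, and reading a cutoff as "the first partition whose metric reaches this value" is an assumption rather than the package's documented behaviour.

```r
# Hypothetical helper, not part of bioregion: for each cutoff, return the
# smallest number of clusters whose metric value reaches that cutoff.
n_at_cutoffs <- function(n_clusters, metric, cutoffs) {
  vapply(cutoffs, function(cut) {
    reached <- which(metric >= cut)
    # NA when no partition reaches the cutoff
    if (length(reached) == 0) NA_integer_ else as.integer(n_clusters[min(reached)])
  }, integer(1))
}

n <- 2:8
metric <- c(0.30, 0.55, 0.70, 0.80, 0.91, 0.96, 0.99)  # invented values

# Three cutoffs give three candidate numbers of clusters
n_at_cutoffs(n, metric, cutoffs = c(0.5, 0.75, 0.9))
```

Each cutoff maps to its own candidate partition, which mirrors the deep / intermediate / shallow cuts mentioned above.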

Criteria to find optimal number(s) of clusters

Value

a list of class bioregion.optimal.n with three elements:

Note

Please note that finding the optimal number of clusters is a procedure that normally requires decisions from the user, and as such can hardly be fully automated. Users are strongly advised to read the references listed below for guidance on how to choose their optimal number(s) of clusters. Consider the "optimal" numbers of clusters returned by this function as a first approximation of the best numbers for your bioregionalisation.

Author(s)

Boris Leroy (leroy.boris@gmail.com), Maxime Lenormand (maxime.lenormand@inrae.fr) and Pierre Denelle (pierre.denelle@gmail.com)

References

Castro-Insua A, Gómez-Rodríguez C, Baselga A (2018). “Dissimilarity measures affected by richness differences yield biased delimitations of biogeographic realms.” Nature Communications, 9(1), 9–11.

Ficetola GF, Mazel F, Thuiller W (2017). “Global determinants of zoogeographical boundaries.” Nature Ecology & Evolution, 1, 0089.

Holt BG, Lessard J, Borregaard MK, Fritz SA, Araújo MB, Dimitrov D, Fabre P, Graham CH, Graves GR, Jønsson Ka, Nogués-Bravo D, Wang Z, Whittaker RJ, Fjeldså J, Rahbek C (2013). “An update of Wallace's zoogeographic regions of the world.” Science, 339(6115), 74–78.

Kreft H, Jetz W (2010). “A framework for delineating biogeographical regions based on species distributions.” Journal of Biogeography, 37, 2029–2053.

Langfelder P, Zhang B, Horvath S (2008). “Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R.” Bioinformatics, 24(5), 719–720.

Examples

comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)

comnet <- mat_to_net(comat)

dissim <- dissimilarity(comat, metric = "all")

# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim,
                          n_clust = 2:15,
                          index = "Simpson")
tree1

a <- partition_metrics(tree1,
                       dissimilarity = dissim,
                       net = comnet,
                       species_col = "Node2",
                       site_col = "Node1",
                       eval_metric = c("tot_endemism",
                                       "avg_endemism",
                                       "pc_distance",
                                       "anosim"))

# Default criterion ("elbow") on all metrics
find_optimal_n(a)

# Step-based criteria
find_optimal_n(a, criterion = "increasing_step")
find_optimal_n(a, criterion = "decreasing_step")
# Keep the three largest steps
find_optimal_n(a, criterion = "decreasing_step",
               step_levels = 3)
# Use the 0.9 quantile of the steps as the cutoff
find_optimal_n(a, criterion = "decreasing_step",
               step_quantile = .9)

# Breakpoint-based criterion
find_optimal_n(a, criterion = "breakpoints")


[Package bioregion version 1.1.1 Index]