find_optimal_n {bioregion}                                      R Documentation
Description

This function searches for one or several optimal numbers of clusters by applying one or several criteria to a set of ordered partitions. It is typically used to find where to cut a hierarchical tree, or to choose among a range of partitions obtained from k-means or PAM. Users are advised to be careful if applying it in other cases (e.g., partitions that are not ordered in an increasing or decreasing sequence, or partitions that are not related to each other).

Usage
find_optimal_n(
  partitions,
  metrics_to_use = "all",
  criterion = "elbow",
  step_quantile = 0.99,
  step_levels = NULL,
  step_round_above = TRUE,
  metric_cutoffs = c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999),
  n_breakpoints = 1,
  plot = TRUE
)
Arguments

partitions
    a data.frame of evaluation metrics computed for a set of ordered
    partitions, typically the output of partition_metrics() (see Examples).

metrics_to_use
    character string or vector of character strings indicating upon which
    metric(s) in partitions the criterion should be applied (default "all",
    i.e. all metrics).

criterion
    character string indicating the criterion to be used to identify optimal
    number(s) of clusters. Available methods currently include "elbow",
    "increasing_step", "decreasing_step", "cutoffs", "breakpoints", "min" and
    "max" (default "elbow").

step_quantile
    if criterion = "increasing_step" or "decreasing_step", the quantile of
    differences between two successive partitions above which steps are
    selected as most important (default 0.99, i.e. the largest 1% of steps).

step_levels
    if criterion = "increasing_step" or "decreasing_step", the number of
    largest steps to keep as cutoffs (e.g., step_levels = 3 keeps the three
    largest steps).

step_round_above
    a boolean indicating if the optimal number of clusters should be picked
    above (TRUE) or below (FALSE) the identified steps (default TRUE).

metric_cutoffs
    if criterion = "cutoffs", the cutoff value(s) of the evaluation metric
    from which the number(s) of clusters should be derived (default
    c(0.5, 0.75, 0.9, 0.95, 0.99, 0.999)).

n_breakpoints
    the number of breakpoints to look for in the curve (defaults to 1); used
    when criterion = "breakpoints".

plot
    a boolean indicating if a plot of the first evaluation metric should be
    drawn with the identified optimal number(s) of clusters (default TRUE).
Details

This function explores the relationship between the evaluation metric and the number of clusters, and applies the chosen criterion to search for optimal number(s) of clusters.
Please read the Note section about the following criteria.
Foreword:
Here we implemented a set of criteria that are either commonly found in the literature or recommended in the bioregionalisation literature. Nevertheless, we also advocate moving beyond the "search for one optimal number of clusters" paradigm and investigating "multiple optimal numbers of clusters". Indeed, using only one optimal number of clusters may oversimplify the natural complexity of biological datasets and, for example, ignore the often hierarchical / nested nature of bioregionalisations. Using multiple partitions likely avoids this oversimplification bias and may convey more information. See, for example, the reanalysis of Holt et al. (2013) by Ficetola et al. (2017), where deep, intermediate and shallow cuts were used.
Following this rationale, several of the criteria implemented here can/will return multiple "optimal" numbers of clusters, depending on user choices.
Criteria to find optimal number(s) of clusters
elbow:

This method consists in finding one elbow in the evaluation metric curve, as
is commonly done in clustering analyses. The idea is to approximate the
number of clusters at which the evaluation metric no longer increments. It is
based on a fast method that finds the maximum distance between the curve and
a straight line linking the points at the minimum and maximum numbers of
clusters. The code we use here is based on code written by Esben Eickhardt,
available at
https://stackoverflow.com/questions/2018178/finding-the-best-trade-off-point-on-a-curve/42810075#42810075.
The code has been modified to work on both increasing and decreasing
evaluation metrics.
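To illustrate the idea (a minimal sketch in base R, not the function's internal code; n and metric are hypothetical vectors of cluster numbers and metric values):

# Toy, saturating metric curve
n      <- 2:15
metric <- 1 - exp(-0.5 * n)

# Straight line linking the first and last points of the curve
slope     <- (metric[length(metric)] - metric[1]) / (n[length(n)] - n[1])
intercept <- metric[1] - slope * n[1]

# Distance of each point to that line; the elbow is where it is maximal
dist_to_line <- abs(slope * n - metric + intercept) / sqrt(slope^2 + 1)
n[which.max(dist_to_line)]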
increasing_step or decreasing_step:

This method consists in identifying the most important changes, or steps, in
the evaluation metric. The objective can be to look either for the largest
increases (increasing_step) or for the largest decreases (decreasing_step).
Steps are calculated based on the pairwise differences between partitions.
Therefore, this is relative to the distribution of differences in the
evaluation metric over the tested partitions. Specify step_quantile as the
quantile cutoff above which steps will be selected as most important (by
default, 0.99, i.e. the largest 1% of steps will be selected). Alternatively,
you can also choose to specify the number of top steps to keep, e.g. to keep
the largest three steps, specify step_levels = 3. Basically, this method will
emphasize the most important changes in the evaluation metric as a first
approximation of where important cuts can be chosen.

Please note that you should choose between increasing_step and
decreasing_step depending on the nature of your evaluation metric. For
example, for metrics that are monotonously decreasing with the number of
clusters (e.g., the endemism metrics "avg_endemism" & "tot_endemism"), you
should choose decreasing_step. On the contrary, for metrics that are
monotonously increasing with the number of clusters (e.g., "pc_distance"),
you should choose increasing_step.
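As a rough illustration of how such steps can be extracted (a sketch only; n and metric are hypothetical vectors):

# Toy decreasing metric
n      <- 2:15
metric <- c(5, 4.8, 3.1, 3, 2.9, 1.5, 1.4, 1.35, 1.3, 1.28, 1.27, 1.26, 1.25, 1.24)

# Steps: differences between consecutive partitions
steps <- abs(diff(metric))

# Quantile-based selection (analogue of step_quantile)
n[-1][steps >= quantile(steps, 0.9)]

# Rank-based selection (analogue of step_levels = 3): the three largest steps
n[-1][order(steps, decreasing = TRUE)[1:3]]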
cutoffs:

This method consists in specifying the cutoff value(s) of the evaluation
metric from which the number(s) of clusters should be derived. This is the
method used by Holt et al. (2013). Note, however, that the cutoffs suggested
by Holt et al. (0.9, 0.95, 0.99, 0.999) may only be relevant at very large
spatial scales, and lower cutoffs should be considered at finer spatial
scales.
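As a sketch of the underlying logic, assuming an increasing metric such as "pc_distance" (illustrative only; n and metric are hypothetical vectors):

# Toy increasing metric scaled to [0, 1]
n      <- 2:15
metric <- cumsum(runif(14))
metric <- metric / max(metric)

# First number of clusters at which the metric reaches each cutoff
cutoffs <- c(0.5, 0.75, 0.9)
sapply(cutoffs, function(co) n[which(metric >= co)[1]])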
breakpoints:

This method consists in finding break points in the curve using a segmented
regression. Users have to specify the expected number of break points in
n_breakpoints (defaults to 1). Note that since this method relies on a
regression model, it should probably not be applied with a low number of
partitions.
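The sketch below shows the general idea for a single break point, using a simple residual-sum-of-squares search over two linear segments (a simplified stand-in for a proper segmented regression; n and metric are hypothetical vectors):

# Toy decreasing metric with a change of slope
n      <- 2:15
metric <- c(5, 4.5, 4, 3.6, 3.3, 3.1, 3, 2.95, 2.9, 2.87, 2.85, 2.84, 2.83, 2.82)

# For each candidate break point, fit two linear segments and record the
# residual sum of squares; the best break point minimizes it
candidates <- n[3:(length(n) - 2)]
rss <- sapply(candidates, function(bp) {
  left  <- lm(metric[n <= bp] ~ n[n <= bp])
  right <- lm(metric[n >  bp] ~ n[n >  bp])
  sum(resid(left)^2) + sum(resid(right)^2)
})
candidates[which.min(rss)]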
min & max:

These criteria pick the optimal partition(s) at, respectively, the minimum or
the maximum value of the evaluation metric.
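A trivial sketch of these two criteria (illustrative only; n and metric are hypothetical vectors):

n      <- 2:15
metric <- runif(14)
n[which.min(metric)]   # criterion = "min"
n[which.max(metric)]   # criterion = "max"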
Value

A list of class bioregion.optimal.n with the following elements:

args
    the input arguments.

evaluation_df
    the input evaluation data.frame, appended with boolean columns
    identifying the optimal numbers of clusters.

optimal_nb_clusters
    a list containing the optimal number(s) of cluster(s) for each metric
    specified in metrics_to_use, based on the chosen criterion.

plot
    if requested, the plot is stored in this slot.
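For example (a minimal sketch, assuming a has been built with partition_metrics() as in the Examples below), the elements can be accessed by name:

res <- find_optimal_n(a)
res$optimal_nb_clusters   # optimal number(s) of clusters per metric
head(res$evaluation_df)   # evaluation data.frame with optimal-n columns
res$plot                  # the plot, if plot = TRUE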
Note

Please note that finding the optimal number of clusters is a procedure which normally requires decisions from the users, and as such can hardly be fully automated. Users are strongly advised to read the references indicated below to look for guidance on how to choose their optimal number(s) of clusters. Consider the "optimal" numbers of clusters returned by this function as a first approximation of the best numbers for your bioregionalisation.
Author(s)

Boris Leroy (leroy.boris@gmail.com), Maxime Lenormand (maxime.lenormand@inrae.fr) and Pierre Denelle (pierre.denelle@gmail.com)
References

Castro-Insua A, Gómez-Rodríguez C, Baselga A (2018). "Dissimilarity measures affected by richness differences yield biased delimitations of biogeographic realms." Nature Communications, 9(1), 9–11.

Ficetola GF, Mazel F, Thuiller W (2017). "Global determinants of zoogeographical boundaries." Nature Ecology & Evolution, 1, 0089.

Holt BG, Lessard J, Borregaard MK, Fritz SA, Araújo MB, Dimitrov D, Fabre P, Graham CH, Graves GR, Jønsson KA, Nogués-Bravo D, Wang Z, Whittaker RJ, Fjeldså J, Rahbek C (2013). "An update of Wallace's zoogeographic regions of the world." Science, 339(6115), 74–78.

Kreft H, Jetz W (2010). "A framework for delineating biogeographical regions based on species distributions." Journal of Biogeography, 37, 2029–2053.

Langfelder P, Zhang B, Horvath S (2008). "Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R." Bioinformatics, 24(5), 719–720.
Examples

comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site", 1:20)
colnames(comat) <- paste0("Species", 1:25)
comnet <- mat_to_net(comat)
dissim <- dissimilarity(comat, metric = "all")
# User-defined number of clusters
tree1 <- hclu_hierarclust(dissim,
                          n_clust = 2:15,
                          index = "Simpson")
tree1
a <- partition_metrics(tree1,
                       dissimilarity = dissim,
                       net = comnet,
                       species_col = "Node2",
                       site_col = "Node1",
                       eval_metric = c("tot_endemism",
                                       "avg_endemism",
                                       "pc_distance",
                                       "anosim"))
find_optimal_n(a)
find_optimal_n(a, criterion = "increasing_step")
find_optimal_n(a, criterion = "decreasing_step")
find_optimal_n(a, criterion = "decreasing_step",
               step_levels = 3)
find_optimal_n(a, criterion = "decreasing_step",
               step_quantile = .9)
find_optimal_n(a, criterion = "breakpoints")