compare_partitions {bioregion}    R Documentation

Compare cluster memberships among multiple partitions

Description

This function computes pairwise comparisons among several partitions, usually the outputs of netclu_, hclu_ or nhclu_ functions. It also provides the confusion matrices from the pairwise comparisons, so that users can compute additional comparison metrics.

Usage

compare_partitions(
  cluster_object,
  sample_comparisons = NULL,
  indices = c("rand", "jaccard"),
  cor_frequency = FALSE,
  store_pairwise_membership = TRUE,
  store_confusion_matrix = TRUE
)

Arguments

cluster_object

a bioregion.clusters object, a data.frame, or a list of data.frames containing multiple partitions. At least two partitions are required. If a list of data.frames is provided, they should all have the same number of rows (i.e., the same items are clustered in all partitions).

sample_comparisons

NULL or a positive integer. If an integer is provided, computation time is reduced by sampling that number of pairwise comparisons in cluster membership of items. Useful if the number of items clustered is high. Suggested values are 5000 or 10000.

indices

NULL or character. Indices to compute for the pairwise comparison of partitions. Currently available metrics are "rand" and "jaccard".

cor_frequency

a boolean. If TRUE, the function computes the correlation between each partition and the total frequency of co-membership of items across all partitions. Useful to identify which partition(s) is (are) most representative of all the computed partitions.

store_pairwise_membership

a boolean. If TRUE, the pairwise membership of items is stored in the output object.

store_confusion_matrix

a boolean. If TRUE, the confusion matrices of pairwise partition comparisons are stored in the output object.

Details

This function proceeds in two main steps:

  1. The first step is done within each partition. It compares all pairs of items and records whether they are clustered together (TRUE) or separately (FALSE) in each partition. For example, if site 1 and site 2 belong to the same cluster in partition 1, then the pairwise membership site1_site2 will be TRUE. The output of this first step is stored in the slot pairwise_membership if store_pairwise_membership = TRUE.
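The first step can be illustrated outside the package. The following is a minimal sketch in base R, not the package's internal code; the object names partition1 and item_pairs and the site labels are made up for the illustration.

```r
# Hypothetical partition: cluster labels for four sites
partition1 <- c(site1 = 1, site2 = 1, site3 = 2, site4 = 2)

# All unordered pairs of items (one column per pair)
item_pairs <- combn(names(partition1), 2)

# TRUE when both items of a pair carry the same cluster label
pairwise_membership <- partition1[item_pairs[1, ]] == partition1[item_pairs[2, ]]
names(pairwise_membership) <- paste(item_pairs[1, ], item_pairs[2, ], sep = "_")

pairwise_membership
```

With n items there are n * (n - 1) / 2 such pairs, which is why sampling pairs through sample_comparisons can help when many items are clustered.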

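  2. The second step compares all pairs of partitions by checking whether their pairwise memberships agree. To do so, for each pair of partitions, the function computes a confusion matrix with four elements:

     - the number of pairs of items grouped together in both partitions;
     - the number of pairs grouped together in the first partition but not in the second;
     - the number of pairs grouped together in the second partition but not in the first;
     - the number of pairs grouped together in neither partition.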

The confusion matrix is stored in confusion_matrix if store_confusion_matrix = TRUE.
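As a rough illustration of the second step, the four counts of a confusion matrix can be obtained from two logical pairwise membership vectors. This is a sketch in base R under assumed names (pair_membership, m1, m2 and the element names of confusion are hypothetical); the layout of the matrices stored by the package may differ.

```r
# Hypothetical helper: TRUE/FALSE co-membership for every pair of items
pair_membership <- function(labels) {
  pairs <- combn(length(labels), 2)
  labels[pairs[1, ]] == labels[pairs[2, ]]
}

# Two illustrative partitions of the same five items
m1 <- pair_membership(c(1, 1, 2, 2, 3))
m2 <- pair_membership(c(1, 1, 1, 2, 2))

# Cross-tabulate the pairwise memberships into four counts
confusion <- c(
  both        = sum(m1 & m2),    # grouped together in both partitions
  first_only  = sum(m1 & !m2),   # grouped in the first partition only
  second_only = sum(!m1 & m2),   # grouped in the second partition only
  neither     = sum(!m1 & !m2)   # separated in both partitions
)
confusion
```

The four counts always sum to the number of item pairs compared, here choose(5, 2) = 10.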

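Based on the confusion matrices, we can compute a range of indices to indicate the agreement among partitions. As of now, we have implemented:

  - the Rand index ("rand"): the proportion of all pairs of items on which two partitions agree, i.e., pairs grouped together in both partitions or separated in both;
  - the Jaccard index ("jaccard"): the number of pairs grouped together in both partitions, divided by the number of pairs grouped together in at least one of them.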

These two metrics are complementary: the Jaccard index only considers the pairs of items that are clustered together in at least one partition, whereas the Rand index also counts as agreement the pairs of items that are kept apart in both partitions. For example, take two partitions which never group together the same pairs of items. Their Jaccard index will be 0, whereas the Rand index can be > 0 because of the pairs that are not grouped together in either partition.
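The contrast between the two indices can be checked on a toy case. The sketch below uses the standard pair-counting definitions of the Rand and Jaccard indices in base R; the helper pair_membership and the two partitions are hypothetical and chosen so that no pair of items is grouped in both partitions.

```r
# Hypothetical helper: TRUE/FALSE co-membership for every pair of items
pair_membership <- function(labels) {
  pairs <- combn(length(labels), 2)
  labels[pairs[1, ]] == labels[pairs[2, ]]
}

# Two partitions of four items that never group the same pair
m1 <- pair_membership(c(1, 1, 2, 2))
m2 <- pair_membership(c(1, 2, 1, 2))

both        <- sum(m1 & m2)
first_only  <- sum(m1 & !m2)
second_only <- sum(!m1 & m2)
neither     <- sum(!m1 & !m2)

# Rand: agreements (grouped in both or separated in both) over all pairs
rand <- (both + neither) / (both + first_only + second_only + neither)

# Jaccard: pairs grouped in both over pairs grouped in at least one
jaccard <- both / (both + first_only + second_only)

c(rand = rand, jaccard = jaccard)
```

Here the Jaccard index is 0, while the Rand index is 1/3 because two of the six pairs are separated in both partitions.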

Additional indices can be manually computed by the users on the basis of the list of confusion matrices.
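As one possible illustration, the Fowlkes-Mallows index can be derived from the same four counts. This is a sketch only: the named vector confusion below is a hypothetical stand-in, and the actual layout of the stored confusion matrices should be checked in the output object before adapting it.

```r
# Hypothetical confusion counts for one pair of partitions
confusion <- c(both = 1, first_only = 1, second_only = 3, neither = 5)

# Fowlkes-Mallows index: pairs grouped in both partitions, divided by the
# geometric mean of the numbers of pairs grouped in each partition
fowlkes_mallows <- unname(
  confusion["both"] /
    sqrt((confusion["both"] + confusion["first_only"]) *
         (confusion["both"] + confusion["second_only"]))
)
fowlkes_mallows
```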

In some cases, users may be interested in finding which of the partitions is most representative of all partitions. To find out, we can compare the pairwise membership of each partition with the total frequency of pairwise membership across all partitions. This correlation can be requested with cor_frequency = TRUE.
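The idea can be sketched in base R as follows. The table below is made up, and the use of a Pearson correlation through cor() is an assumption of this illustration; the correlation computed by the package may be defined differently.

```r
# Hypothetical pairwise membership table: rows are item pairs,
# columns are partitions
pairwise_membership <- cbind(
  p1 = c(TRUE,  FALSE, FALSE, FALSE, FALSE, TRUE),
  p2 = c(TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE),
  p3 = c(FALSE, TRUE,  FALSE, FALSE, TRUE,  FALSE)
)

# Number of partitions in which each pair of items is grouped together
freq <- rowSums(pairwise_membership)

# Correlation of each partition's memberships with the overall frequency
# (Pearson correlation, assumed for this sketch)
partition_cor <- apply(pairwise_membership, 2,
                       function(x) cor(as.numeric(x), freq))
partition_cor
```

Under this reading, the partition with the highest correlation is the one whose groupings best match what most partitions agree on.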

Value

A list with 4 to 7 elements:

Author(s)

Boris Leroy (leroy.boris@gmail.com), Maxime Lenormand (maxime.lenormand@inrae.fr) and Pierre Denelle (pierre.denelle@gmail.com)

See Also

partition_metrics

Examples

# A simple case with four partitions of four items
partitions <- data.frame(matrix(c(1,2,1,1,1,2,2,1,2,1,3,1,2,1,4,2),
                                nrow = 4, ncol = 4,
                                byrow = TRUE))
partitions
compare_partitions(partitions)

# Find out which partitions are most representative
compare_partitions(partitions,
                   cor_frequency = TRUE)
                                


[Package bioregion version 1.1.1 Index]