cpThreshold {CliquePercolation} | R Documentation |
Optimizing k
And I
For Clique Percolation Community Detection
Description
Function for determining threshold value(s) (ratio of largest to second largest
community sizes, chi, entropy, fuzzy modularity, signed fuzzy modularity) of ranges
of k
and I
values to help deciding for optimal k
and I
values.
Usage
cpThreshold(
W,
method = c("unweighted", "weighted", "weighted.CFinder"),
k.range,
I.range,
threshold = c("largest.components.ratio", "chi", "entropy", "fuzzymod",
"signedfuzzymod")
)
Arguments
W |
A qgraph object or a symmetric matrix; see also qgraph |
method |
A string indicating the method to use
( |
k.range |
integer or vector of |
I.range |
integer or vector of |
threshold |
A string or vector indicating which threshold(s) to determine
( |
Details
Optimizing k
(clique size) and I
(Intensity threshold) in clique percolation
community detection is a difficult task. Farkas et al. (2007) recommend to look at the
ratio of the largest to second largest community sizes
(threshold = "largest.components.ratio"
) for very large networks or
the variance of the community sizes when removing the community size of the largest
community (threshold = "chi"
) for somewhat smaller networks. These thresholds were
derived from percolation theory. If I
for a certain k
is too high, no
community will be identified. If I
is too low, a giant community with all nodes
emerges. Just above this I
, the distribution of community sizes often follows a
power law, which constitutes a broad community sizes distribution. Farkas et al. (2007)
point out, that for such I
, the ratio of the largest to second largest community
sizes is approximately 2, constituting one way to optimize I
for each possible
k
. For somewhat smaller networks, the ratio can be rather unstable. Instead,
Farkas et al. (2007, p.8) propose to look at the variance of the community sizes after
removing the largest community. The idea is that when I
is rather low, one giant
community and multiple equally small ones occur. Then, the variance of the community
sizes of the small communities (removing the giant community) is low. When I
is high, only a few equally small communities will occur. Then, the variance of the
community sizes (after removing the largest community) will also be low. In between,
the variance will at some point be maximal, namely when the community size
distribution is maximally broad (power law-distributed). Thus, the maximal variance
could be used to optimize I
for various k
.
For very small networks, optimizing k
and I
based on the distribution of the
community sizes will be impossible, as too few communities will occur. Another possible
threshold for such networks is based on the entropy of the community sizes
(threshold = "entropy"
). Entropy can be interpreted as an indicator of how
surprising the respective solution is. The formula used here is based on Shannon
Information, namely
-\sum_{i=1}^N p_i * \log_2 p_i
with p_i
being the probability that a node is part of community i
. For instance,
if there are two communities, one of size 5 and one of size 3, the result would be
-((5/8 * \log_2 5/8) + (3/8 * \log_2 3/8)) = 1.46
When calculating entropy, the isolated nodes identified by clique percolation are treated as
a separate community. If there is only one community or only isolated nodes, entropy is
zero, indicating that the surprisingness is low. As compared to the ratio and chi
thresholds, entropy favors communities that are equal in size. Thus, it should not be
used for larger networks for which a broader community size distribution is preferred.
Note that the entropy threshold has not been validated for clique percolation as of now.
Initial simulation studies indicate that it consistently detects surprising community
partitions in smaller networks especially if there are cliques of larger k
.
Santiago et al. (2022) recently proposed in a simulation study that two
alternative metrics, fuzzy modularity and signed fuzzy modularity, showed good performance
in recovering the true community assignment in psychological networks with overlapping nodes
and can also be used to optimize k
(clique size) and I
(Intensity threshold).
See FuzzyMod and SignedFuzzyMod for more information.
Ratio thresholds can be determined only if there are at least two communities. Chi threshold can be determined only if there are at least three communities. If there are not enough communities for the respective threshold, their values are NA in the data frame. Entropy, fuzzy modularity, and signed fuzzy modularity can always be determined.
Value
A data frame with columns for k
, I
(if method = "weighted"
or method = "weighted.CFinder"
), number of communities, number of isolated
nodes, and results of the specified threshold(s).
Author(s)
Jens Lange, lange.jens@outlook.com
References
Farkas, I., Abel, D., Palla, G., & Vicsek, T. (2007). Weighted network modules. New Journal of Physics, 9, 180-180. http://doi.org/10.1088/1367-2630/9/6/180
Santiago, P. H. R., Soares, G. H., Quintero, A., & Jamieson, L. (2022). The performance of the Clique Percolation to identify overlapping symptoms in psychological networks. PsyArXiv. https://psyarxiv.com/fk963/
Examples
## Not run: ## Example for unweighted networks
# create qgraph object
W <- matrix(c(0,1,1,1,0,0,0,0,
0,0,1,1,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,1,1,1,0,
0,0,0,0,0,1,1,0,
0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0), nrow = 8, ncol = 8, byrow = TRUE)
W <- Matrix::forceSymmetric(W)
W <- qgraph::qgraph(W)
# determine entropy and fuzzy modularity thresholds for k = 3 and k = 4
results <- cpThreshold(W = W, method = "unweighted", k.range = c(3,4), threshold = c("entropy",
"fuzzymod"))
## Example for weighted networks; three large communities with I = 0.3, 0.2, and 0.1, respectively
# create qgraph object
W <- matrix(c(0,0.10,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0.10,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0.10,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0.10,0,0,0.10,0.20,0,0,0,0,0.20,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0.10,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0.20,0.30,0,0,0,0,0.30,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow = 22, ncol = 22, byrow = TRUE)
W <- Matrix::forceSymmetric(W)
W <- qgraph::qgraph(W, layout = "spring", edge.labels = TRUE)
# determine ratio, chi, entropy, fuzzy modularity, and signed fuzzy modularity
# thresholds for k = 3 and I from 0.3 to 0.09
results <- cpThreshold(W = W, method = "weighted", k.range = 3,
I.range = c(seq(0.3, 0.09, by = -0.01)),
threshold = c("largest.components.ratio","chi","entropy",
"fuzzymod","signedfuzzymod"))
## Example with Obama data set (see ?Obama)
# get data
data(Obama)
# estimate network
net <- qgraph::EBICglasso(qgraph::cor_auto(Obama), n = nrow(Obama))
# determine entropy, fuzzy modularity, and signed fuzzy modularity thresholds
# for k from 3 to 4 and I from 0.1 to 0.5
threshold <- cpThreshold(net, method = "weighted",
k.range = 3:4,
I.range = seq(0.1, 0.5, 0.01),
threshold = c("entropy", "fuzzymod", "signedfuzzymod"))
## End(Not run)