cpThreshold {CliquePercolation}R Documentation

Optimizing k And I For Clique Percolation Community Detection

Description

Function for determining threshold value(s) (ratio of largest to second largest community sizes, chi, entropy) of ranges of k and I values to help deciding for optimal k and I values.

Usage

cpThreshold(
  W,
  method = c("unweighted", "weighted", "weighted.CFinder"),
  k.range,
  I.range,
  threshold = c("largest.components.ratio", "chi", "entropy")
)

Arguments

W

A qgraph object or a symmetric matrix; see also qgraph

method

A string indicating the method to use ("unweighted", "weighted", or "weighted.CFinder"). See cpAlgorithm for more information

k.range

integer or vector of k value(s) for which threshold(s) are determined See cpAlgorithm for more information

I.range

integer or vector of I value(s) for which threshold(s) are determined See cpAlgorithm for more information

threshold

A string or vector indicating which threshold(s) to determine ("largest.components.ratio", "chi", "entropy"); see Details

Details

Optimizing k (clique size) and I (Intensity threshold) in clique percolation community detection is a difficult task. Farkas et al. (2007) recommend to look at the ratio of the largest to second largest community sizes (threshold = "largest.components.ratio") for very large networks or the variance of the community sizes when removing the community size of the largest community (threshold = "chi") for somewhat smaller networks. These thresholds were derived from percolation theory. If I for a certain k is too high, no community will be identified. If I is too low, a giant community with all nodes emerges. Just above this I, the distribution of community sizes often follows a power law, which constitutes a broad community sizes distribution. Farkas et al. (2007) point out, that for such I, the ratio of the largest to second largest community sizes is approximately 2, constituting one way to optimize I for each possible k. For somewhat smaller networks, the ratio can be rather unstable. Instead, Farkas et al. (2007, p.8) propose to look at the variance of the community sizes after removing the largest community. The idea is that when I is rather low, one giant community and multiple equally small ones occur. Then, the variance of the community sizes of the small communities (removing the giant community) is low. When I is high, only a few equally small communities will occur. Then, the variance of the community sizes (after removing the largest community) will also be low. In between, the variance will at some point be maximal, namely when the community size distribution is maximally broad (power law-distributed). Thus, the maximal variance could be used to optimize I for various k.

For very small networks, optimizing k and I based on the distribution of the community sizes will be impossible, as too few communities will occur. Another possible threshold for such networks is based on the entropy of the community sizes (threshold = "entropy"). Entropy can be interpreted as an indicator of how surprising the respective solution is. The formula used here is based on Shannon Information, namely

-∑_{i=1}^N p_i * \log_2 p_i

with p_i being the probability that a node is part of community i. For instance, if there are two communities, one of size 5 and one of size 3, the result would be

-((5/8 * \log_2 5/8) + (3/8 * \log_2 3/8)) = 1.46

When calculating entropy, the isolated nodes identified by clique percolation are treated as a separate community. If there is only one community or only isolated nodes, entropy is zero, indicating that the surprisingness is low. As compared to the ratio and chi thresholds, entropy favors communities that are equal in size. Thus, it should not be used for larger networks for which a broader community size distribution is preferred. Note that the entropy threshold has not been validated for clique percolation as of now. Initial simulation studies indicate that it consistently detects surprising community partitions in smaller networks especially if there are cliques of larger k.

Ratio thresholds can be determined only if there are at least two communities. Chi threshold can be determined only if there are at least three communities. If there are not enough communities for the respective threshold, their values are NA in the data frame. Entropy can always be determined.

Value

A data frame with columns for k, I (if method = "weighted" or method = "weighted.CFinder"), number of communities, number of isolated nodes, and results of the specified threshold(s).

Author(s)

Jens Lange, lange.jens@outlook.com

References

Farkas, I., Abel, D., Palla, G., & Vicsek, T. (2007). Weighted network modules. New Journal of Physics, 9, 180-180. http://doi.org/10.1088/1367-2630/9/6/180

Examples

## Example for unweighted networks

# create qgraph object
W <- matrix(c(0,1,1,1,0,0,0,0,
              0,0,1,1,0,0,0,0,
              0,0,0,0,0,0,0,0,
              0,0,0,0,1,1,1,0,
              0,0,0,0,0,1,1,0,
              0,0,0,0,0,0,1,0,
              0,0,0,0,0,0,0,1,
              0,0,0,0,0,0,0,0), nrow = 8, ncol = 8, byrow = TRUE)
W <- Matrix::forceSymmetric(W)
W <- qgraph::qgraph(W)

# determine entropy threshold for k = 3 and k = 4
results <- cpThreshold(W = W, method = "unweighted", k.range = c(3,4), threshold = "entropy")

## Example for weighted networks; three large communities with I = 0.3, 0.2, and 0.1, respectively

# create qgraph object
W <- matrix(c(0,0.10,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0.10,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0.10,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0.10,0,0,0.10,0.20,0,0,0,0,0.20,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0.10,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0.10,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0.10,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0.20,0.30,0,0,0,0,0.30,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.20,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.30,
              0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0), nrow = 22, ncol = 22, byrow = TRUE) 
W <- Matrix::forceSymmetric(W)
W <- qgraph::qgraph(W, layout = "spring", edge.labels = TRUE)

# determine ratio, chi, and entropy thresholds for k = 3 and I from 0.3 to 0.09
results <- cpThreshold(W = W, method = "weighted", k.range = 3,
                       I.range = c(seq(0.3, 0.09, by = -0.01)),
                       threshold = c("largest.components.ratio","chi","entropy"))

## Example with Obama data set (see ?Obama)

# get data
data(Obama)

# estimate network
net <- qgraph::EBICglasso(qgraph::cor_auto(Obama), n = nrow(Obama))

# determine entropy threshold for k from 3 to 4 and I from 0.1 to 0.5
threshold <- cpThreshold(net, method = "weighted",
                         k.range = 3:4,
                         I.range = seq(0.1, 0.5, 0.01),
                         threshold = "entropy")


[Package CliquePercolation version 0.3.0 Index]