R: Extract Link Communities from a Network

getLinkCommunities {linkcomm}

R Documentation

Extract Link Communities from a Network

Description

This function extracts link communities from networks of arbitrary size and type.

Usage

getLinkCommunities(network, hcmethod = "average", use.all.edges = FALSE,
                   edglim = 10^4, directed = FALSE, dirweight = 0.5,
                   bipartite = FALSE, dist = NULL, plot = TRUE, 
                   check.duplicates = TRUE, removetrivial = TRUE, 
                   verbose = TRUE)

Arguments

`network`	An edge list, which is a matrix or data frame with 2 or 3 columns. The first 2 columns contain the nodes that interact with each other, which can be character strings or integer values. The optional third column is a numerical vector of weights for each edge. Can also be a character string naming a file containing an edge list.
`hcmethod`	A character string naming the hierarchical clustering method to use. Can be one of `"ward"`, `"single"`, `"complete"`, `"average"`, `"mcquitty"`, `"median"`, or `"centroid"`. Defaults to `"average"` (if the number of edges is greater than `edglim` then `"single"` is used).
`use.all.edges`	Logical, indicating whether edge similarities should be calculated for all pairs of edges (TRUE), or only for edge pairs that share a node (FALSE) as in the original Ahn et al. (2010) algorithm. Defaults to FALSE. If TRUE, networks are treated as undirected.
`edglim`	An integer value indicating the largest number of edges permissible for the hierarchical clustering to be handled in memory. Above this value the upper triangular dissimilarity matrix will be written to disk and read and written as clustering proceeds until the file size is 0 bytes (see Details below). Defaults to `10^{4}`.
`directed`	Logical, whether the network is directed. Defaults to FALSE.
`dirweight`	A numerical value between 1 and 0 inclusive indicating the weight that will be attached to edges that share a node but are in the opposite orientation. Defaults to 0.5. Will be ignored if `directed` = FALSE.
`bipartite`	Logical, whether the input network is bi-partite. See Details for an explanation of how bi-partite networks are handled. Defaults to FALSE.
`dist`	An object of class `"dist"` representing a user-defined distance matrix for the network. Note, this must be the lower triangular matrix of an n*n distance matrix, where n is the number of edges in the network (make sure duplicated edges are removed). If NULL, then the distance matrix is calculated by the algorithm. Defaults to NULL.
`plot`	Logical, whether to plot summary output from the algorithm (dendrogram and partition density plot). Defaults to TRUE. Note, if there are more than 1500 but less than edglim edges then the dendrogram will be plotted without colour and in a separate panel from the partition density to avoid lengthy rendering times; when there are more than edglim edges then only the partition density will be plotted.
`check.duplicates`	Logical, whether to check for and remove loops, duplicate edges, and bi-directional edges. Defaults to TRUE. Note, if you wish to avoid this step by setting this parameter to FALSE then you must be certain that there are no duplicate edges in the network.
`removetrivial`	Logical, whether to remove trivial community clusters that contain 2 edges. Defaults to TRUE.
`verbose`	Logical, whether to display the progress of the algorithm on the screen. Defaults to TRUE.

Details

This is the main algorithm used for extracting link communities from networks of arbitrary size and type. Input networks may be directed, weighted, both directed and weighted, or neither. The algorithm used is the one outlined by Ahn et al. (2010). The similarity between links, e_{ik} and e_{jk}, that share a node, k, is calculated using the Jaccard coefficient

S(e_{ik},e_{jk})=\frac{|n_{+}(i)\cap n_{+}(j)|}{|n_{+}(i)\cup n_{+}(j)|}

where n_{+}(i) refers to the first-order node neighbourhood of node i, which includes node i itself (inclusive neighbour set). After assigning pairwise similarities to all of the links in the network, the links are hierarchically clustered using single-linkage clustering, and the resulting dendrogram is cut at a point that maximises the density of links within the clusters normalising against the maximum and minimum numbers of links possible in each cluster, known as the partition density. For directed and weighted networks, the Tanimoto coefficient is used for assigning similarity between links

S(e_{ik},e_{jk})=\frac{\mathbf{a}_{i}.\mathbf{a}_{j}}{|\mathbf{a}_{i}|^{2}+|\mathbf{a}_{j}|^{2}-\mathbf{a}_{i}.\mathbf{a}_{j}}

where \mathbf{a}_{i} refers to a vector describing the weights of links between node i and the nodes in the first-order neighbourhoods of both nodes i and j (equal to 0 in the event of an absent link). For directed networks, links to nodes shared by both node i and j are given a user-defined weight below 1 if they are in the opposite orientation.

For bi-partite networks, the set of neighbours (instead of the inclusive neighbour set) is used to count nodes for the edge similarity metric because node i and node j cannot share an edge in a bi-partite network. The partition density for bi-partite networks is calculated as:

D_{c} = \frac{2}{M}\sum_{c}m_{c}\frac{m_{c}+1-n_{c}}{2n_{c0}n_{c1}-2(n_{c}-1)}

where M is the total number of edges, m_c is the number of edges in subset c, n_c is the number of nodes in subset c, n_{c0} is the number of nodes in partition 0, and n_{c1} is the number of nodes in partition 1.

Value

An object of class linkcomm, which is a list containing the following components:

`numbers`	An integer vector with the number of edges, nodes, and communities.
`hclust`	An object of class `hclust`, which contains information about the hierarchical clustering of links.
`pdmax`	A numerical value indicating the height of the dendrogram at which the partition density is maximised.
`pdens`	A numerical matrix with 2 columns; the first is the heights at which clusters appear and the second is the partition density.
`nodeclusters`	A data frame consisting of 2 columns; the first contains node names, and the second contains single community IDs for each node. All communities and their nodes are represented, but not necessarily all nodes.
`clusters`	A list of integer vectors containing the link IDs that belong to each community. Community IDs are the numerical position of the communities in the list.
`edges`	A data frame with 3 columns; the first two contain nodes that interact with each other, and the third is an integer vector of community IDs indicating community membership for each link.
`numclusters`	A named integer vector. Names are node names and integer values are the number of communities to which each node belongs.
`clustsizes`	A named integer vector. Names are community IDs and integer values indicate the number of nodes that belong in each community.
`igraph`	An object of class `igraph`. The network is represented here as an `igraph` object.
`edgelist`	A character matrix with 2 columns containing the nodes that interact with each other.
`directed`	Logical indicating whether the network is directed.
`bipartite`	Logical indicating whether the network is bi-partite.

Note

When the number of links is less than edglim the hierarchical clustering will be handled in memory. Above this value the upper triangular dissimilarity matrix will be compressed and written to disk and read and written as clustering proceeds until the file size is 0 bytes using a compiled C++ function. In this case the hierarchical clustering method will always be "single" to enhance performance for large networks. The size of edglim can be modified to suit the computer resources available to the user. As a guide, a network with 10^{4} links will require ((10^{4})^{2})*8 = 800 MB to be handled in an uncompressed format in the memory.

For directed networks, a pair of bidirectional interactions between two nodes cannot be assigned similarities and the edge that appears lower in the edge list for the network will be discarded.

When use.all.edges is TRUE, the algorithm may be slow as all pairs of edges will be compared (n^2 comparisons, where n is the number of edges).

Author(s)

Alex T. Kalinka alex.t.kalinka@gmail.com

References

Ahn, Y.Y., Bagrow, J.P., and Lehmann, S. (2010). Link communities reveal multiscale complexity in networks. Nature 466, 761-764.

Kalinka, A.T. and Tomancak, P. (2011). linkcomm: an R package for the generation, visualization, and analysis of link communities in networks of arbitrary size and type. Bioinformatics 27, 2011-2012.

Examples

## Generate graph and extract link communities.
g <- swiss[,3:4]
lc <- getLinkCommunities(g)

## Extract communities by writing a temporary file to disk.
lc <- getLinkCommunities(g, edglim = 10)

## Use similarities between all pairs of edges.
lc <- getLinkCommunities(g, use.all.edges = TRUE)

## Directed network.
lc <- getLinkCommunities(g, directed = TRUE, dirweight = 0.8)

## Weighted network.
g <- cbind(swiss[,3:4], runif(nrow(swiss[,3:4])))
lc <- getLinkCommunities(g)

## Directed and weighted network.
lc <- getLinkCommunities(g, directed = TRUE, dirweight = 0.8)

[Package linkcomm version 1.0-14 Index]