R: Create an undirected degree corrected stochastic blockmodel...

dcsbm {fastRG}

R Documentation

Create an undirected degree corrected stochastic blockmodel object

Description

To specify a degree-corrected stochastic blockmodel, you must specify the degree-heterogeneity parameters (via n or theta), the mixing matrix (via k or B), and the relative block probabilities (optional, via pi). We provide defaults for most of these options to enable rapid exploration, or you can invest the effort for more control over the model parameters. We strongly recommend setting the expected_degree or expected_density argument to avoid large memory allocations associated with sampling large, dense graphs.

Usage

dcsbm(
  n = NULL,
  theta = NULL,
  k = NULL,
  B = NULL,
  ...,
  pi = rep(1/k, k),
  sort_nodes = TRUE,
  force_identifiability = FALSE,
  poisson_edges = TRUE,
  allow_self_loops = TRUE
)

Arguments

`n`	(degree heterogeneity) The number of nodes in the blockmodel. Use when you don't want to specify the degree-heterogeneity parameters `theta` by hand. When `n` is specified, `theta` is randomly generated from a `LogNormal(2, 1)` distribution. This is subject to change, and may not be reproducible. `n` defaults to `NULL`. You must specify either `n` or `theta`, but not both.
`theta`	(degree heterogeneity) A numeric vector explicitly specifying the degree heterogeneity parameters. This implicitly determines the number of nodes in the resulting graph, i.e. it will have `length(theta)` nodes. Must be positive. Setting to a vector of ones recovers a stochastic blockmodel without degree correction. Defaults to `NULL`. You must specify either `n` or `theta`, but not both.
`k`	(mixing matrix) The number of blocks in the blockmodel. Use when you don't want to specify the mixing-matrix by hand. When `k` is specified, the elements of `B` are drawn randomly from a `Uniform(0, 1)` distribution. This is subject to change, and may not be reproducible. `k` defaults to `NULL`. You must specify either `k` or `B`, but not both.
`B`	(mixing matrix) A `k` by `k` matrix of block connection probabilities. The probability that a node in block `i` connects to a node in community `j` is `Poisson(B[i, j])`. Must be a square matrix. `matrix` and `Matrix` objects are both acceptable. If `B` is not symmetric, it will be symmetrized via the update `B := B + t(B)`. Defaults to `NULL`. You must specify either `k` or `B`, but not both.
`...`	Arguments passed on to `undirected_factor_model` `expected_degree` If specified, the desired expected degree of the graph. Specifying `expected_degree` simply rescales `S` to achieve this. Defaults to `NULL`. Do not specify both `expected_degree` and `expected_density` at the same time. `expected_density` If specified, the desired expected density of the graph. Specifying `expected_density` simply rescales `S` to achieve this. Defaults to `NULL`. Do not specify both `expected_degree` and `expected_density` at the same time.
`pi`	(relative block probabilities) Relative block probabilities. Must be positive, but do not need to sum to one, as they will be normalized internally. Must match the dimensions of `B` or `k`. Defaults to `rep(1 / k, k)`, or a balanced blocks.
`sort_nodes`	Logical indicating whether or not to sort the nodes so that they are grouped by block and by `theta`. Useful for plotting. Defaults to `TRUE`. When `TRUE`, nodes are first sorted by block membership, and then by degree-correction parameters within each block. Additionally, `pi` is sorted in increasing order, and the columns of the `B` matrix are permuted to match the new order of `pi`.
`force_identifiability`	Logical indicating whether or not to normalize `theta` such that it sums to one within each block. Defaults to `FALSE`, since this behavior can be surprise when `theta` is set to a vector of all ones to recover the DC-SBM case.
`poisson_edges`	Logical indicating whether or not multiple edges are allowed to form between a pair of nodes. Defaults to `TRUE`. When `FALSE`, sampling proceeds as usual, and duplicate edges are removed afterwards. Further, when `FALSE`, we assume that `S` specifies a desired between-factor connection probability, and back-transform this `S` to the appropriate Poisson intensity parameter to approximate Bernoulli factor connection probabilities. See Section 2.3 of Rohe et al. (2017) for some additional details.
`allow_self_loops`	Logical indicating whether or not nodes should be allowed to form edges with themselves. Defaults to `TRUE`. When `FALSE`, sampling proceeds allowing self-loops, and these are then removed after the fact.

Value

An undirected_dcsbm S3 object, a subclass of the undirected_factor_model() with the following additional fields:

theta: A numeric vector of degree-heterogeneity parameters.
z: The community memberships of each node, as a factor(). The factor will have k levels, where k is the number of communities in the stochastic blockmodel. There will not always necessarily be observed nodes in each community.
pi: Sampling probabilities for each block.
sorted: Logical indicating where nodes are arranged by block (and additionally by degree heterogeneity parameter) within each block.

Generative Model

There are two levels of randomness in a degree-corrected stochastic blockmodel. First, we randomly chose a block membership for each node in the blockmodel. This is handled by dcsbm(). Then, given these block memberships, we randomly sample edges between nodes. This second operation is handled by sample_edgelist(), sample_sparse(), sample_igraph() and sample_tidygraph(), depending depending on your desired graph representation.

Block memberships

Let z_i represent the block membership of node i. To generate z_i we sample from a categorical distribution (note that this is a special case of a multinomial) with parameter \pi, such that \pi_i represents the probability of ending up in the ith block. Block memberships for each node are independent.

Degree heterogeneity

In addition to block membership, the DCSBM also allows nodes to have different propensities for edge formation. We represent this propensity for node i by a positive number \theta_i. Typically the \theta_i are constrained to sum to one for identifiability purposes, but this doesn't really matter during sampling (i.e. without the sum constraint scaling B and \theta has the same effect on edge probabilities, but whether B or \theta is responsible for this change is uncertain).

Edge formulation

Once we know the block memberships z and the degree heterogeneity parameters theta, we need one more ingredient, which is the baseline intensity of connections between nodes in block i and block j. Then each edge A_{i,j} is Poisson distributed with parameter

\lambda[i, j] = \theta_i \cdot B_{z_i, z_j} \cdot \theta_j.

Examples


set.seed(27)

lazy_dcsbm <- dcsbm(n = 1000, k = 5, expected_density = 0.01)
lazy_dcsbm

# sometimes you gotta let the world burn and
# sample a wildly dense graph

dense_lazy_dcsbm <- dcsbm(n = 500, k = 3, expected_density = 0.8)
dense_lazy_dcsbm

# explicitly setting the degree heterogeneity parameter,
# mixing matrix, and relative community sizes rather
# than using randomly generated defaults

k <- 5
n <- 1000
B <- matrix(stats::runif(k * k), nrow = k, ncol = k)

theta <- round(stats::rlnorm(n, 2))

pi <- c(1, 2, 4, 1, 1)

custom_dcsbm <- dcsbm(
  theta = theta,
  B = B,
  pi = pi,
  expected_degree = 50
)

custom_dcsbm

edgelist <- sample_edgelist(custom_dcsbm)
edgelist

# efficient eigendecompostion that leverages low-rank structure in
# E(A) so that you don't have to form E(A) to find eigenvectors,
# as E(A) is typically dense. computation is
# handled via RSpectra

population_eigs <- eigs_sym(custom_dcsbm)

[Package fastRG version 0.3.2 Index]