bigergm {bigergm}R Documentation

bigergm: Exponential-family random graph models for large networks with local dependence

Description

The function bigergm estimates and simulates three classes of exponential-family random graph models for large networks under local dependence:

  1. The p_1 model of Holland and Leinhardt (1981) in exponential-family form and extensions by Vu, Hunter, and Schweinberger (2013), Schweinberger, Petrescu-Prahova, and Vu (2014), Dahbura et al. (2021), and Fritz et al. (2024) to both directed and undirected random graphs with additional model terms, with and without covariates.

  2. The stochastic block model of Snijders and Nowicki (1997) and Nowicki and Snijders (2001) in exponential-family form.

  3. The exponential-family random graph models with local dependence of Schweinberger and Handcock (2015), with and without covariates. The exponential-family random graph models with local dependence replace the long-range dependence of conventional exponential-family random graph models by short-range dependence. Therefore, exponential-family random graph models with local dependence replace the strong dependence of conventional exponential-family random graph models by weak dependence, reducing the problem of model degeneracy (Handcock, 2003; Schweinberger, 2011) and improving goodness-of-fit (Schweinberger and Handcock, 2015). In addition, exponential-family random graph models with local dependence satisfy a weak form of self-consistency in the sense that these models are self-consistent under neighborhood sampling (Schweinberger and Handcock, 2015), which enables consistent estimation of neighborhood-dependent parameters (Schweinberger and Stewart, 2017; Schweinberger, 2017).

Usage

bigergm(
  object,
  add_intercepts = FALSE,
  n_blocks = NULL,
  n_cores = 1,
  blocks = NULL,
  estimate_parameters = TRUE,
  verbose = 0,
  n_MM_step_max = 100,
  tol_MM_step = 1e-04,
  initialization = "infomap",
  use_infomap_python = FALSE,
  virtualenv_python = "r-bigergm",
  seed_infomap = NULL,
  weight_for_initialization = 1000,
  seed = NULL,
  method_within = "MPLE",
  control_within = ergm::control.ergm(),
  clustering_with_features = TRUE,
  compute_pi = FALSE,
  check_alpha_update = FALSE,
  check_blocks = FALSE,
  cache = NULL,
  return_checkpoint = TRUE,
  only_use_preprocessed = FALSE,
  ...
)

Arguments

object

An R formula object or bigergm class object. If a formula is given, the function estimates a new model specified by it. It needs to be of the form y ~ <model terms>, where y is a network object. For the details on the possible <model terms>, see ergmTerm and Morris, Handcock and Hunter (2008). All terms that induce dependence are excluded from the between block model, while the within block model includes all terms. When you pass a bigergm class object to the function, you continue from the previous MM step. Note that the block allocation (which is either provided by parameter blocks or estimated in the first step) is saved as the vertex.attribute block of the network. This attribute can also be used in the specified formula. The L-ergmTerm is supported to enable size-dependent coefficients for the within-blocks model. Note, however, that for size-dependent parameters of terms that are included in the between-blocks model, the intercept in the linear model provided to L-ergmTerm should not include the intercept. See the second example below for a demonstration.

add_intercepts

Boolean value to indicate whether adequate intercepts should be added to the provided formula so that the model in the first stage of the estimation is a nested model of the estimated model in the second stage of the estimation.

n_blocks

The number of blocks. This must be specified by the user. When you pass a bigergm class object to the function, you don't have to specify this argument.

n_cores

The number of CPU cores to use.

blocks

The pre-specified block memberships for each node. If NULL, the latent community structure is estimated, assuming that the number of communities is n_blocks.

estimate_parameters

If TRUE, both clustering and parameter estimation are implemented. If FALSE, only clustering is executed.

verbose

A logical or an integer: if this is TRUE/1, the program will print out additional information about the progress of estimation and simulation. A higher value yields lower level information.

n_MM_step_max

The maximum number of MM iterations. Currently, no early stopping criteria is introduced. Thus n_MM_step_max MM iterations are exactly implemented.

tol_MM_step

Tolerance regarding the relative change of the lower bound of the likelihood used to decide on the convergence of the clustering step

initialization

How the blocks should be initialized. If infomap (the default), igraph' or Python's infomap is implemented. If random, the initial clusters are randomly uniformally selected. If spectral, spectral clustering is conducted. If walktrap, the walktrap clustering algorithm as implemented in cluster_walktrap is conducted. If initialization is a vector of integers of the same length as the number of nodes in the provided network (in object), then the provided vector is used as the initial cluster assignment. If initialization is a string relating to a file path, bigergm will interpret it as block allocations saved in Python's infomap .clu format under that path.

use_infomap_python

If TRUE, the cluster initialization is implemented using Pythons' infomap.

virtualenv_python

Which virtual environment should be used for the infomap algorithm?

seed_infomap

seed value (integer) for the infomap algorithm, which can be used to initialize the estimation of the blocks.

weight_for_initialization

weight value used for cluster initialization. The higher this value, the more weight is put on the initialized block allocation.

seed

seed value (integer) for the random number generator.

method_within

If "MPLE" (the default), then the maximum pseudolikelihood estimator is implemented when estimating the within-block network model. If "MLE", then an approximate maximum likelihood estimator is conducted. If "CD" (EXPERIMENTAL), the Monte-Carlo contrastive divergence estimate is returned.

control_within

A list of control parameters for the ergm function used to estimate the parameters of the within model. See control.ergm for details.

clustering_with_features

If TRUE, clustering is implemented using the discrete covariates specified in the formula.

compute_pi

If TRUE, this function keeps track of pi matrices at each MM iteration. If the network is large, we strongly recommend to set to be FALSE.

check_alpha_update

If TRUE, this function keeps track of alpha matrices at each MM iteration. If the network is large, we strongly recommend to set to be FALSE.

check_blocks

If TRUE, this function keeps track of estimated block memberships at each MM iteration.

cache

a cachem cache object used to store intermediate calculations such as eigenvector decomposition results.

return_checkpoint

If TRUE, the function returns the checkpoint list. For most applications, this should be set to TRUE but if memory space needed by the output is an issue, set to FALSE.

only_use_preprocessed

If TRUE, the function only uses the preprocessed data from a previous fit but does not continue the estimation from its final iteration, instead the estimation is started again from the provided initialization.

...

Additional arguments, to be passed to lower-level functions (mainly to the ergm function used for the estimation of within-block connections).

Value

An object of class 'bigergm' including the results of the fitted model. These include:

call:

call of the mode

block:

vector of the found block of the nodes into cluster

initial_block:

vector of the initial block of the nodes into cluster

sbm_pi:

Connection probabilities represented as a n_blocks x n_blocks matrix from the first stage of the estimation between all clusters

MM_list_z:

list of cluster allocation for each node and each iteration

MM_list_alpha:

list of posterior distributions of cluster allocations for all nodes for each iteration

MM_change_in_alpha:

change in 'alpha' for each iteration

MM_lower_bound:

vector of the evidence lower bounds from the MM algorithm

alpha:

matrix representing the converged posterior distributions of cluster allocations for all nodes

counter_e_step:

integer number indicating the number of iterations carried out

adjacency_matrix:

sparse matrix representing the adjacency matrix used for the estimation

estimation_status:

character stating the status of the estimation

est_within:

ergm object of the model for within cluster connections

est_between:

ergm object of the model for between cluster connections

checkpoint:

list of information to continue the estimation (only returned if return_checkpoint = TRUE)

membership_before_kmeans:

vector of the found blocks of the nodes into cluster before the final check for bad clusters

estimate_parameters:

binary value if the parameters in the second step of the algorithm should be estimated or not

References

Babkin, S., Stewart, J., Long, X., and M. Schweinberger (2020). Large-scale estimation of random graph models with local dependence. Computational Statistics and Data Analysis, 152, 1–19.

Dahbura, J. N. M., Komatsu, S., Nishida, T. and Mele, A. (2021), ‘A structural model of business cards exchange networks’. https://arxiv.org/abs/2105.12704

Fritz C., Georg C., Mele A., and Schweinberger M. (2024). A strategic model of software dependency networks. https://arxiv.org/abs/2402.13375

Handcock, M. S. (2003). Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, University of Washington, Seattle.
https://csss.uw.edu/Papers/wp39.pdf

Holland, P. W. and S. Leinhardt (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, Theory & Methods, 76, 33–65.

Morris M, Handcock MS, Hunter DR (2008). Specification of Exponential-Family Random Graph Models: Terms and Computational Aspects. Journal of Statistical Software, 24.

Nowicki, K. and T. A. B. Snijders (2001). Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, Theory & Methods, 96, 1077–1087.

Schweinberger, M. (2011). Instability, sensitivity, and degeneracy of discrete exponential families. Journal of the American Statistical Association, Theory & Methods, 106, 1361–1370.

Schweinberger, M. (2020). Consistent structure estimation of exponential-family random graph models with block structure. Bernoulli, 26, 1205–1233.

Schweinberger, M. and M. S. Handcock (2015). Local dependence in random graph models: characterization, properties, and statistical inference. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 7, 647-676.

Schweinberger, M., Krivitsky, P. N., Butts, C.T. and J. Stewart (2020). Exponential-family models of random graphs: Inference in finite, super, and infinite population scenarios. Statistical Science, 35, 627-662.

Schweinberger, M. and P. Luna (2018). HERGM: Hierarchical exponential-family random graph models. Journal of Statistical Software, 85, 1–39.

Schweinberger, M., Petrescu-Prahova, M. and D. Q. Vu (2014). Disaster response on September 11, 2001 through the lens of statistical network analysis. Social Networks, 37, 42–55.

Schweinberger, M. and J. Stewart (2020). Concentration and consistency results for canonical and curved exponential-family random graphs. The Annals of Statistics, 48, 374–396.

Snijders, T. A. B. and K. Nowicki (1997). Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14, 75–100.

Stewart, J., Schweinberger, M., Bojanowski, M., and M. Morris (2019). Multilevel network data facilitate statistical inference for curved ERGMs with geometrically weighted terms. Social Networks, 59, 98–119.

Vu, D. Q., Hunter, D. R. and M. Schweinberger (2013). Model-based clustering of large networks. Annals of Applied Statistics, 7, 1010–1039.

Examples

# Load an embedded network object.
data(toyNet)

# Specify the model that you would like to estimate.
model_formula <- toyNet ~ edges + nodematch("x") + nodematch("y") + triangle
# Estimate the model
bigergm_res <- bigergm(
  object = model_formula,
  # The model you would like to estimate
  n_blocks = 4,
  # The number of blocks
  n_MM_step_max = 10,
  # The maximum number of MM algorithm steps
  estimate_parameters = TRUE,
  # Perform parameter estimation after the block recovery step
  clustering_with_features = TRUE,
  # Indicate that clustering must take into account nodematch on characteristics
  check_blocks = FALSE)
  
 # Example with N() operator
 
 ## Not run: 
set.seed(1)
# Prepare ingredients for simulating a network
N <- 500
K <- 10

list_within_params <- c(1, 2, 2,-0.5)
list_between_params <- c(-8, 0.5, -0.5)
formula <- g ~ edges + nodematch("x") + nodematch("y")  + N(~edges,~log(n)-1)

memb <- sample(1:K,prob = c(0.1,0.2,0.05,0.05,0.10,0.1,0.1,0.1,0.1,0.1), 
               size = N, replace = TRUE)
vertex_id <- as.character(11:(11 + N - 1))

x <- sample(1:2, size = N, replace = TRUE)
y <- sample(1:2, size = N, replace = TRUE)


df <- tibble::tibble(
  id = vertex_id,
  memb = memb,
  x = x,
  y = y
)
g <- network::network.initialize(n = N, directed = FALSE)
g %v% "vertex.names" <- df$id
g %v% "block" <- df$memb
g %v% "x" <- df$x
g %v% "y" <- df$y

# Simulate a network
g_sim <-
  simulate_bigergm(
    formula = formula,
    coef_within = list_within_params,
    coef_between = list_between_params,
    nsim = 1, 
    control_within = control.simulate.formula(MCMC.burnin = 200000))

estimation <- bigergm(update(formula,new = g_sim~.), n_blocks = 10, 
                      verbose = T)
summary(estimation)

## End(Not run)

[Package bigergm version 1.2.1 Index]