san {ergm} | R Documentation |
Generate networks with a given set of network statistics
Description
This function attempts to find a network or networks whose statistics match
those passed in via the target.stats
vector.
Usage
san(object, ...)
## S3 method for class 'formula'
san(
object,
response = NULL,
reference = ~Bernoulli,
constraints = ~.,
target.stats = NULL,
nsim = NULL,
basis = NULL,
output = c("network", "edgelist", "ergm_state"),
only.last = TRUE,
control = control.san(),
verbose = FALSE,
offset.coef = NULL,
...
)
## S3 method for class 'ergm_model'
san(
object,
reference = ~Bernoulli,
constraints = ~.,
target.stats = NULL,
nsim = NULL,
basis = NULL,
output = c("network", "edgelist", "ergm_state"),
only.last = TRUE,
control = control.san(),
verbose = FALSE,
offset.coef = NULL,
...
)
Arguments
object |
Either a |
... |
Further arguments passed to other functions. |
response |
Either a character string, a formula, or
|
reference |
A one-sided formula specifying
the reference measure ( |
constraints |
A formula specifying one or more constraints
on the support of the distribution of the networks being modeled. Multiple constraints
may be given, separated by “+” and “-” operators. See
The default is to have no constraints except those provided through
the Together with the model terms in the formula and the reference measure, the constraints define the distribution of networks being modeled. It is also possible to specify a proposal function directly either
by passing a string with the function's name (in which case,
arguments to the proposal should be specified through the
Note that not all possible combinations of constraints and reference measures are supported. However, for relatively simple constraints (i.e., those that simply permit or forbid specific dyads or sets of dyads from changing), arbitrary combinations should be possible. |
target.stats |
A vector of the same length as the number of non-offset statistics implied by the formula. |
nsim |
Number of networks to generate. Deprecated: just use |
basis |
If not NULL, a |
output |
Character, one of |
only.last |
if |
control |
A list of control parameters for algorithm tuning,
typically constructed with |
verbose |
A logical or an integer to control the amount of
progress and diagnostic information to be printed. |
offset.coef |
A vector of offset coefficients; these must be passed in by the user.
Note that these should be the same set of coefficients one would pass to |
formula |
(By default, the |
Details
The following description is an exegesis of section 4 of Krivitsky et al. (2022).
Let \mathbf{g}
be a vector of target statistics for the
network we wish to construct. That is, we are given an arbitrary network
\mathbf{y}^0 \in \mathcal{Y}
, and we seek a network
\mathbf{y} \in \mathcal{Y}
such that
\mathbf{g}(\mathbf{y}) \approx \mathbf{g}
– ideally equality is achieved,
but in practice we may have to settle for a close approximation. The
variant of simulated annealing is as follows.
The energy function is defined
E_W (\mathbf{y}) = (\mathbf{g}(\mathbf{y}) - \mathbf{g})^\mathsf{T} W (\mathbf{g}(\mathbf{y}) - \mathbf{g}),
with W
a symmetric positive (barring multicollinearity in statistics)
definite matrix of weights. This function achieves 0 only if the target is
reached. A good choice of this matrix yields a more efficient search.
A standard simulated annealing loop is used, as described below, with some
modifications. In particular, we allow the user to specify a vector of
offsets \eta
to bias the annealing, with \eta_k = 0
denoting no offset. Offsets can be used with SAN to forbid certain
statistics from ever increasing or decreasing. As with ergm()
, offset
terms are specified using the offset()
decorator and their coefficients
specified with the offset.coef
argument. By default, finite offsets are
ignored by, but this can be overridden by setting the control.san()
argument SAN.ignore.finite.offsets = FALSE
.
The number of simulated annealing runs is specified by the SAN.maxit
control parameter and the initial value of the temperature T
is set
to SAN.tau
. The value of T
decreases linearly until T = 0
at the last run, which implies that all proposals that increase
E_W (\mathbf{y})
are rejected. The weight matrix W
is initially set to I_p / p
, where I_p
is the identity matrix
of an appropriate dimension. For weight W
and temperature T
,
the simulated annealing iteration proceeds as follows:
Test if
E_W(\mathbf{y}) = 0
. If so, then exit.Generate a perturbed network
\mathbf{y^*}
from a proposal that respects the model constraints. (This is typically the same proposal as that used for MCMC.)Store the quantity
\mathbf{g}(\mathbf{y^*}) - \mathbf{g}(\mathbf{y})
for later use.Calculate acceptance probability
\alpha = \exp[ - (E_W (\mathbf{y^*}) - E_W (\mathbf{y})) / T + \eta^\mathsf{T} (\mathbf{g}(\mathbf{y^*}) - \mathbf{g}(\mathbf{y}))]
(If
|\eta_k| = \infty
andg_k (\mathbf{y^*}) - g_k (\mathbf{y}) = 0
, their product is defined to be 0.)Replace
\mathbf{y}
with\mathbf{y^*}
with probability\min(1, \alpha)
.
After the specified number of iterations, T
is updated as described
above, and W
is recalculated by first computing a matrix S
, the
sample covariance matrix of the proposed differences stored in Step 3
(i.e., whether or not they were rejected), then
W = S^+ / tr(S^+)
, where S^+
is the
Moore–Penrose pseudoinverse of S
and tr(S^+)
is the
trace of S^+
. The differences in Step 3 closely reflect the
relative variances and correlations among the network statistics.
In Step 2, the many options for MCMC proposals can provide for effective means of speeding the SAN algorithm's search for a viable network.
Value
A network or list of networks that hopefully have network
statistics close to the target.stats
vector. No guarantees
are provided about their probability distribution. Additionally,
attr()
-style attributes formula
and stats
are included.
Methods (by class)
-
san(formula)
: Sufficient statistics are specified by aformula
. -
san(ergm_model)
: A lower-level function that expects a pre-initializedergm_model
.
References
Krivitsky, P. N., Hunter, D. R., Morris, M., & Klumb, C. (2022). ergm 4: Computational Improvements. arXiv preprint arXiv:2203.08198.
Examples
# initialize x to a random undirected network with 50 nodes and a density of 0.1
x <- network(50, density = 0.05, directed = FALSE)
# try to find a network on 50 nodes with 300 edges, 150 triangles,
# and 1250 4-cycles, starting from the network x
y <- san(x ~ edges + triangles + cycle(4), target.stats = c(300, 150, 1250))
# check results
summary(y ~ edges + triangles + cycle(4))
# initialize x to a random directed network with 50 nodes
x <- network(50)
# add vertex attributes
x %v% 'give' <- runif(50, 0, 1)
x %v% 'take' <- runif(50, 0, 1)
# try to find a set of 100 directed edges making the outward sum of
# 'give' and the inward sum of 'take' both equal to 62.5, so in
# edges (i,j) the node i tends to have above average 'give' and j
# tends to have above average 'take'
y <- san(x ~ edges + nodeocov('give') + nodeicov('take'), target.stats = c(100, 62.5, 62.5))
# check results
summary(y ~ edges + nodeocov('give') + nodeicov('take'))
# initialize x to a random undirected network with 50 nodes
x <- network(50, directed = FALSE)
# add a vertex attribute
x %v% 'popularity' <- runif(50, 0, 1)
# try to find a set of 100 edges making the total sum of
# popularity(i) and popularity(j) over all edges (i,j) equal to
# 125, so nodes with higher popularity are more likely to be
# connected to other nodes
y <- san(x ~ edges + nodecov('popularity'), target.stats = c(100, 125))
# check results
summary(y ~ edges + nodecov('popularity'))
# creates a network with denser "core" spreading out to sparser
# "periphery"
plot(y)