node {simcausal} | R Documentation |
Create Node Object(s)
Description
Define a single DAG node and its distribution or define many repeated-measure/time-varying nodes by using argument t
.
The node distribution is allowed to vary as a function of time (t
). Conditionaing on past nodes is accomplished by using the syntactic sugar, such as, NodeName[t]
.
After all the nodes have been added to the DAG, call set.DAG
, a DAG object constructor, and add.action
, an action (intervention) constructor.
Usage
node(name, t, distr, EFU, order, ..., params = list(), asis.params = list())
Arguments
name |
Character node name or a vector of names when specifying a multivariate node. For time-dependent nodes the names will be automatically expanded to a scheme "name_t" for each t provided specified. |
t |
Node time-point(s). Allows specification of several time-points when t is a vector of positive integers, in which case the output will consist of a named list of length(t) nodes, corresponding to each value in t. |
distr |
Character name of the node distribution, can be a standard distribution R function, s.a. rnorm, rbinom, runif or user defined. The function must accept a named argument "n" to specify the total sample size. Distributional parameters (arguments) must be passed as either named arguments to node or as a named list of parameters "params". |
EFU |
End-of-Follow Up flag for designating a survival/censoring type node, only applies to Bernoulli nodes. When |
order |
An optional integer parameter specifying the order in which these nodes will be sampled. The value of order has to start at 1 and be unique for each new node,
can be specified as a range / vector and has to be of the same length as the argument |
... |
Named arguments specifying distribution parameters that are accepted by the |
params |
A list of additional named parameters to be passed on to the |
asis.params |
(ADVANCED USE) A list of additional named distributional parameters that will be evaluated "as is",
inside the currently simulated data.frame + the calling environment, without any modifications to the R expression strings inside the |
Value
A list containing node object(s) (expanded to several nodes if t is an integer vector of length > 1)
Details
The combination of name
and t
must uniquely identify each node in the DAG. Use argument t
to identify measurements of the same attribute (e.g. 'A1c') at various time points.
The collection of all unique t
values, across all nodes, should consist of non-negative integers (i.e., starting at 0).
The optional order
argument can be specified, used for determining the sampling order of each node.
When order
not specified, it is automatically inferred based on the actual order in which the nodes were added to the DAG (earlier added nodes get lower order
value and are sampled first)
All node calls that share the same generic name name
must also share the same EFU
value (if any is specified in at least one of them).
A value of TRUE
for the EFU
indicates that if a simulated value for a measurement of the attribute represented by node is 1
then all the following nodes with that measurement (in terms of higher t
values) in the DAG will be unobserved (i.e., their simulated value will be set to NA).
Node formulas (parameters of the distribution)
Each user-supplied argument to the node function is an evaluable R expression, their evaluation is delayed until the actual simulation time.
These arguments can refer to standard or user-specified R functions that must only apply to the values of parent nodes,
i.e. a subset of the node(s) with an order
value strictly lower than that of the node characterized by the formula.
Formulas must reference the parent nodes with unique name
identifiers, employing the square bracket vector subsetting name[t]
for referencing a
parent node at a particular time point t
(if any time-points were specified).
The square bracket notation is used to index a generic name with the relevant time point as illustrated in the examples.
When an input node is used to define several nodes (i.e., several measurement of the same attribute, t=0:5
), the formula(s) specified in that node can apply
to each node indexed by a given time point denoted by t
. This generic expression t
can then be referenced within a formula to simultaneously identify a
different set of parent nodes for each time point as illustrated below. Note that the parents of each node represented by a given node
object are implicitly defined
by the nodes referenced in formulas of that node
call.
Different types of evaluation for node function arguments
There is quite a bit of flexibility in the way in which the node
function arguments can be evaluated.
By default, the named arguments specified as expressions are first captured by delayed-evaluation and then
modified by simcausal
to enable the special types of functional syntax.
For example, simcausal will over-ride the subsetting operators '[...]
' (for time varying nodes) and '[[...]]' (for networks), implying
that these operators can no longer be used in their typical R
way.
Furthermore, simcausal will over-ride the standard 'c' function, with its own definition. Similarly, it will over-ride any calls to sum
and mean
functions
with their row-matrix counterpart functions rowSums
and rowMeans
.
When programming with simcausal
(such as passing node arguments inside a function, prior to defining the node), it may be helpful to instead pass
such node arguments as character strings, rather than as R expressions. In this case one should use the argument params
by adopting the following syntax node(...,params = list(mean="A+B"))
, which in this case is equivalent to: node(..., mean = A+B)
.
There are also instances when it might be desirable to retain the original behavior of all R
expressions and functions and evaluate a particular node argument "as is".
For example, the user may wish to retain the
original R
functionality of all its operators, including those of [...]
and [[...]]
.
In this case the node argument (or a specific part of the node argument) should be wrapped in .()
or eval()
.
Note that once the expression has been wrapped with .(...)
(or eval(...)
), the simcausal
definitions of operators [...]
and [[...]]
no
longer apply to these expressions and no error checking for "correctness" of these node arguments will be performed.
The forced-evaluation operator .()
can be also used as part of an expression,
which will prevent the typical simcausal
evaluation on only that specific part of the expression. Example 8 below demonstrates the following use case
for the expression .(coefAi[t]) * A[t-1]
,
which will look for vector coefAi
and then subset it by current value of t
(and return a scalar),
while A[t-1]
will evaluate to the entire column vector of variable A
for time point t-1
.
Such an expression will multiply the entire time-varying vector A[t-1]
by scalar value determined
by current value of t
and the previously defined vector coefAi
.
Furthermore, even when a vector or a matrix is wrapped in .(...) it still will be automatically re-parsed into K column matrix with n rows.
When this is not desired, for example, when defining a multivariate node distribution, the user may pass such vector or matrix node arguments as a character string
in a list argument asis.params
. See Example 7 and 8 below for additional details.
Multivariate random variables (multivariate nodes)
Starting from v.0.5, a single node
call can be used for defining a multivariate (and possibly correlated) random vector.
To define a random vector that has more than 1 dimension, use the argument name
to specify a vector with names for each dimension, e.g.,
node(c("X1","X2"), distr = "mvtnorm::rmvnorm", asis.params = list(mean = "c(0,1)", sigma = "matrix(c(1,0.75,0.75,1), ncol=2)"))
will define a bi-variate (correlated) normally distributed node,
the simulated data set will contain this bi-variately distributed random variable in columns "X1" and "X2".
Note that the multivariate sampling distribution function (such as function rmvnorm
from the package mvtnorm
) must return a matrix of
n
rows (number of observations) and length(name)
columns (dimensionality). See additional examples below.
Note that one can also define time-varying multivariate nodes, e.g.,
node(c("X1","X2"), t=0:5, distr = "mvtnorm::rmvnorm", asis.params = list(mean = "c(0,1)"))
.
References
Sofrygin O, van der Laan MJ, Neugebauer R (2017). "simcausal R Package: Conducting Transparent and Reproducible Simulation Studies of Causal Effect Estimation with Complex Longitudinal Data." Journal of Statistical Software, 81(2), 1-47. doi: 10.18637/jss.v081.i02.
Examples
#---------------------------------------------------------------------------------------
# EXAMPLE 1A: Define some Bernoulli nodes, survival outcome Y and put it together in a
# DAG object
#---------------------------------------------------------------------------------------
W1node <- node(name = "W1", distr = "rbern",
prob = plogis(-0.5), order = 1)
W2node <- node(name = "W2", distr = "rbern",
prob = plogis(-0.5 + 0.5 * W1), order = 2)
Anode <- node(name = "A", distr = "rbern",
prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2), order = 3)
Ynode <- node(name = "Y", distr = "rbern",
prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2), order = 4)
D1Aset <- set.DAG(c(W1node,W2node,Anode,Ynode))
#---------------------------------------------------------------------------------------
# EXAMPLE 1B: Same as 1A using +node interface and no order argument
#---------------------------------------------------------------------------------------
D1B <- DAG.empty()
D1B <- D1B +
node(name = "W1", distr = "rbern", prob = plogis(-0.5)) +
node(name = "W2", distr = "rbern", prob = plogis(-0.5 + 0.5 * W1)) +
node(name = "A", distr = "rbern", prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2)) +
node(name = "Y", distr = "rbern", prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2))
D1Bset <- set.DAG(D1B)
#---------------------------------------------------------------------------------------
# EXAMPLE 1C: Same as 1A and 1B using add.nodes interface and no order argument
#---------------------------------------------------------------------------------------
D1C <- DAG.empty()
D1C <- add.nodes(D1C, node(name = "W1", distr = "rbern", prob = plogis(-0.5)))
D1C <- add.nodes(D1C, node(name = "W2", distr = "rbern",
prob = plogis(-0.5 + 0.5 * W1)))
D1C <- add.nodes(D1C, node(name = "A", distr = "rbern",
prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2)))
D1C <- add.nodes(D1C, node(name = "Y", distr = "rbern",
prob = plogis(-0.1 + 1.2 * A + 0.3 * W1 + 0.3 * W2)))
D1C <- set.DAG(D1C)
#---------------------------------------------------------------------------------------
# EXAMPLE 1D: Add a uniformly distributed node and redefine outcome Y as categorical
#---------------------------------------------------------------------------------------
D_unif <- DAG.empty()
D_unif <- D_unif +
node("W1", distr = "rbern", prob = plogis(-0.5)) +
node("W2", distr = "rbern", prob = plogis(-0.5 + 0.5 * W1)) +
node("W3", distr = "runif", min = plogis(-0.5 + 0.7 * W1 + 0.3 * W2), max = 10) +
node("An", distr = "rbern", prob = plogis(-0.5 - 0.3 * W1 - 0.3 * W2 - 0.2 * sin(W3)))
# Categorical syntax 1 (probabilities as values):
D_cat_1 <- D_unif + node("Y", distr = "rcat.b1", probs = {0.3; 0.4})
D_cat_1 <- set.DAG(D_cat_1)
# Categorical syntax 2 (probabilities as formulas):
D_cat_2 <- D_unif +
node("Y", distr = "rcat.b1",
probs={plogis(-0.1 + 1.2 * An + 0.3 * W1 + 0.3 * W2 + 0.2 * cos(W3));
plogis(-0.5 + 0.7 * W1)})
D_cat_2 <- set.DAG(D_cat_2)
#---------------------------------------------------------------------------------------
# EXAMPLE 2A: Define Bernoulli nodes using R rbinom() function, defining prob argument
# for L2 as a function of node L1
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbinom", prob = 0.05, size = 1) +
node("L2", t = 0, distr = "rbinom", prob = ifelse(L1[0] == 1, 0.5, 0.1), size = 1)
Dset <- set.DAG(D)
#---------------------------------------------------------------------------------------
# EXAMPLE 2B: Equivalent to 2A, passing argument size to rbinom inside a named list
# params
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbinom", prob = 0.05, params = list(size = 1)) +
node("L2", t = 0, distr = "rbinom",
prob = ifelse(L1[0] == 1,0.5,0.1), params = list(size = 1))
Dset <- set.DAG(D)
#---------------------------------------------------------------------------------------
# EXAMPLE 2C: Equivalent to 2A and 2B, define Bernoulli nodes using a wrapper "rbern"
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("L1", t = 0, distr = "rbern", prob = 0.05) +
node("L2", t = 0, distr = "rbern", prob = ifelse(L1[0] == 1, 0.5, 0.1))
Dset <- set.DAG(D)
#---------------------------------------------------------------------------------------
# EXAMPLE 3: Define node with normal distribution using rnorm() R function
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + node("L2", t = 0, distr = "rnorm", mean = 10, sd = 5)
Dset <- set.DAG(D)
#---------------------------------------------------------------------------------------
# EXAMPLE 4: Define 34 Bernoulli nodes, or 2 Bernoulli nodes over 17 time points,
#---------------------------------------------------------------------------------------
t_end <- 16
D <- DAG.empty()
D <- D +
node("L2", t = 0:t_end, distr = "rbinom",
prob = ifelse(t == t_end, 0.5, 0.1), size = 1) +
node("L1", t = 0:t_end, distr = "rbinom",
prob = ifelse(L2[0] == 1, 0.5, 0.1), size = 1)
Dset <- set.DAG(D)
sim(Dset, n=10)
#---------------------------------------------------------------------------------------
# EXAMPLE 5: Defining new distribution function 'rbern', defining and passing a custom
# vectorized node function 'customfun'
#---------------------------------------------------------------------------------------
rbern <- function(n, prob) { # defining a bernoulli wrapper based on R rbinom function
rbinom(n = n, prob = prob, size = 1)
}
customfun <- function(arg, lambda) {
res <- ifelse(arg == 1, lambda, 0.1)
res
}
D <- DAG.empty()
D <- D +
node("W1", distr = "rbern", prob = 0.05) +
node("W2", distr = "rbern", prob = customfun(W1, 0.5)) +
node("W3", distr = "rbern", prob = ifelse(W1 == 1, 0.5, 0.1))
D1d <- set.DAG(D, vecfun = c("customfun"))
sim(D1d, n = 10, rndseed = 1)
#---------------------------------------------------------------------------------------
# EXAMPLE 6: Defining latent variables I and U.Y (will be hidden from simulated data)
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D +
node("I",
distr = "rcat.b1",
probs = c(0.1, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1)) +
node("W1",
distr = "rnorm",
mean = ifelse(I == 1, 0, ifelse(I == 2, 3, 10)) + 0.6 * I,
sd = 1) +
node("W2",
distr = "runif",
min = 0.025*I, max = 0.7*I) +
node("W3",
distr = "rbern",
prob = plogis(-0.5 + 0.7*W1 + 0.3*W2 - 0.2*I)) +
node("A",
distr = "rbern",
prob = plogis(+4.2 - 0.5*W1 + 0.2*W2/2 + 0.2*W3)) +
node("U.Y", distr = "rnorm", mean = 0, sd = 1) +
node("Y",
distr = "rconst",
const = -0.5 + 1.2*A + 0.1*W1 + 0.3*W2 + 0.2*W3 + 0.2*I + U.Y)
Dset1 <- set.DAG(D, latent.v = c("I", "U.Y"))
sim(Dset1, n = 10, rndseed = 1)
#---------------------------------------------------------------------------------------
# EXAMPLE 7: Multivariate random variables
#---------------------------------------------------------------------------------------
if (requireNamespace("mvtnorm", quietly = TRUE)) {
D <- DAG.empty()
# 2 dimensional normal (uncorrelated), using rmvnorm function from rmvnorm package:
D <- D +
node(c("X1","X2"), distr = "mvtnorm::rmvnorm",
asis.params = list(mean = "c(0,1)")) +
# Can define a degenerate (rconst) multivariate node:
node(c("X1.copy", "X2.copy"), distr = "rconst", const = c(X1, X2))
Dset1 <- set.DAG(D, verbose = TRUE)
sim(Dset1, n = 10)
}
# On the other hand this syntax wont work,
# since simcausal will parse c(0,1) into a two column matrix:
## Not run:
D <- DAG.empty()
D <- D + node(c("X1","X2"), distr = "mvtnorm::rmvnorm", mean = c(0,1))
Dset1 <- set.DAG(D, verbose = TRUE)
## End(Not run)
if (requireNamespace("mvtnorm", quietly = TRUE)) {
D <- DAG.empty()
# Bivariate normal (correlation coef 0.75):
D <- D +
node(c("X1","X2"), distr = "mvtnorm::rmvnorm",
asis.params = list(mean = "c(0,1)",
sigma = "matrix(c(1,0.75,0.75,1), ncol=2)")) +
# Can use any component of such multivariate nodes when defining new nodes:
node("A", distr = "rconst", const = 1 - X1)
Dset2 <- set.DAG(D, verbose = TRUE)
sim(Dset2, n = 10)
}
# Time-varying multivar node (3 time-points, 2 dimensional normal)
# plus changing the mean over time, as as function of t:
if (requireNamespace("mvtnorm", quietly = TRUE)) {
D <- DAG.empty()
D <- D +
node(c("X1", "X2"), t = 0:2, distr = "mvtnorm::rmvnorm",
asis.params = list(
mean = "c(0,1) + t",
sigma = "matrix(rep(0.75,4), ncol=2)"))
Dset5b <- set.DAG(D)
sim(Dset5b, n = 10)
}
# Two ways to define the same bivariate uniform copula:
if (requireNamespace("copula", quietly = TRUE)) {
D <- DAG.empty()
D <- D +
# with a warning since normalCopula() returns an object unknown to simcausal:
node(c("X1","X2"), distr = "copula::rCopula",
copula = eval(copula::normalCopula(0.75, dim = 2))) +
# same, as above:
node(c("X3","X4"), distr = "copula::rCopula",
asis.params = list(copula = "copula::normalCopula(0.75, dim = 2)"))
vecfun.add("qbinom")
# Bivariate binomial derived from previous copula, with same correlation:
D <- D +
node(c("A.Bin1", "A.Bin2"), distr = "rconst",
const = c(qbinom(X1, 10, 0.5), qbinom(X2, 15, 0.7)))
Dset3 <- set.DAG(D)
sim(Dset3, n = 10)
}
# Same as "A.Bin1" and "A.Bin2", but directly using rmvbin function in bindata package:
if (requireNamespace("bindata", quietly = TRUE)) {
D <- DAG.empty()
D <- D +
node(c("B.Bin1","B.Bin2"), distr = "bindata::rmvbin",
asis.params = list(
margprob = "c(0.5, 0.5)",
bincorr = "matrix(c(1,0.75,0.75,1), ncol=2)"))
Dset4b <- set.DAG(D)
sim(Dset4b, n = 10)
}
#---------------------------------------------------------------------------------------
# EXAMPLE 8: Combining simcausal non-standard evaluation with eval() forced evaluation
#---------------------------------------------------------------------------------------
coefAi <- 1:10
D <- DAG.empty()
D <- D +
node("A", t = 1, distr = "rbern", prob = 0.7) +
node("A", t = 2:10, distr = "rconst", const = eval(coefAi[t]) * A[t-1])
Dset8 <- set.DAG(D)
sim(Dset8, n = 10)
#---------------------------------------------------------------------------------------
# TWO equivalent ways of creating a multivariate node (combining nodes W1 and W2):
#---------------------------------------------------------------------------------------
D <- DAG.empty()
D <- D + node("W1", distr = "rbern", prob = 0.45)
D <- D + node("W2", distr = "rconst", const = 1)
# option 1:
D <- D + node(c("W1.copy1", "W2.copy1"), distr = "rconst", const = c(W1, W2))
# equivalent option 2:
create_mat <- function(W1, W2) cbind(W1, W2)
vecfun.add("create_mat")
D <- D + node(c("W1.copy2", "W2.copy2"), distr = "rconst", const = create_mat(W1, W2))
Dset <- set.DAG(D)
sim(Dset, n=10, rndseed=1)