node {simDAG} | R Documentation |
Create a node object to grow a DAG step-by-step
Description
These functions should be used in conjunction with the empty_dag function to create DAG objects, which can then be used to simulate data using the sim_from_dag function or the sim_discrete_time function.
Usage
node(name, type, parents=NULL, formula=NULL, ...)
node_td(name, type, parents=NULL, formula=NULL, ...)
Arguments
name |
A character vector with at least one entry specifying the name of the node. If a character vector containing multiple different names is supplied, one separate node will be created for each name. These nodes are completely independent, but have the exact same node definition as supplied by the user. If only a single character string is provided, only one node is generated. |
type |
A single character string specifying the type of the node. Depending on whether the node is a root node, a child node or a time-dependent node, different node types are allowed. See details. |
parents |
A character vector of names, specifying the parents of the node or |
formula |
An optional |
... |
Further named arguments needed to specify the node. Those can be parameters of distribution functions such as the |
Details
To generate data using the sim_from_dag function or the sim_discrete_time function, it is required to create a DAG object first. This object needs to contain information about the causal structure of the data (e.g. which variable causes which variable) and the specific structural equations for each variable (information about causal coefficients, type of distribution etc.). In this package, the node and/or node_td function is used in conjunction with the empty_dag function to create this object.
This works by first initializing an empty DAG using the empty_dag function and then adding multiple calls to the node and/or node_td functions to it using a simple +, where each call to node and/or node_td adds information about a single node that should be generated. Multiple examples are given below.
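As a minimal sketch of this workflow, the following builds a small DAG with + and then passes it to sim_from_dag. The node names, the coefficient values and the choice of n_sim=100 are illustrative assumptions, not recommendations.

```r
library(simDAG)

# start from an empty DAG and add one node per call using "+"
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=12, error=2)

# generate 100 observations from the DAG (values are made up)
set.seed(42)
sim_dat <- sim_from_dag(dag, n_sim=100)
head(sim_dat)
```

Each node call only describes how a variable should be generated; no data is drawn until the simulation function is called.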
In each call to node or node_td the user needs to indicate what the node should be called (name), which function should be used to generate the node (type), whether the node has any parents and if so which (parents), and any additional arguments needed to call the data-generating function of this node later, which are passed through the three-dot syntax (...).
node vs. node_td:
By calling node you are indicating that this node is a time-fixed variable which should only be generated once. By using node_td you are indicating that it is a time-dependent node, which will be updated at each step in time when using a discrete-time simulation.
node_td should only be used if you are planning to perform a discrete-time simulation with the sim_discrete_time function. DAG objects including time-dependent nodes may not be used in the sim_from_dag function.
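The distinction can be sketched as follows, assuming a DAG with one time-fixed and one time-dependent node. The node names, the probability, the event duration and max_t=20 are illustrative assumptions.

```r
library(simDAG)

# "age" is generated once, "sickness" is updated at every time step
dag <- empty_dag() +
  node("age", type="rnorm", mean=30, sd=4) +
  node_td("sickness", type="time_to_event", prob_fun=0.01,
          event_duration=5)

# a DAG containing node_td() calls has to be simulated with
# sim_discrete_time(), here for 50 individuals over 20 time steps
set.seed(123)
sim <- sim_discrete_time(dag, n_sim=50, max_t=20)

# the state of the simulation after the last time step
head(sim$data)
```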
Implemented Root Node Types:
Any function can be used to generate root nodes. The only requirement is that the function has at least one named argument called n which controls the length of the resulting vector. For example, the user could specify a node of type "rnorm" to create a normally distributed node with no parents. The argument n will be set internally, but any additional arguments can be specified using the ... syntax. In the type="rnorm" example, the user could set the mean and standard deviation using node(name="example", type="rnorm", mean=10, sd=5).
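The same pattern should apply to other functions that have an n argument. As a hedged sketch, the following uses rexp from the stats package as the root node type. The node name and the rate value are illustrative assumptions.

```r
library(simDAG)

# rexp() has an "n" argument, so it can serve as a root node type;
# "rate" is passed on to rexp() through the ... syntax
dag <- empty_dag() +
  node("waiting_time", type="rexp", rate=0.2)

set.seed(1)
sim_dat <- sim_from_dag(dag, n_sim=10)
sim_dat
```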
For convenience, this package additionally includes three custom root-node functions:
"rbernoulli": Draws randomly from a Bernoulli distribution.
"rcategorical": Draws randomly from any discrete probability density function.
"rconstant": Used to set a variable to a constant value.
Implemented Child Node Types:
Currently, the following node types are implemented directly for convenience:
"gaussian": A node based on linear regression.
"binomial": A node based on logistic regression.
"conditional_prob": A node based on conditional probabilities.
"conditional_distr": A node based on conditional draws from different distributions.
"multinomial": A node based on multinomial regression.
"poisson": A node based on Poisson regression.
"negative_binomial": A node based on negative binomial regression.
"cox": A node based on Cox regression.
For custom child node types, see below.
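As a rough sketch of how two of these child node types might be combined, the following defines a continuous and a count outcome that depend on root nodes. All node names and coefficient values are made up for illustration.

```r
library(simDAG)

dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("smoker", type="rbernoulli", p=0.3) +
  # continuous outcome based on linear regression
  node("blood_pressure", type="gaussian", parents=c("age", "smoker"),
       betas=c(0.5, 8), intercept=90, error=5) +
  # count outcome based on Poisson regression
  node("n_visits", type="poisson", parents="age",
       betas=0.01, intercept=0)

set.seed(2024)
sim_dat <- sim_from_dag(dag, n_sim=200)
head(sim_dat)
```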
Implemented Time-Dependent Node Types:
Currently, the following node types are implemented directly for convenience to use in node_td calls:
"time_to_event": A node based on repeatedly checking whether an event occurs at each point in time.
"competing_events": A node based on repeatedly checking whether one of multiple mutually exclusive events occurs at each point in time.
However, the user may also use any of the child node types in a node_td call directly. For custom time-dependent node types, see below.
Custom Node Types
It is very simple to write a new custom node_function to be used instead of the implemented node types, allowing the user to use any type of data-generation mechanism for any type of node (root / child / time-dependent). All that is required of this function is that it has the named arguments data (the sample as generated so far) and, if it's a child node, parents (a character vector specifying the parents) and outputs either a vector containing n_sim entries, or a data.frame with n_sim rows and an arbitrary number of columns. More information about this can be found on the node_custom documentation page.
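The contract described above can be sketched in plain R without any package code. The function name node_row_mean, its scale argument and the generating mechanism are hypothetical and only serve to illustrate the required arguments and output; how such a function is referenced in a node call is described on the node_custom documentation page.

```r
# hypothetical custom child node function (not part of simDAG)
node_row_mean <- function(data, parents, scale=1) {
  # data: the sample as generated so far
  # parents: character vector with the names of the parent columns
  parent_dat <- as.data.frame(data)[, parents, drop=FALSE]

  # one value per row: scaled mean of the parents plus random noise
  out <- rowMeans(parent_dat) * scale + rnorm(nrow(parent_dat))
  out
}

# calling the function by hand on a toy data set to check the contract
set.seed(10)
toy <- data.frame(a=rnorm(5), b=rnorm(5))
res <- node_row_mean(data=toy, parents=c("a", "b"), scale=2)

# the output has exactly one entry per row of the input data
length(res) == nrow(toy)
```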
Using child nodes as parents for other nodes:
Most child nodes can be easily used as parents for other nodes. This allows the resulting DAG to be rather complex. However, if the data generated by the child node is categorical (such as when using node_multinomial) or when it has complex data structures in general (such as when using node_cox), it may be difficult to use the output as parents. Using a custom node type, the user may use any node as a parent as they see fit. Using nodes of type "cox", "multinomial" or (depending on the utilized parameters) "conditional_prob" as parents may result in errors when using standard child node types such as "binomial" or "gaussian".
Cyclic causal structures:
The name DAG (directed acyclic graph) implies that cycles are not allowed. This means that if you start from any node and only follow the arrows in the direction they are pointing, there should be no way to get back to your original node. This is necessary both theoretically and for practical reasons if we are dealing with static DAGs created using the node function. If the user attempts to generate data from a static cyclic graph using the sim_from_dag function, an error will be produced.
However, in the realm of discrete-time simulations, cyclic causal structures are perfectly reasonable. A variable A at t = 1 may influence a variable B at t = 2, which in turn may influence variable A at t = 3 again. Therefore, when using the node_td function to simulate time-dependent data using the sim_discrete_time function, cyclic structures are allowed to be present and no error will be produced.
Value
Returns a DAG.node object which can be added to a DAG object directly.
Note
Contrary to the R standard, this function does NOT support partial matching of argument names. This means that supplying nam="age" will not be recognized as name="age" and instead will be added as an additional node argument used in the respective data-generating function call when using sim_from_dag.
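For contrast, the standard R behavior that node deviates from can be shown with an ordinary function. The function f below is a hypothetical stand-in and not part of this package.

```r
# ordinary R function with a "name" argument placed before "..."
f <- function(name, ...) {
  extra <- list(...)
  list(name=name, extra_names=names(extra))
}

# standard R partially matches "nam" to "name", so nothing ends up in "..."
res <- f(nam="age")
res$name         # "age"
res$extra_names  # NULL
```

With node, the same nam="age" is instead treated as one of the additional arguments, so argument names should always be written out in full.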
Author(s)
Robin Denz
Examples
library(simDAG)

# creating a DAG with a single root node
dag <- empty_dag() +
  node("age", type="rnorm", mean=30, sd=4)

# creating a DAG with multiple root nodes
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500)

# creating a DAG with multiple root nodes + multiple names in one node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node(c("income_1", "income_2"), type="rnorm", mean=2700, sd=500)

# also using child nodes
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node("sickness", type="binomial", parents=c("sex", "income"),
       betas=c(1.2, -0.3), intercept=-15) +
  node("death", type="binomial", parents=c("sex", "income", "sickness"),
       betas=c(0.1, -0.4, 0.8), intercept=-20)

# using time-dependent nodes
# NOTE: to simulate data from this DAG, the sim_discrete_time() function needs
# to be used due to "sickness" being a time-dependent node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node_td("sickness", type="binomial", parents=c("sex", "income"),
          betas=c(0.1, -0.4), intercept=-50)

# we could also use a DAG with only time-varying variables
dag <- empty_dag() +
  node_td("vaccine", type="time_to_event", prob_fun=0.001, event_duration=21) +
  node_td("covid", type="time_to_event", prob_fun=0.01, event_duration=15,
          immunity_duration=100)