sim_discrete_time {simDAG} | R Documentation |
Using Discrete-Time Simulation to Generate Complex Data from a Given DAG and Node Information
Description
Similar to the sim_from_dag
function, this function can be used to generate data from a given DAG. In contrast to the sim_from_dag
function, this function utilizes a discrete-time simulation approach. It is not an "off-the-shelf" simulation function but rather a "framework function" that makes it easier to create discrete-time simulations. It usually requires custom functions written by the user. See Details.
Usage
sim_discrete_time(dag, n_sim=NULL, t0_sort_dag=TRUE,
t0_data=NULL, t0_transform_fun=NULL,
t0_transform_args=list(), max_t,
tx_nodes_order=NULL, tx_transform_fun=NULL,
tx_transform_args=list(),
save_states="last", save_states_at=NULL,
verbose=FALSE, check_inputs=TRUE)
Arguments
dag |
A |
n_sim |
A single number specifying how many observations should be generated. If a |
t0_sort_dag |
Corresponds to the |
t0_data |
An optional |
t0_transform_fun |
An optional function that takes the data created at |
t0_transform_args |
A named list of additional arguments passed to the |
max_t |
A single integer specifying the final point in time to which the simulation should be carried out. The simulation will start at |
tx_nodes_order |
A numeric vector specifying the order in which the time-dependent nodes added to the |
tx_transform_fun |
An optional function that takes the data created after every point in time |
tx_transform_args |
A named list of additional arguments passed to the |
save_states |
Specifies the amount of simulation states that should be saved in the output object. Has to be one of |
save_states_at |
The specific points in time at which the simulated |
verbose |
If |
check_inputs |
Whether to perform plausibility checks for the user input or not. Is set to |
Details
Sometimes it is necessary to simulate complex data that cannot be described easily with a single DAG and node information. This may be the case if the desired data should contain multiple time-dependent variables or time-to-event variables in which the event has time-dependent effects on other events. An example for this is data on vaccinations and their effects on the occurrence of adverse events (see vignette). Discrete-Time Simulation can be an effective tool to generate these kinds of datasets.
What is Discrete-Time Simulation?:
In a discrete-time simulation, there are entities that have certain states associated with them, and these states only change at discrete points in time. For example, the entities could be people and the state could be alive or dead. In this example we could generate 100 people with some covariates such as age and sex. We then start by increasing the simulation time by one day. For each person we now check whether the person has died using a Bernoulli trial, where the probability of dying is calculated at each point in time based on some of the covariates. The simulation time is then increased again and the process is repeated until we reach max_t
.
Due to the iterative process it is very easy to simulate arbitrarily complex data. The covariates may change over time in arbitrary ways, the event probability can have any functional relationship with the covariates and so on. If we want to model an event type that is not terminal, such as occurrence of cardiovascular disease, events can easily be simulated to be dependent on the timing and number of previous events. Since Discrete-Time Simulation is a special case of Discrete-Event Simulation, introductory textbooks on the latter can be of great help in getting a better understanding of the former.
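The person and death example described above can be sketched in base R without this package. This is only an illustration of the general idea, not the internal implementation of this function. The number of days, the coefficients of the logistic model and all object names are assumptions chosen for illustration.

```r
set.seed(1)

n_people <- 100   # number of entities (value taken from the text above)
max_t <- 365      # number of simulated days (hypothetical value)

# state of all entities at t = 0
people <- data.frame(age = rnorm(n_people, mean = 50, sd = 4),
                     sex = rbinom(n_people, size = 1, prob = 0.5),
                     dead = FALSE,
                     death_time = NA_integer_)

for (t in seq_len(max_t)) {
  alive <- !people$dead

  # everyone who is still alive ages by one day
  people$age[alive] <- people$age[alive] + 1/365

  # daily death probability from a logistic model (made-up coefficients)
  p_death <- plogis(-12 + 0.1 * people$age - 0.2 * people$sex)

  # one Bernoulli trial per person; only people still alive can die
  dies <- alive & (rbinom(n_people, size = 1, prob = p_death) == 1)

  people$dead[dies] <- TRUE
  people$death_time[dies] <- t
}

# number of people who died within the simulated period
table(people$dead)
```

Because death is a terminal state here, a person who has died is never updated again. Non-terminal events would instead reset after some duration and could occur repeatedly.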
How it Works:
Internally, this function works by first simulating data using the sim_from_dag
function. Alternatively, the user can supply a custom data.table
using the t0_data
argument. This data defines the state of all entities at t = 0
. Afterwards, the simulation time is increased by one unit and the data is transformed in place by calling each node function defined by the time-dependent nodes which were added to the dag
using the node_td
function (either in the order in which they were added to the dag
object or by the order defined by the tx_nodes_order
argument). Usually, each transformation changes the state of the entities in some way. For example if there is an age
variable, we would probably increase the age of each person by one time unit at every step. Once max_t
is reached, the resulting data.table
will be returned. It contains the state of all entities at the last step, with additional information about when they experienced events (if node_time_to_event
was used as a time-dependent node). Multiple in-depth examples can be found in the vignettes of this package.
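As a minimal sketch of the t0_data route described above, the following supplies a custom data.table as the state at t = 0 and lets the time-dependent nodes transform it. It assumes that a dag containing only time-dependent nodes is accepted when t0_data is given. The baseline data, the helper names (node_advance_age, prob_death) and all coefficients are hypothetical.

```r
library(data.table)
library(simDAG)

set.seed(2)

# hypothetical baseline data standing in for a real cohort at t = 0
t0 <- data.table(age = rnorm(50, mean = 50, sd = 4),
                 sex = rbinom(50, size = 1, prob = 0.5))

# node function: everyone ages by one day per step
node_advance_age <- function(data) {
  data$age + 1/365
}

# daily probability of death (made-up coefficients)
prob_death <- function(data) {
  plogis(-10 + 0.05 * data$age - 0.2 * data$sex)
}

# only time-dependent nodes are added, t0_data provides the rest
dag <- empty_dag() +
  node_td("age", type = "advance_age", parents = "age") +
  node_td("death", type = "time_to_event", parents = c("age", "sex"),
          prob_fun = prob_death, event_duration = Inf)

sim <- sim_discrete_time(dag, t0_data = t0, max_t = 30)

# state of all entities at the last step
head(sim$data)
```

The nodes are processed in the order in which they were added, so age is advanced before the death probability is evaluated at each step.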
Specifying the dag
argument:
The dag
argument should be specified as described in the node
documentation page. More examples specific to discrete-time simulations can be found in the vignettes and the examples. The only difference to specifying a dag
for the sim_from_dag
function is that the dag
here should contain at least one time-dependent node added using the node_td
function.
Speed Considerations:
All functions in this package rely on the data.table
backend in order to make them more memory efficient and faster. It is however important to note that the time to simulate a dataset increases non-linearly with an increasing max_t
value and additional time-dependent nodes. This is usually not a concern for smaller datasets, but if n_sim
is very large (say > 1 million) this function will get rather slow.
What do I do with the output?:
This function outputs a simDT
object, not a data.table
. To obtain an actual dataset from the output of this function, users should use the sim2data
function to transform it into the desired format. Currently, the long-format, the wide-format and the start-stop format are supported. See sim2data
for more information.
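A short sketch of this conversion step follows. The dag, the prob_sick function and its coefficients are made up for illustration. Only a single time-to-event node is used as a time-dependent node so that the past event times stored in the output are sufficient to reconstruct each format.

```r
library(simDAG)

set.seed(3)

# daily probability that a sickness episode starts
# (hypothetical function and coefficients)
prob_sick <- function(data) {
  0.02 + 0.01 * data$sex
}

dag <- empty_dag() +
  node("sex", type = "rbernoulli", p = 0.5) +
  node_td("sickness", type = "time_to_event", parents = "sex",
          prob_fun = prob_sick, event_duration = 3)

sim <- sim_discrete_time(dag, n_sim = 20, max_t = 30)

# one row per person and interval of constant state
d_start_stop <- sim2data(sim, to = "start_stop")

# one row per person and point in time
d_long <- sim2data(sim, to = "long")

# one row per person, one column per variable and point in time
d_wide <- sim2data(sim, to = "wide")

head(d_start_stop)
```

The start-stop format is usually the most compact of the three, because the long format grows with n_sim times max_t rows and the wide format with one column per time point.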
A Few Words of Caution:
In most cases it will be necessary for the user to write their own functions in order to actually use the sim_discrete_time
function. Unlike the sim_from_dag
function, in which many popular node types can be implemented in a re-usable way, discrete-time simulation will always require some custom input by the user. This is the price users have to pay for the almost unlimited flexibility offered by this simulation methodology.
Value
Returns a simDT
object, containing some general information about the simulated data as well as the final state of the simulated dataset (and more states, depending on the specification of the save_states
argument). In particular, it includes the following objects:
past_states: A list containing the generated data at the specified points in time.
save_states: The value of the save_states argument supplied by the user.
data: The data at time max_t.
tte_past_events: A list storing the times at which events happened in variables of type "time_to_event", if specified.
ce_past_events: A list storing the times at which events happened in variables of type "competing_events", if specified.
ce_past_causes: A list storing the types of events which happened in variables of type "competing_events", if specified.
tx_nodes: A list of all time-varying nodes, as specified in the supplied dag object.
max_t: The value of max_t, as supplied by the user.
t0_var_names: A character vector containing the names of all variables that do not vary over time.
To obtain a single dataset from this function that can be processed further, please use the sim2data
function.
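The elements listed above can be inspected directly on the returned object. The following is a small sketch with a made-up dag and probability function. It assumes that "all" is an accepted value of the save_states argument.

```r
library(simDAG)

set.seed(4)

# hypothetical daily event probability depending on sex
prob_sick <- function(data) {
  0.05 + 0.02 * data$sex
}

dag <- empty_dag() +
  node("sex", type = "rbernoulli", p = 0.5) +
  node_td("sickness", type = "time_to_event", parents = "sex",
          prob_fun = prob_sick, event_duration = 2)

# keep the state of the data at every point in time
sim <- sim_discrete_time(dag, n_sim = 10, max_t = 5, save_states = "all")

class(sim)        # the simDT class
names(sim)        # elements stored in the object
sim$max_t         # value of max_t supplied above
sim$save_states   # value of save_states supplied above
head(sim$data)    # state of all entities at max_t
```

Accessing these elements is mostly useful for debugging. For analysis, the sim2data function remains the intended way to obtain a dataset.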
Author(s)
Robin Denz, Katharina Meiszl
References
Tang, Jiangjun, George Leu, and Hussein A. Abbass (2020). Simulation and Computational Red Teaming for Problem Solving. Hoboken: IEEE Press.
Banks, Jerry, John S. Carson II, Barry L. Nelson, and David M. Nicol (2014). Discrete-Event System Simulation. Vol. 5. Edinburgh Gate: Pearson Education Limited.
See Also
empty_dag
, node
, node_td
, sim2data
, plot.simDT
Examples
library(simDAG)
set.seed(454236)
## simulating death dependent on age, sex, bmi
## NOTE: this example is explained in detail in one of the vignettes
# initializing a DAG with nodes for generating data at t0
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=12, error=2)
# a function that increases age as time goes on
node_advance_age <- function(data) {
  return(data$age + 1/365)
}
# a function to calculate the probability of death as a
# linear combination of age, sex and bmi on the logit scale,
# transformed back to a probability with the inverse logit
prob_death <- function(data, beta_age, beta_sex, beta_bmi, intercept) {
  prob <- intercept + data$age*beta_age + data$sex*beta_sex + data$bmi*beta_bmi
  prob <- 1/(1 + exp(-prob))
  return(prob)
}
# adding time-dependent nodes to the dag
dag <- dag +
  node_td("age", type="advance_age", parents="age") +
  node_td("death", type="time_to_event", parents=c("age", "sex", "bmi"),
          prob_fun=prob_death, beta_age=0.1, beta_bmi=0.3, beta_sex=-0.2,
          intercept=-20, event_duration=Inf, save_past_events=FALSE)
# run simulation for 100 people, 50 days long
sim_dt <- sim_discrete_time(n_sim=100,
                            dag=dag,
                            max_t=50,
                            verbose=FALSE)