sim2data {simDAG} | R Documentation |
Transform sim_discrete_time
output into the start-stop, long- or wide-format
Description
This function transforms the output of the sim_discrete_time
function into a single data.table
structured in the start-stop format (also known as counting process format), the long format (one row per person per point in time) or the wide format (one row per person, one column per point in time for time-varying variables). See details.
Usage
sim2data(sim, to, use_saved_states=sim$save_states=="all",
overlap=FALSE, target_event=NULL,
keep_only_first=FALSE, as_data_frame=FALSE,
check_inputs=TRUE, ...)
## S3 method for class 'simDT'
as.data.table(x, keep.rownames=FALSE, to, overlap=FALSE,
target_event=NULL, keep_only_first=FALSE,
use_saved_states=x$save_states=="all",
check_inputs=TRUE, ...)
## S3 method for class 'simDT'
as.data.frame(x, row.names=NULL, optional=FALSE, to,
overlap=FALSE, target_event=NULL,
keep_only_first=FALSE,
use_saved_states=x$save_states=="all",
check_inputs=TRUE, ...)
Arguments
sim , x |
An object created with the |
to |
Specifies the format of the output data. Must be one of: |
use_saved_states |
Whether the saved simulation states (argument |
overlap |
Only used when |
target_event |
Only used when |
keep_only_first |
Only used when |
as_data_frame |
Set this argument to |
check_inputs |
Whether to perform input checks ( |
keep.rownames |
Currently not used. |
row.names |
Passed to the |
optional |
Passed to the |
... |
Further arguments passed to |
Details
The raw output of the sim_discrete_time
function may be difficult to use for further analysis. Using one of these functions, it is straightforward to transform that output into one of three formats, which are described below. Some caution is required when using this function; the relevant caveats are also described below. Both as.data.table
and as.data.frame
internally call sim2data
and only exist for user convenience.
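As a minimal sketch of this equivalence, the following compares a direct sim2data call with the as.data.frame method. The DAG reuses the pattern from the Examples section; the seed and the values of n_sim and max_t are arbitrary choices made for illustration.

```r
library(simDAG)

set.seed(42)

# probability function following the pattern used in the Examples section
prob_car_crash <- function(data) {
  ifelse(data$sex==1, 0.001, 0.01)
}

dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash,
          parents="sex", event_duration=3)

# small simulation (sizes chosen arbitrarily for illustration)
sim <- sim_discrete_time(dag, n_sim=10, max_t=100, save_states="last")

# direct call, requesting a data.frame instead of a data.table
d1 <- sim2data(sim, to="long", as_data_frame=TRUE)

# convenience method, which internally calls sim2data
d2 <- as.data.frame(sim, to="long")
```

Both calls are expected to yield data in the same long format, so which one to use is a matter of preference.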
The start-stop format:
The start-stop format (to="start_stop"
), also known as the counting process or period format, corresponds to a data.table
containing multiple rows per person, where each row corresponds to a period of time in which no variables changed. These intervals are defined by the start
and stop
columns. The start
column gives the time at which the period started, the stop
column denotes the time when the period ended. By default these intervals are coded to be non-overlapping, meaning that the edges of the periods are included in the period itself. For example, if the respective period is exactly 1 point in time long, start
will be equal to stop
. If overlapping periods are desired, the user can specify overlap=TRUE
instead.
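To make the default (non-overlapping) coding concrete, here is a small base R sketch that collapses the status of a single hypothetical person into start-stop intervals using run-length encoding. It only illustrates the format; it is not how sim2data is implemented, and the toy data and object names are assumptions made for this example.

```r
# toy status of a single person at time points 1 to 6 (hypothetical data)
status <- c(0, 0, 1, 1, 1, 0)

# run-length encoding finds stretches in which the value did not change
runs <- rle(status)

# last and first time point of each stretch
t_stop <- cumsum(runs$lengths)
t_start <- t_stop - runs$lengths + 1

intervals <- data.frame(start=t_start, stop=t_stop, status=runs$values)
print(intervals)
```

The last stretch covers exactly one point in time, so its start equals its stop, as described above.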
By default, all time-to-event nodes are treated equally. This is not optimal when the goal is to fit survival regression models. In this case, we usually want the target event to be treated in a special way (see for example Chiou et al. 2023). In general, instead of creating new intervals for it we want existing intervals to end at event times with the corresponding event indicator. This can be achieved by naming the target outcome in the target_event
argument. The previously specified duration of this target event is then ignored. If only the first occurrence of the event is of interest, users may also set keep_only_first=TRUE
to keep only information up until the first event per person.
The long format:
The long format (to="long"
) corresponds to a data.table
in which there is one row per person per point in time. The unique person identifier is stored in the .id
column and the unique points in time are given in the .time
column.
The wide format:
The wide format (to="wide"
) corresponds to a data.table
with exactly one row per person and one column per point in time for each time-varying variable. All time-varying variables are named using their original variable name with an underscore and the time-point appended to the end. For example, the variable sickness
at time-point 3 is named "sickness_3"
.
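The naming convention can be illustrated with a base R sketch that reshapes a small hypothetical long-format data set into the wide format. This uses stats::reshape purely for illustration and is not the code used by sim2data; the toy values are assumptions.

```r
# toy long-format data: 2 persons observed at 3 points in time
# (hypothetical values)
d_long <- data.frame(.id=rep(1:2, each=3),
                     .time=rep(1:3, times=2),
                     sex=rep(c(0, 1), each=3),
                     sickness=c(0, 1, 1, 0, 0, 1))

# one row per person, one "sickness_<time>" column per point in time;
# the time-fixed variable "sex" is kept as a single column
d_wide <- reshape(d_long, idvar=".id", timevar=".time",
                  v.names="sickness", direction="wide", sep="_")
print(d_wide)
```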
Output with use_saved_states=TRUE
:
If use_saved_states=TRUE
, this function will use only the data that is stored in the past_states
list of the sim
object to construct the resulting data.table
. This results in the following behavior, depending on which save_states
option was used in the original sim_discrete_time
function call:
save_states="all": A complete data.table in the desired format with information for all observations at all points in time for all variables will be created. This is the safest option, but also uses the most RAM and computational time.
save_states="at_t": A data.table in the desired format with correct information for all observations at the user specified times (save_states_at argument) for all variables will be created. The state of the simulation at all other times will be ignored, because it wasn't stored. This may be useful in some scenarios, but is generally discouraged unless you have good reasons to use it. A warning message about this is printed if check_inputs=TRUE.
save_states="last": Since only the last state of the simulation was saved, an error message is returned. No data.table is produced.
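The following sketch contrasts the first and the last case. It reuses the DAG pattern from the Examples section; the seed and the simulation sizes are arbitrary, and the tryCatch wrapper is only there so that the expected error can be inspected without stopping the script.

```r
library(simDAG)

set.seed(1234)

prob_car_crash <- function(data) {
  ifelse(data$sex==1, 0.001, 0.01)
}

dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash,
          parents="sex", event_duration=3)

# all states saved: use_saved_states defaults to TRUE in this case
sim_all <- sim_discrete_time(dag, n_sim=10, max_t=50, save_states="all")
d_all <- sim2data(sim_all, to="long")

# only the last state saved: forcing use_saved_states=TRUE should fail
sim_last <- sim_discrete_time(dag, n_sim=10, max_t=50, save_states="last")
res_last <- tryCatch(
  sim2data(sim_last, to="long", use_saved_states=TRUE),
  error=function(e) conditionMessage(e)
)
print(res_last)
```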
Output with use_saved_states=FALSE
:
If use_saved_states=FALSE
, this function will use only the data that is stored in the final state of the simulation (data
object in sim
) and information about node_time_to_event
objects. If all tx_nodes
are time_to_event
nodes or if all the user cares about are the time_to_event
nodes and time-fixed variables, this is the best option.
A data.table
in the desired format with correct information about all observations
at all times
is produced, but only with correct entries for some time-varying variables, namely time_to_event
nodes. Note that this information will also only be correct if the user used save_past_events=TRUE
in all time_to_event
nodes. Support for competing_events
nodes will be implemented in the future as well.
The other time-varying variables specified in the tx_nodes
argument will still appear in the output, but it will only be the value that was observed at the last state of the simulation.
Optional columns created using a time_to_event
node:
When using a time-dependent node of type "time_to_event"
with event_count=TRUE
or time_since_last=TRUE
, the columns created using either argument are not included in the output if to="start_stop"
, but will be included if to
is set to either "long"
or "wide"
. These columns are excluded because they would produce nonsensical intervals in the start-stop format, whereas they are meaningful in the other formats.
What about tx_nodes
that are not time_to_event
nodes?:
If you want the correct output for all tx_nodes
and one or more of those are not time_to_event
nodes, you will have to use save_states="all"
in the original sim_discrete_time
call. We plan to add support for competing_events
with other save_states
arguments in the near future. Support for arbitrary tx_nodes
will probably take longer.
Value
Returns a single data.table
(or data.frame
) containing all simulated variables in the desired format.
Note
Using the node names "start"
, "stop"
, ".id"
, ".time"
or names that are automatically generated by time-dependent nodes of type "time_to_event"
may break this function.
Author(s)
Robin Denz
References
Sy Han Chiou, Gongjun Xu, Jun Yan, and Chiung-Yu Huang (2023). "Regression Modeling for Recurrent Events Possibly with an Informative Terminal Event Using R Package reReg". In: Journal of Statistical Software. 105.5, pp. 1-34.
Examples
library(simDAG)
set.seed(435345)
## exemplary car crash simulation, where the probability for
## a car crash is dependent on the sex, and the probability of death is
## highly increased for 3 days after a car crash happened
prob_car_crash <- function(data) {
  ifelse(data$sex==1, 0.001, 0.01)
}
prob_death <- function(data) {
  ifelse(data$car_crash_event, 0.1, 0.001)
}
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash,
          parents="sex", event_duration=3) +
  node_td("death", type="time_to_event", prob_fun=prob_death,
          parents="car_crash_event", event_duration=Inf)
# generate some data, only saving the last state
# not a problem here, because the only time-varying nodes are
# time-to-event nodes where the event times are saved
sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last")
# transform to standard start-stop format
d_start_stop <- sim2data(sim, to="start_stop")
head(d_start_stop)
# transform to "death" centric start-stop format
# and keep only information until death, because it is a terminal event
# (this could be used in a Cox model)
d_start_stop <- sim2data(sim, to="start_stop", target_event="death",
                         keep_only_first=TRUE, overlap=TRUE)
head(d_start_stop)
# transform to long-format
d_long <- sim2data(sim, to="long")
head(d_long)
# transform to wide-format
d_wide <- sim2data(sim, to="wide")
#head(d_wide)