sim2data {simDAG}

R Documentation

Transform `sim_discrete_time` output into the start-stop, long- or wide-format

Description

This function transforms the output of the sim_discrete_time function into a single data.table structured in the start-stop format (also known as counting process format), the long format (one row per person per point in time) or the wide format (one row per person, one column per point in time for time-varying variables). See details.

Usage

sim2data(sim, to, use_saved_states=sim$save_states=="all",
         overlap=FALSE, target_event=NULL,
         keep_only_first=FALSE, as_data_frame=FALSE,
         check_inputs=TRUE, ...)

## S3 method for class 'simDT'
as.data.table(x, keep.rownames=FALSE, to, overlap=FALSE,
              target_event=NULL, keep_only_first=FALSE,
              use_saved_states=x$save_states=="all",
              check_inputs=TRUE, ...)

## S3 method for class 'simDT'
as.data.frame(x, row.names=NULL, optional=FALSE, to,
              overlap=FALSE, target_event=NULL,
              keep_only_first=FALSE,
              use_saved_states=x$save_states=="all",
              check_inputs=TRUE, ...)

Arguments

`sim`, `x`	An object created with the `sim_discrete_time` function.
`to`	Specifies the format of the output data. Must be one of: `"start_stop"`, `"long"`, `"wide"`.
`use_saved_states`	Whether the saved simulation states (argument `save_states` in `sim_discrete_time` function) should be used to construct the resulting data or not. See details.
`overlap`	Only used when `to="start_stop"`. Specifies whether the intervals should overlap or not. If `TRUE`, the `"stop"` column is simply increased by one, as compared to the output when `overlap=FALSE`. This means that changes for a given `t` are recorded at the start of the next interval, but the previous interval ends on that same day.
`target_event`	Only used when `to="start_stop"`. By default (keeping this argument at `NULL`) all time-to-event nodes are treated equally when creating the start-stop intervals. This can be changed by supplying a single character string to this argument, naming one time-to-event node. This node will then be treated as the outcome. The output then corresponds to what would be needed to fit a Cox proportional hazards model. See details.
`keep_only_first`	Only used when `to="start_stop"` and `target_event` is not `NULL`. Either `TRUE` or `FALSE` (default). If `TRUE`, all information after the first event per person will be discarded. Useful when `target_event` should be treated as a terminal variable.
`as_data_frame`	Set this argument to `TRUE` to return a `data.frame` instead of a `data.table`.
`check_inputs`	Whether to perform input checks (`TRUE` by default). Prints warning messages if the output may be incorrect due to missing information.
`keep.rownames`	Currently not used.
`row.names`	Passed to the `as.data.frame` function which is called on the finished `data.table`. See `?as.data.frame` for more information.
`optional`	Passed to the `as.data.frame` function which is called on the finished `data.table`. See `?as.data.frame` for more information.
`...`	Further arguments passed to `as.data.frame` (conversion from finished `data.table` to `data.frame`). Only available when directly calling `sim2data` with `as_data_frame=TRUE` or when using `as.data.frame.simDT`.

Details

The raw output of the sim_discrete_time function may be difficult to use for further analysis. Using one of these functions, it is straightforward to transform that output into one of three different formats, which are described below. Some caution is required when using this function; the relevant caveats are also described below. Both as.data.table and as.data.frame internally call sim2data and only exist for user convenience.

The start-stop format:

The start-stop format (to="start_stop"), also known as counting process or period format, corresponds to a data.table containing multiple rows per person, where each row corresponds to a period of time in which no variables changed. These intervals are defined by the start and stop columns. The start column gives the time at which the period started, the stop column denotes the time when the period ended. By default these intervals are coded to be non-overlapping, meaning that the edges of the periods are included in the period itself. For example, if the respective period is exactly 1 point in time long, start will be equal to stop. If overlapping periods are desired, the user can specify overlap=TRUE instead.
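The difference between the two codings can be illustrated on a small hand-made table. The data below is hypothetical and was not produced by sim2data; the sketch only mimics the documented behavior that overlap=TRUE increases the stop column by one.

```r
# hypothetical start-stop data for one person (made up for illustration,
# NOT actual sim2data output)
d <- data.frame(
  .id      = c(1, 1, 1),
  start    = c(1, 4, 7),
  stop     = c(3, 6, 10),
  sickness = c(FALSE, TRUE, FALSE)
)

# default coding (overlap=FALSE): each interval starts one time unit
# after the previous interval stopped
gap <- d$start[-1] - d$stop[-nrow(d)]

# coding described for overlap=TRUE: "stop" is simply increased by one,
# so each interval now ends where the next one starts
d_overlap <- d
d_overlap$stop <- d_overlap$stop + 1
touching <- d_overlap$stop[-nrow(d_overlap)] == d_overlap$start[-1]
```

The overlapping coding is the one that functions requiring stop to be strictly greater than start can work with, because intervals that are one point in time long no longer have start equal to stop.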

By default, all time-to-event nodes are treated equally. This is not optimal when the goal is to fit survival regression models. In this case, we usually want the target event to be treated in a special way (see for example Chiou et al. 2023). In general, instead of creating new intervals for it, we want existing intervals to end at event times with the corresponding event indicator. This can be achieved by naming the target outcome in the target_event argument. The previously specified duration of this target event is then ignored. If only the first occurrence of the event is of interest, users may also set keep_only_first=TRUE to keep only information up until the first event per person.
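The effect of keeping only the first event can be sketched in base R on a hand-made table. The data and the column name sickness are hypothetical, and this is not the internal implementation of sim2data; it only illustrates the described rule that everything after the first event per person is discarded.

```r
# hypothetical event-centric start-stop data (made up for illustration):
# each row ends either at an event (sickness = TRUE) or without one
d <- data.frame(
  .id      = c(1, 1, 1, 2, 2),
  start    = c(1, 5, 9, 1, 7),
  stop     = c(4, 8, 12, 6, 12),
  sickness = c(FALSE, TRUE, TRUE, FALSE, FALSE)
)

# number of events each person already had BEFORE the current row
n_prev <- ave(as.integer(d$sickness), d$.id,
              FUN = function(x) cumsum(x) - x)

# keep all rows up to and including the first event per person
d_first <- d[n_prev == 0, ]
```

In this sketch person 1 loses the third row, which lies after the first event, while person 2 never has an event and keeps both rows.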

The long format:

The long format (to="long") corresponds to a data.table in which there is one row per person per point in time. The unique person identifier is stored in the .id column and the unique points in time are given in the .time column.

The wide format:

The wide format (to="wide") corresponds to a data.table with exactly one row per person and one column per point in time for each time-varying variable. All time-varying variables are coded as their original variable name with an underscore and the time-point appended to the end. For example, the variable sickness at time-point 3 is named "sickness_3".
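The relationship between the long and the wide format, including the naming convention, can be sketched with base R's reshape function. The data below is hand-made and hypothetical, and reshape is used here only for illustration; it is an assumption that the result resembles sim2data output, not a description of how sim2data works internally.

```r
# hypothetical long-format data: one row per person per point in time
d_long <- data.frame(
  .id      = rep(1:2, each = 3),
  .time    = rep(1:3, times = 2),
  sex      = rep(c(0, 1), each = 3),
  sickness = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# one row per person; the time-varying variable "sickness" is spread
# into one column per point in time, named <variable>_<time>
d_wide <- reshape(d_long, direction = "wide", idvar = ".id",
                  timevar = ".time", v.names = "sickness", sep = "_")
names(d_wide)
```

The time-fixed variable sex stays in a single column, while sickness is split into sickness_1, sickness_2 and sickness_3.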

Output with use_saved_states=TRUE:

If use_saved_states=TRUE, this function will use only the data that is stored in the past_states list of the sim object to construct the resulting data.table. This results in the following behavior, depending on which save_states option was used in the original sim_discrete_time function call:

save_states="all": A complete data.table in the desired format with information for all observations at all points in time for all variables will be created. This is the safest option, but also uses the most RAM and computational time.
save_states="at_t": A data.table in the desired format with correct information for all observations at the user specified times (save_states_at argument) for all variables will be created. The state of the simulation at all other times will be ignored, because it wasn't stored. This may be useful in some scenarios, but is generally discouraged unless you have good reasons to use it. A warning message about this is printed if check_inputs=TRUE.
save_states="last": Since only the last state of the simulation was saved, an error is raised. No data.table is produced.
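A minimal sketch of the save_states="all" case is given below. It reuses the DAG from the Examples section of this page with a smaller number of persons and time points; the seed and the sizes are arbitrary choices made for illustration.

```r
library(simDAG)

set.seed(1234)

# same probability functions as in the Examples section of this page
prob_car_crash <- function(data) {
  ifelse(data$sex==1, 0.001, 0.01)
}

prob_death <- function(data) {
  ifelse(data$car_crash_event, 0.1, 0.001)
}

dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash,
          parents="sex", event_duration=3) +
  node_td("death", type="time_to_event", prob_fun=prob_death,
          parents="car_crash_event", event_duration=Inf)

# store the state of the simulation at every point in time
sim <- sim_discrete_time(dag, n_sim=10, max_t=50, save_states="all")

# use_saved_states would default to TRUE here because save_states="all";
# it is spelled out only to make the choice explicit
d_long <- sim2data(sim, to="long", use_saved_states=TRUE)
dim(d_long)
```

Because every state is stored, memory use grows with both n_sim and max_t, which is why the sketch uses small values for both.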

Output with use_saved_states=FALSE:

If use_saved_states=FALSE, this function will use only the data that is stored in the final state of the simulation (data object in sim) and information about node_time_to_event objects. If all tx_nodes are time_to_event nodes or if all the user cares about are the time_to_event nodes and time-fixed variables, this is the best option.

A data.table in the desired format with correct information about all observations at all times is produced, but only with correct entries for some time-varying variables, namely time_to_event nodes. Note that this information will also only be correct if the user used save_past_events=TRUE in all time_to_event nodes. Support for competing_events nodes will be implemented in the future as well.

The other time-varying variables specified in the tx_nodes argument will still appear in the output, but it will only be the value that was observed at the last state of the simulation.

Optional columns created using a time_to_event node:

When using a time-dependent node of type "time_to_event" with event_count=TRUE or time_since_last=TRUE, the columns created using either argument are not included in the output if to="start_stop", but will be included if to is set to either "long" or "wide". The reason for this behavior is that including these columns would lead to nonsensical intervals in the start-stop format, whereas they are meaningful in the other formats.

What about tx_nodes that are not time_to_event nodes?:

If you want the correct output for all tx_nodes and one or more of those are not time_to_event nodes, you will have to use save_states="all" in the original sim_discrete_time call. We plan to add support for competing_events with other save_states arguments in the near future. Support for arbitrary tx_nodes will probably take longer.

Value

Returns a single data.table (or data.frame) containing all simulated variables in the desired format.

Note

Using the node names "start", "stop", ".id", ".time" or names that are automatically generated by time-dependent nodes of type "time_to_event" may break this function.

Author(s)

Robin Denz

References

Sy Han Chiou, Gongjun Xu, Jun Yan, and Chiung-Yu Huang (2023). "Regression Modeling for Recurrent Events Possibly with an Informative Terminal Event Using R Package reReg". In: Journal of Statistical Software. 105.5, pp. 1-34.

Examples

library(simDAG)

set.seed(435345)

## exemplary car crash simulation, where the probability for
## a car crash is dependent on the sex, and the probability of death is
## highly increased for 3 days after a car crash happened
prob_car_crash <- function(data) {
  ifelse(data$sex==1, 0.001, 0.01)
}

prob_death <- function(data) {
  ifelse(data$car_crash_event, 0.1, 0.001)
}

dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node_td("car_crash", type="time_to_event", prob_fun=prob_car_crash,
          parents="sex", event_duration=3) +
  node_td("death", type="time_to_event", prob_fun=prob_death,
          parents="car_crash_event", event_duration=Inf)

# generate some data, only saving the last state
# not a problem here, because the only time-varying nodes are
# time-to-event nodes where the event times are saved
sim <- sim_discrete_time(dag, n_sim=20, max_t=500, save_states="last")

# transform to standard start-stop format
d_start_stop <- sim2data(sim, to="start_stop")
head(d_start_stop)

# transform to "death"-centric start-stop format
# and keep only information until death, because it is a terminal event
# (this could be used in a Cox model)
d_start_stop <- sim2data(sim, to="start_stop", target_event="death",
                         keep_only_first=TRUE, overlap=TRUE)
head(d_start_stop)

# transform to long-format
d_long <- sim2data(sim, to="long")
head(d_long)

# transform to wide-format
d_wide <- sim2data(sim, to="wide")
#head(d_wide)

[Package simDAG version 0.1.2 Index]
