gformula {gfoRmula} | R Documentation |
Estimation of Survival Outcome, Continuous End-of-Follow-Up Outcome, or Binary End-of-Follow-Up Outcome Under the Parametric G-Formula
Description
Based on an observed data set, this function estimates the risk over time (for survival outcomes), outcome mean at end-of-follow-up (for continuous end-of-follow-up outcomes), or outcome probability at end-of-follow-up (for binary end-of-follow-up outcomes) under multiple user-specified interventions using the parametric g-formula. See McGrath et al. (2020) for further details concerning the application and implementation of the parametric g-formula.
Usage
gformula(
obs_data,
id,
time_points = NULL,
time_name,
covnames,
covtypes,
covparams,
covfits_custom = NA,
covpredict_custom = NA,
histvars = NULL,
histories = NA,
basecovs = NA,
outcome_name,
outcome_type,
ymodel,
compevent_name = NULL,
compevent_model = NA,
compevent_cens = FALSE,
censor_name = NULL,
censor_model = NA,
intvars = NULL,
interventions = NULL,
int_times = NULL,
int_descript = NULL,
ref_int = 0,
intcomp = NA,
visitprocess = NA,
restrictions = NA,
yrestrictions = NA,
compevent_restrictions = NA,
baselags = FALSE,
nsimul = NA,
sim_data_b = FALSE,
seed,
nsamples = 0,
parallel = FALSE,
ncores = NA,
ci_method = "percentile",
threads,
model_fits = FALSE,
boot_diag = FALSE,
show_progress = TRUE,
ipw_cutoff_quantile = NULL,
ipw_cutoff_value = NULL,
int_visit_type = NULL,
...
)
Arguments
obs_data |
Data table containing the observed data. |
id |
Character string specifying the name of the ID variable in |
time_points |
Number of time points to simulate. By default, this argument is set equal to the maximum
number of records that |
time_name |
Character string specifying the name of the time variable in |
covnames |
Vector of character strings specifying the names of the time-varying covariates in |
covtypes |
Vector of character strings specifying the "type" of each time-varying covariate included in |
covparams |
List of vectors, where each vector contains information for
one parameter used in the modeling of the time-varying covariates (e.g.,
model statement, family, link function, etc.). Each vector
must be the same length as |
covfits_custom |
Vector containing custom fit functions for time-varying covariates that
do not fall within the pre-defined covariate types. It should be in
the same order |
covpredict_custom |
Vector containing custom prediction functions for time-varying
covariates that do not fall within the pre-defined covariate types.
It should be in the same order as |
histvars |
List of vectors. The kth vector specifies the names of the variables for which the kth history function
in |
histories |
Vector of history functions to apply to the variables specified in |
basecovs |
Vector of character strings specifying the names of baseline covariates in |
outcome_name |
Character string specifying the name of the outcome variable in |
outcome_type |
Character string specifying the "type" of outcome. The possible "types" are: |
ymodel |
Model statement for the outcome variable. |
compevent_name |
Character string specifying the name of the competing event variable in |
compevent_model |
Model statement for the competing event variable. The default is |
compevent_cens |
Logical scalar indicating whether to treat competing events as censoring events.
This argument is only applicable for survival outcomes and when a competing even model is supplied (i.e., |
censor_name |
Character string specifying the name of the censoring variable in |
censor_model |
Model statement for the censoring variable. Only applicable when using inverse probability weights to estimate the natural course means / risk from the observed data. See "Details". |
intvars |
List, whose elements are vectors of character strings. The kth vector in |
interventions |
List, whose elements are lists of vectors. Each list in |
int_times |
List, whose elements are lists of vectors. The kth list in |
int_descript |
Vector of character strings, each describing an intervention. It must
be in same order as the entries in |
ref_int |
Integer denoting the intervention to be used as the
reference for calculating the risk ratio and risk difference. 0 denotes the
natural course, while subsequent integers denote user-specified
interventions in the order that they are
named in |
intcomp |
List of two numbers indicating a pair of interventions to be compared by a hazard ratio.
The default is |
visitprocess |
List of vectors. Each vector contains as its first entry
the covariate name of a visit process; its second entry
the name of a covariate whose modeling depends on the
visit process; and its third entry the maximum number
of consecutive visits that can be missed before an
individual is censored. The default is |
restrictions |
List of vectors. Each vector contains as its first entry a covariate for which
a priori knowledge of its distribution is available; its second entry a condition
under which no knowledge of its distribution is available and that must be |
yrestrictions |
List of vectors. Each vector containins as its first entry
a condition and its second entry an integer. When the
condition is |
compevent_restrictions |
List of vectors. Each vector containins as its first entry
a condition and its second entry an integer. When the
condition is |
baselags |
Logical scalar for specifying the convention used for lagi and lag_cumavgi terms in the model statements when pre-baseline times are not
included in |
nsimul |
Number of subjects for whom to simulate data. By default, this argument is set
equal to the number of subjects in |
sim_data_b |
Logical scalar indicating whether to return the simulated data set. If bootstrap samples are used (i.e., |
seed |
Starting seed for simulations and bootstrapping. |
nsamples |
Integer specifying the number of bootstrap samples to generate. The default is 0. |
parallel |
Logical scalar indicating whether to parallelize simulations of different interventions to multiple cores. |
ncores |
Integer specifying the number of CPU cores to use in parallel
simulation. This argument is required when parallel is set to |
ci_method |
Character string specifying the method for calculating the bootstrap 95% confidence intervals, if applicable. The options are |
threads |
Integer specifying the number of threads to be used in |
model_fits |
Logical scalar indicating whether to return the fitted models. Note that if this argument is set to |
boot_diag |
Logical scalar indicating whether to return the parametric g-formula estimates as well as the coefficients, standard errors, and variance-covariance matrices of the parameters of the fitted models in the bootstrap samples. The default is |
show_progress |
Logical scalar indicating whether to print a progress bar for the number of bootstrap samples completed in the R console. This argument is only applicable when |
ipw_cutoff_quantile |
Percentile by which to truncate inverse probability weights. The default is |
ipw_cutoff_value |
Cutoff value by which to truncate inverse probability weights. The default is |
int_visit_type |
Vector of logicals. The kth element is a logical specifying whether to carry forward the intervened value (rather than the natural value) of the treatment variables(s) when performing a carry forward restriction type for the kth intervention in |
... |
Other arguments, which are passed to the functions in |
Details
To assess model misspecification in the parametric g-formula, users can obtain inverse probability (IP) weighted estimates of the natural course risk and/or means of the time-varying covariates from the observed data. See Chiu et al. (2023) for details. In addition to the general requirements described in McGrath et al. (2020), the requirements for the input data set and the call to the gformula function for such analyses are described below.
Users need to include a column in obs_data
with a time-varying censoring variable.
Users need to indicate the name of the censoring variable and a model statement for the censoring variable with parameters censor_name
and censor_model
, respectively.
When competing events are present, users need to include a column in obs_data
with a time-varying indicator of the competing event variable and need to indicate the name of the competing event variable and the corresponding model statement with parameters compevent_name
and compevent_model
, respectively.
Users need to indicate whether to treat competing events as censoring events with the compevent_cens
parameter.
Finally, users can specify how to truncate IP weights with the ipw_cutoff_quantile
or ipw_cutoff_value
parameters.
In addition to the package output described in McGrath et al. (2020), the output will display estimates of the "cumulative percent intervened on" and the "average percent intervened on". When using a custom intervention function, users need to specify whether each individual at that time point is eligible to contribute person-time to the percent intervened on calculations. Specifically, this must be specified in the eligible_pt
column of newdf
. By default, eligible_pt
is set to TRUE
for each individual at each time point in custom interventions.
Value
An object of class gformula_survival
. The object is a list with the following components:
result |
Results table. For survival outcomes, this contains the estimated risk, risk difference, and risk ratio for all interventions (inculding the natural course) at each time point. For continuous end-of-follow-up outcomes, this contains estimated mean outcome, mean difference, and mean ratio for all interventions (inculding natural course) at the last time point. For binary end-of-follow-up outcomes, this contains the estimated outcome probability, probability difference, and probability ratio for all interventions (inculding natural course) at the last time point. For all outcome types, this also contains the "cumulative percent intervened on" and the "average percent intervened on". If bootstrapping was used, the results table includes the bootstrap risk / mean / probability difference, ratio, standard error, and 95% confidence interval. |
coeffs |
A list of the coefficients of the fitted models. |
stderrs |
A list of the standard errors of the coefficients of the fitted models. |
vcovs |
A list of the variance-covariance matrices of the parameters of the fitted models. |
rmses |
A list of root mean square error (RMSE) values of the fitted models. |
hazardratio_val |
Hazard ratio between two interventions (if applicable). |
fits |
A list of the fitted models for the time-varying covariates, outcome, and competing event (if applicable). If |
sim_data |
A list of data tables of the simulated data. Each element in the list corresponds to one of the interventions. If the argument |
IP_weights |
A numeric vector specifying the inverse probability weights. See "Details". |
bootests |
A data.table containing the bootstrap replicates of the parametric g-formula estimates. If |
bootcoeffs |
A list, where the kth element is a list containing the coefficients of the fitted models corresponding to the kth bootstrap sample. If |
bootstderrs |
A list, where the kth element is a list containing the standard errors of the coefficients of the fitted models corresponding to the kth bootstrap sample. If |
bootvcovs |
A list, where the kth element is a list containing the variance-covariance matrices of the parameters of the fitted models corresponding to the kth bootstrap sample. If |
... |
Some additional elements. |
The results for the g-formula simulation are printed with the print.gformula_survival
, print.gformula_continuous_eof
, and print.gformula_binary_eof
functions. To generate graphs comparing the mean estimated covariate values and risks over time and mean observed covariate values and risks over time, use the plot.gformula_survival
, plot.gformula_continuous_eof
, and plot.gformula_binary_eof
functions.
References
Chiu YH, Wen L, McGrath S, Logan R, Dahabreh IJ, Hernán MA. Evaluating model specification when using the parametric g-formula in the presence of censoring. American Journal of Epidemiology. 2023;192:1887–1895.
McGrath S, Lin V, Zhang Z, Petito LC, Logan RW, Hernán MA, and JG Young. gfoRmula: An R package for estimating the effects of sustained treatment strategies via the parametric g-formula. Patterns. 2020;1:100008.
Robins JM. A new approach to causal inference in mortality studies with a sustained exposure period: application to the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512. [Errata (1987) in Computers and Mathematics with Applications 14, 917.-921. Addendum (1987) in Computers and Mathematics with Applications 14, 923-.945. Errata (1987) to addendum in Computers and Mathematics with Applications 18, 477.].
Examples
## Estimating the effect of static treatment strategies on risk of a
## failure event
id <- 'id'
time_points <- 7
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
outcome_type <- 'survival'
covtypes <- c('binary', 'bounded normal', 'binary')
histories <- c(lagged, lagavg)
histvars <- list(c('A', 'L1', 'L2'), c('L1', 'L2'))
covparams <- list(covmodels = c(L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 +
L3 + t0,
L2 ~ lag1_A + L1 + lag_cumavg1_L1 +
lag_cumavg1_L2 + L3 + t0,
A ~ lag1_A + L1 + L2 + lag_cumavg1_L1 +
lag_cumavg1_L2 + L3 + t0))
ymodel <- Y ~ A + L1 + L2 + L3 + lag1_A + lag1_L1 + lag1_L2 + t0
intvars <- list('A', 'A')
interventions <- list(list(c(static, rep(0, time_points))),
list(c(static, rep(1, time_points))))
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000
gform_basic <- gformula(obs_data = basicdata_nocomp, id = id,
time_points = time_points,
time_name = time_name, covnames = covnames,
outcome_name = outcome_name,
outcome_type = outcome_type, covtypes = covtypes,
covparams = covparams, ymodel = ymodel,
intvars = intvars,
interventions = interventions,
int_descript = int_descript,
histories = histories, histvars = histvars,
basecovs = c('L3'), nsimul = nsimul,
seed = 1234)
gform_basic
## Estimating the effect of treatment strategies on risk of a failure event
## when competing events exist
id <- 'id'
time_points <- 7
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
compevent_name <- 'D'
outcome_type <- 'survival'
covtypes <- c('binary', 'bounded normal', 'binary')
histories <- c(lagged, lagavg)
histvars <- list(c('A', 'L1', 'L2'), c('L1', 'L2'))
covparams <- list(covlink = c('logit', 'identity', 'logit'),
covmodels = c(L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 +
L3 + as.factor(t0),
L2 ~ lag1_A + L1 + lag_cumavg1_L1 +
lag_cumavg1_L2 + L3 + as.factor(t0),
A ~ lag1_A + L1 + L2 + lag_cumavg1_L1 +
lag_cumavg1_L2 + L3 + as.factor(t0)))
ymodel <- Y ~ A + L1 + L2 + lag1_A + lag1_L1 + lag1_L2 + L3 + as.factor(t0)
compevent_model <- D ~ A + L1 + L2 + lag1_A + lag1_L1 + lag1_L2 + L3 + as.factor(t0)
intvars <- list('A', 'A')
interventions <- list(list(c(static, rep(0, time_points))),
list(c(static, rep(1, time_points))))
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000
gform_basic <- gformula(obs_data = basicdata, id = id,
time_points = time_points,
time_name = time_name, covnames = covnames,
outcome_name = outcome_name,
outcome_type = outcome_type,
compevent_name = compevent_name,
covtypes = covtypes,
covparams = covparams, ymodel = ymodel,
compevent_model = compevent_model,
intvars = intvars, interventions = interventions,
int_descript = int_descript,
histories = histories, histvars = histvars,
basecovs = c('L3'), nsimul = nsimul,
seed = 1234)
gform_basic
## Estimating the effect of treatment strategies on the mean of a continuous
## end of follow-up outcome
library('Hmisc')
id <- 'id'
time_name <- 't0'
covnames <- c('L1', 'L2', 'A')
outcome_name <- 'Y'
outcome_type <- 'continuous_eof'
covtypes <- c('categorical', 'normal', 'binary')
histories <- c(lagged)
histvars <- list(c('A', 'L1', 'L2'))
covparams <- list(covmodels = c(L1 ~ lag1_A + lag1_L1 + L3 + t0 +
rcspline.eval(lag1_L2, knots = c(-1, 0, 1)),
L2 ~ lag1_A + L1 + lag1_L1 + lag1_L2 + L3 + t0,
A ~ lag1_A + L1 + L2 + lag1_L1 + lag1_L2 + L3 + t0))
ymodel <- Y ~ A + L1 + L2 + lag1_A + lag1_L1 + lag1_L2 + L3
intvars <- list('A', 'A')
interventions <- list(list(c(static, rep(0, 7))),
list(c(static, rep(1, 7))))
int_descript <- c('Never treat', 'Always treat')
nsimul <- 10000
gform_cont_eof <- gformula(obs_data = continuous_eofdata,
id = id, time_name = time_name,
covnames = covnames, outcome_name = outcome_name,
outcome_type = outcome_type, covtypes = covtypes,
covparams = covparams, ymodel = ymodel,
intvars = intvars, interventions = interventions,
int_descript = int_descript,
histories = histories, histvars = histvars,
basecovs = c("L3"), nsimul = nsimul, seed = 1234)
gform_cont_eof
## Estimating the effect of threshold interventions on the mean of a binary
## end of follow-up outcome
outcome_type <- 'binary_eof'
id <- 'id_num'
time_name <- 'time'
covnames <- c('cov1', 'cov2', 'treat')
outcome_name <- 'outcome'
histories <- c(lagged, cumavg)
histvars <- list(c('treat', 'cov1', 'cov2'), c('cov1', 'cov2'))
covtypes <- c('binary', 'zero-inflated normal', 'normal')
covparams <- list(covmodels = c(cov1 ~ lag1_treat + lag1_cov1 + lag1_cov2 +
cov3 + time,
cov2 ~ lag1_treat + cov1 + lag1_cov1 +
lag1_cov2 + cov3 + time,
treat ~ lag1_treat + cumavg_cov1 +
cumavg_cov2 + cov3 + time))
ymodel <- outcome ~ treat + cov1 + cov2 + lag1_cov1 + lag1_cov2 + cov3
intvars <- list('treat', 'treat')
interventions <- list(list(c(static, rep(0, 7))),
list(c(threshold, 1, Inf)))
int_descript <- c('Never treat', 'Threshold - lower bound 1')
nsimul <- 10000
ncores <- 2
gform_bin_eof <- gformula(obs_data = binary_eofdata,
outcome_type = outcome_type, id = id,
time_name = time_name, covnames = covnames,
outcome_name = outcome_name, covtypes = covtypes,
covparams = covparams, ymodel = ymodel,
intvars = intvars, interventions = interventions,
int_descript = int_descript, histories = histories,
histvars = histvars, basecovs = c("cov3"),
seed = 1234, parallel = TRUE, nsamples = 5,
nsimul = nsimul, ncores = ncores)
gform_bin_eof
## Using IP weighting to estimate natural course risk
## Only the natural course intervention is included for simplicity
covnames <- c('L', 'A')
histories <- c(lagged)
histvars <- list(c('A', 'L'))
ymodel <- Y ~ L + A
covtypes <- c('binary', 'normal')
covparams <- list(covmodels = c(L ~ lag1_L + lag1_A,
A ~ lag1_L + L + lag1_A))
censor_name <- 'C'
censor_model <- C ~ L
res_censor <- gformula(obs_data = censor_data, id = 'id',
time_name = 't0', covnames = covnames,
outcome_name = 'Y', outcome_type = 'survival',
censor_name = censor_name, censor_model = censor_model,
covtypes = covtypes,
covparams = covparams, ymodel = ymodel,
intvars = NULL, interventions = NULL, int_descript = NULL,
histories = histories, histvars = histvars,
seed = 1234)
plot(res_censor)