R: Simulate data from a multivariate joint model

simData {gmvjoint}

R Documentation

Simulate data from a multivariate joint model

Description

Simulate multivariate longitudinal and survival data from a joint model specification, with potential mixture of response families. Implementation is similar to existing packages (e.g. joineR, joineRML).

Usage

simData(
  n = 250,
  ntms = 10,
  fup = 5,
  family = list("gaussian", "gaussian"),
  sigma = list(0.16, 0.16),
  beta = rbind(c(1, 0.1, 0.33, -0.5), c(1, 0.1, 0.33, -0.5)),
  D = NULL,
  gamma = c(0.5, -0.5),
  zeta = c(0.05, -0.3),
  theta = c(-4, 0.2),
  cens.rate = exp(-3.5),
  regular.times = TRUE,
  dof = Inf,
  random.formulas = NULL,
  disp.formulas = NULL,
  return.ranefs = FALSE
)

Arguments

`n`	the number of subjects
`ntms`	the number of time points
`fup`	the maximum follow-up time, such that t = [0, ..., fup] with length `ntms`. In instances where subject `i` doesn't fail before `fup`, their censoring time is set as `fup + 0.1`.
`family`	a `K`-`list` of families, see details.
`sigma`	a `K`-`list` of dispersion parameters corresponding to the order of `family`, and matching `disp.formulas` specification; see details.
`beta`	a `K \times 4` matrix specifying fixed effects for each `K` parameter, in the order (Intercept), time, continuous, binary.
`D`	a positive-definite matrix specifying the variance-covariance matrix for the random effects. If not supplied an identity matrix is assumed.
`gamma`	a `K`-vector specifying the association parameters for each longitudinal outcome.
`zeta`	a vector of length 2 specifying the coefficients for the baseline covariates in the survival sub-model, in the order of continuous and binary.
`theta`	parameters to control the failure rate, see baseline hazard.
`cens.rate`	parameter for `rexp` to generate censoring times for each subject.
`regular.times`	logical, if `regular.times = TRUE` (the default), then every subject will have the same follow-up times defined by `seq(0, fup, length.out = ntms)`. If `regular.times = FALSE` then follow-up times are set as random draws from a uniform distribution with maximum `fup`.
`dof`	integer, specifies the degrees of freedom of the multivariate t-distribution used to generate the random effects. If specified, this t-distribution is used. If left at the default `dof=Inf` then the random effects are drawn from a multivariate normal distribution.
`random.formulas`	allows user to specify if an intercept-and-slope (`~ time`) or intercept-only (`~1`) random effects structure should be used on a response-by-response basis. Defaults to an intercept-and-slope for all responses.
`disp.formulas`	allows user to specify the dispersion model simulated. Intended use is to allow swapping between intercept only (the default) and a time-varying one (`~ time`). Note that this should be a `K`-`list` of formula objects, so if only one dispersion model is wanted, then an intercept-only should be specified for remaining sub-models. The corresponding item in list of `sigma` parameters should be of appropriate size. Defaults to an intercept-only model.
`return.ranefs`	a logical determining whether the true random effects should be returned. This is largely for internal/simulation use. Default `return.ranefs = FALSE`.

Details

simData simulates data from a multivariate joint model with a mixture of families for each k=1,\dots,K response. The specification of family changes requisite dispersion parameter sigma, if applicable. The family list can (currently) contain:

"gaussian": Simulated with identity link, corresponding item in sigma will be the variance.
"poisson": Simulated with log link, corresponding dispersion in sigma can be anything, as it doesn't impact simulation.
"binomial": Simulated with logit link, corresponding dispersion in sigma can be anything, as it doesn't impact simulation.
"negbin": Simulated with a log link, corresponding item in sigma will be the overdispersion defined on the log scale. Simulated variance is \mu+\mu^2/\varphi.
"genpois": Simulated with a log link, corresponding item in sigma will be the dispersion. Values < 0 correspond to under-dispersion, and values > 0 over- dispersion. See rgenpois for more information. Simulated variance is (1+\varphi)^2\mu.
"Gamma": Simulated with a log link, corresponding item in sigma will be the shape parameter, defined on the log-scale.

Therefore, for families "negbin", "Gamma", "genpois", the user can define the dispersion model desired in disp.formulas, which creates a data matrix W. For the "negbin" and "Gamma" cases, we define \varphi_i=\exp\{W_i\sigma_i\} (i.e. the exponent of the linear predictor of the dispersion model); and for "genpois" the identity of the linear is used.

Value

A list of two data.frames: One with the full longitudinal data, and another with only survival data. If return.ranefs=TRUE, a matrix of the true b values is also returned. By default (i.e. no arguments provided), a bivariate Gaussian set of joint data is returned.

Baseline hazard

When simulating the survival time, the baseline hazard is a Gompertz distribution controlled by theta=c(x,y):

\lambda_0(t) = \exp{x + yt}

where y is the shape parameter, and the scale parameter is \exp{x}.

Author(s)

James Murray (j.murray7@ncl.ac.uk).

References

Austin PC. Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Stat Med. 2012; 31(29): 3946-3958.

Examples

# 1) A set of univariate data ------------------------------------------
beta <- c(2.0, 0.33, -0.25, 0.15)
# Note that by default arguments are bivariate, so need to specify the univariate case
univ.data <- simData(beta = beta,    
                     gamma = 0.15, sigma = list(0.2), family = list("gaussian"), 
                     D = diag(c(0.25, 0.05)))
                     
# 2) Univariate data, with failure rate controlled ---------------------
# In reality, many facets contribute to the simulated failure rate, in 
# this example, we'll just atler the baseline hazard via 'theta'.
univ.data.highfail <- simData(beta = beta,
                              gamma = 0.15, sigma = list(0.0), family = list("poisson"),
                              D = diag(c(0.40, 0.08)), theta = c(-2, 0.1))

# 3) Trivariate (K = 3) mixture of families with dispersion parameters -
beta <- do.call(rbind, replicate(3, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(0.3, -0.3, 0.3)
D <- diag(c(0.25, 0.09, 0.25, 0.05, 0.25, 0.09))
family <- list('gaussian', 'genpois', 'negbin')
sigma <- list(.16, 1.5, log(1.5))
triv.data <- simData(ntms=15, family = family, sigma = sigma, beta = beta, D = D, 
                     gamma = gamma, theta = c(-3, 0.2), zeta = c(0,-.2))

# 4) K = 4 mixture of families with/out dispersion ---------------------
beta <- do.call(rbind, replicate(4, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(-0.75, 0.3, -0.6, 0.5)
D <- diag(c(0.25, 0.09, 0.25, 0.05, 0.25, 0.09, 0.16, 0.02))
family <- list('gaussian', 'poisson', 'binomial', 'gaussian')
sigma <- list(.16, 0, 0, .05) # 0 can be anything here, as it is ignored internally.
mix.data <- simData(ntms=15, family = family, sigma = sigma, beta = beta, D = D, gamma = gamma,
                    theta = c(-3, 0.2), zeta = c(0,-.2))
                    
# 5) Bivariate joint model with two dispersion models. -----------------
disp.formulas <- list(~time, ~time)          # Two time-varying dispersion models
sigma <- list(c(0.00, -0.10), c(0.10, 0.15)) # specified in form of intercept, slope
D <- diag(c(.25, 0.04, 0.50, 0.10))
disp.data <- simData(family = list("genpois", "negbin"), sigma = sigma, D = D,
                     beta = rbind(c(0, 0.05, -0.15, 0.00), 1 + c(0, 0.25, 0.15, -0.20)),
                     gamma = c(1.5, 1.5),
                     disp.formulas = disp.formulas, fup = 5)            

# 6) Trivariate joint model with mixture of random effects models ------
# It can be hard to e.g. fit a binomial model on an intercept and slope, since e.g.
# glmmTMB might struggle to accurately fit it (singular fits, etc.). To that end, could
# swap the corresponding random effects specification to be an intercept-only.
family <- list("gaussian", "binomial", "gaussian")
# A list of formulae, even though we want to change the second sub-model's specification
# we need to specify the rest of the items, too (same as disp.formulas, sigma).
random.formulas <- list(~time, ~1, ~time)
beta <- rbind(c(2, -0.2, 0.5, -0.25), c(0, 0.5, 1, -1), c(-2, 0.2, -0.5, 0.25))
# NOTE that the specification of RE matrix will need to match.
D <- diag(c(0.25, 0.09, 1, 0.33, 0.05))
# Simulate data, and return REs as a sanity check...
mix.REspec.data <- simData(beta = beta, D = D, family = family,
                           gamma = c(-0.5, 1, 0.5), sigma = list(0.15, 0, 0.15),
                           random.formulas = random.formulas, return.ranefs = TRUE)

[Package gmvjoint version 0.4.0 Index]