R: Fit a joint model to time-to-event and multivariate...

joint {gmvjoint}

R Documentation

Fit a joint model to time-to-event and multivariate longitudinal data

Description

Fit a joint model to time-to-event and multivariate longitudinal data

Usage

joint(
  long.formulas,
  surv.formula,
  data,
  family,
  disp.formulas = NULL,
  control = list()
)

Arguments

`long.formulas`	A list of formula objects specifying the `K` responses. Each must be usable by `glmmTMB`. A restriction is that unique identifiers must be named `id`, and increment in intervals of at exactly one. The variable for time must be named `time`.
`surv.formula`	A formula specifying the time-to-event sub-model. Must be usable by `coxph`.
`data`	A `data.frame` containing all covariates and responses.
`family`	A list of length `K` containing strings denoting the exponential families for each longitudinal sub-model, corresponding in order to `long.formulas`. For choices of `family`, see details.
`disp.formulas`	An optional list of length `K` specifying the dispersion models wanted for each longitudinal sub-model, corresponding in order to `long.formulas`. Defaults to `disp.formulas = NULL`. See details for more information.
`control`	A list of control values: `verbose` Logical: If `TRUE`, at each iteration parameter information will be printed to console. Default is `verbose=FALSE`. `conv` Character: Convergence criterion, see details. `tol.abs` Numeric: Tolerance value used to assess convergence, see details. Default is `tol.abs=1e-3`. `tol.rel` Numeric: Tolerance value used to assess convergence, see details. Default is `tol.rel=1e-2`. `tol.den` Numeric: Tolerance value used to assess convergence, see details. Default is `tol.den=1e-3`. `tol.thr` Numeric: Threshold used when `conv = 'sas'`, see details. Default is `tol.thr=1e-1`. `grad.eps` Numeric: Step size for numerical differentiation routines used to calculate the gradient in updates to dispersion parameters. This defaults to the cube root of machine tolerance. If a different step size is wanted for each response, a list can also be provided, with each of its elements corresponding to each longitudinal response (even if not fitted with a dispersion model). `hess.eps` Numeric: Step size for numerical differentiation routines used to calculate the hessian in updates to dispersion parameters. This defaults to the fourth root of machine tolerance. Behaves in same way as `grad.eps` for more information. `inits` List: list of initial conditions, any/all of the following can be specified (largely for bootstrapping purposes). Accepts elements named: `D`, which should be an appropriately-dimensioned variance-covariance matrix; `beta`, a vector containing all fixed effects; `sigma` a list containing all dispersion parameters, with non-applicable elements set to zero; `gamma` a vector containing all association parameters; `zeta` a vector containing the time-invariant survival coefficients. `maxit` Integer: Maximum number of EM iterations to carry out before exiting the algorithm. Defaults to `maxit=200L`, which is usually sufficient. `correlated` Logical: Should covariance parameters between responses be estimated and used in determination of model convergence? Default is `correlated=TRUE`. A choice of `correlated=FALSE` is equivalent to imposing the belief that deviations in longitudinal trajectories are not correlated across responses, but can decrease computation time, particularly for large `K`. `gh.nodes` Integer: Number of weights and abscissae to use in gauss-hermite quadrature. Defaults to `gh.nodes=3`, which is usually sufficient. `gh.sigma` Numeric: Standard deviation for gauss-hermite approximation of normal distribution. Defaults to `gh.sigma=1`. This should rarely (if ever) need altering. `return.dmats` Logical: Should data matrices be returned? Defaults to `return.dmats=TRUE`. Note that some S3 methods for `joint.object`s require the returned object to include these data matrices. `return.inits` Logical: Should a list of inital conditons be returned? Defaults to `return.inits=FALSE`. `center.ph` Logical: Should the survival covariates be mean-centered? Defaults to `center.ph=TRUE`. `post.process` Logical: Should model post-processing be carried out (assumes that the model has converged). Defaults to `post.process = TRUE` which then returns posterior modes and their variance for the random effects, as well as approximated standard error. This is largely for internal use (i.e. if bootstrapping to obtain SEs instead).

Details

Function joint fits a joint model to time-to-event data and multivariate longitudinal data. The longitudinal data can be specified by numerous models encompassing a fairly wide range of data. This joint model fit is achieved by the use of an approximate EM algorithm first proposed in Bernhardt et al. (2015), and later used in the 'classic' multivariate joint model in Murray and Philipson (2022). Each longitudinal response is modelled by

h_k(E[Y_{ik}|b_{ik};\Omega]) = X_{ik}\beta_k + Z_{ik}b_{ik}

where h_k is a known, monotonic link function. An association is induced between the Kth response and the hazard \lambda_i(t) by:

\lambda_i(t)=\lambda_0(t)\exp\{S_i^T\zeta + \sum_{k=1}^K\gamma_kW_k(t)^Tb_{ik}\}

where \gamma_k is the association parameter and W_k(t) is the vector function of time imposed on the Kth random effects structure (i.e. intercept-and-slope; spline).

Value

An object with class joint. See joint.object for information.

Family specification

Currently, five families are available for implementation, spanning continuous, binary and count data types:

'gaussian': Normally distributed. The identity link is used. A term \sigma_k will be estimated, denoting the variance of this response
'binomial': For binary data types, a logit link is used.
'poisson': For count data types where dispersion is either non-consequential or ignored. A log link is used.
'genpois': For count data types where dispersion is at least of some secondary interest. A log link is used. A term \sigma_k is estimated, denoting the dispersion, \varphi of the response. This follows interpretation of Zamani & Ismail (2012): \varphi>0: Over-dispersion; \varphi<0: Under-dispersion. Var[Y]=(1+\varphi)^2\mu.
'Gamma': For continuous data where a Gamma distribution might be sensible. The log link is used. A term \sigma_k is be estimated, denoting the (log) shape of the distribution, which is then reported as \varphi_k=\exp\{\sigma_k\}.
"negbin": For count data types where overdispersion is modelled. A log link is used. A term \sigma_k is estimated, which is then reported as \varphi_k=\exp\{\sigma_k\} which is the overdispersion. The variance of the response is Var[Y]=\mu+\mu^2/\varphi.

For families "negbin", "Gamma", "genpois", the user can define the dispersion model desired in disp.formulas. For the "negbin" and "Gamma" cases, we define \varphi_i=\exp\{W_i\sigma_i\} (i.e. the exponent of the linear predictor of the dispersion model; and for "genpois" the identity of the linear is used.

Dispersion models

The disp.formulas in the function call allows the user to model the dispersion for a given sub-model if wanted. The default value disp.formulas = NULL simply imposes an 'intercept only' model. If the kth item in disp.formulas corresponds to a longitudinal sub-model with no dispersion term, then it is simply ignored. With this in mind then, if a dispersion model is only required for, say, one sub-model, then the corresponding item in this list of models should be specified as such, with the others set to ~1.

Standard error estimation

We follow the approximation of the observed empirical information matrix detailed by Mclachlan and Krishnan (2008), and later used in joineRML (Hickey et al., 2018). These are only calculated if post.process=TRUE. Generally, these SEs are well-behaved, but their reliability will depend on multiple factors: Sample size; number of events; collinearity of REs of responses; number of observed times, and so on. Some more discussion/ references are given in vcov.joint.

Convergence of the algorithm

A few convergence criteria (specified by control$conv) are available:

abs: Convergence reached when maximum absolute change in parameter estimates is <tol.abs.
rel: Convergence reached when maximum absolute relative change in parameter estimates is <tol.rel. A small amount (tol.den) is added to the denominator to eschew numerical issues if parameters are nearly zero.
either: Convergence is reached when either abs or rel are met.
sas: Assess convergence for parameters |\Omega_a|<tol.thr by the abs criterion, else rel. This is the default.

Note that the baseline hazard is updated at each EM iteration, but is not monitored for convergence.

Author(s)

James Murray (j.murray7@ncl.ac.uk).

References

Bernhardt PW, Zhang D and Wang HJ. A fast EM Algorithm for Fitting Joint Models of a Binary Response to Multiple Longitudinal Covariates Subject to Detection Limits. Computational Statistics and Data Analysis 2015; 85; 37–53

Hickey GL, Philipson P, Jorgensen A, Kolamunnage-Dona R. joineRML: a joint model and software package for time-to-event and multivariate longitudinal outcomes. BMC Med. Res. Methodol. 2018; 50

McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. Second Edition. Wiley-Interscience; 2008.

Murray, J and Philipson P. A fast approximate EM algorithm for joint models of survival and multivariate longitudinal data.Computational Statistics and Data Analysis 2022; 170; 107438

Zamani H and Ismail N. Functional Form for the Generalized Poisson Regression Model, Communications in Statistics - Theory and Methods 2012; 41(20); 3666-3675.

Examples


# 1) Fit on simulated bivariate data, (1x gaussian, 1x poisson) --------
beta <- do.call(rbind, replicate(2, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(0.3, -0.3)
D <- diag(c(0.25, 0.09, 0.25, 0.05))
family <- list('gaussian', 'poisson')
data <- simData(ntms = 10, beta = beta, D = D, n = 100,
                family = family, zeta = c(0, -0.2),
                sigma = list(0.16, 0), gamma = gamma)$data

# Specify formulae and target families
long.formulas <- list(
  Y.1 ~ time + cont + bin + (1 + time|id),  # Gaussian
  Y.2 ~ time + cont + bin + (1 + time|id)   # Poisson
)
surv.formula <- Surv(survtime, status) ~ bin

fit <- joint(long.formulas, surv.formula, data, family)


# 2) Fit on PBC data -----------------------------------------------------
data(PBC)
# Subset data and remove NAs
PBC <- subset(PBC, select = c('id', 'survtime', 'status', 'drug', 'time',
                              'serBilir', 'albumin', 'spiders', 'platelets'))
PBC <- na.omit(PBC) 

# Specify GLMM sub-models, including interaction and quadratic time terms
long.formulas <- list(
  log(serBilir) ~ drug * (time + I(time^2)) + (1 + time + I(time^2)|id),
  albumin ~ drug * time + (1 + time|id),
  platelets ~ drug * time + (1 + time|id),
  spiders ~ drug * time + (1|id)
)
surv.formula <- Surv(survtime, status) ~ drug

fit <-  joint(long.formulas, surv.formula, PBC, 
              family = list("gaussian", "gaussian", "poisson", "binomial"),
              control = list(verbose = TRUE))
fit


# 3) Fit with dispersion models ----------------------------------------
beta <- do.call(rbind, replicate(2, c(2, -0.1, 0.1, -0.2), simplify = FALSE))
gamma <- c(0.3, -0.3)
D <- diag(c(0.25, 0.09, 0.25, 0.05))
family <- list('negbin', 'poisson')   # As an example; only requires one dispersion model.
sigma <- list(c(1, 0.2), 0)           # Need to specify the model in simData call too.
disp.formulas = list(~time, ~1)       # Even though poisson doesn't model dispersion, need to
                                      # populate this element in disp.formulas!
# Simulate some data
data <- simData(ntms = 10, beta = beta, D = D, n = 500,
                family = family, zeta = c(0, -0.2), sigma = sigma,
                disp.formulas = disp.formulas, gamma = gamma)$data

# Now fit using joint
long.formulas <- list(
  Y.1 ~ time + cont + bin + (1+time|id),
  Y.2 ~ time + cont + bin + (1+time|id)
)
fit <- joint(
  long.formulas, Surv(survtime, status) ~ bin,
  data, family, disp.formulas = disp.formulas
)
fit
summary(fit)

[Package gmvjoint version 0.4.0 Index]

Fit a joint model to time-to-event and multivariate longitudinal data

Description

Usage

Arguments

Details

Value

Family specification

Dispersion models

Standard error estimation

Convergence of the algorithm

Author(s)

References

See Also

Examples