R: Population-averaged accelerated failure time model for...

bayesBisurvreg {bayesSurv}

R Documentation

Population-averaged accelerated failure time model for bivariate, possibly doubly-interval-censored data. The error distribution is expressed as a penalized bivariate normal mixture with high number of components (bivariate G-spline).

Description

A function to estimate a regression model with bivariate (possibly right-, left-, interval- or doubly-interval-censored) data. In the case of doubly interval censoring, different regression models can be specified for the onset and event times.

The error density of the regression model is specified as a mixture of Bayesian G-splines (normal densities with equidistant means and constant variance matrices). This function performs an MCMC sampling from the posterior distribution of unknown quantities.

For details, see Komárek (2006) and Komárek and Lesaffre (2006).

We explain first in more detail a model without doubly censoring. Let T_{i,l},\; i=1,\dots, N,\; l=1, 2 be event times for ith cluster and the first and the second unit. The following regression model is assumed:

\log(T_{i,l}) = \beta'x_{i,l} + \varepsilon_{i,l},\quad i=1,\dots,N,\;l=1,2

where \beta is unknown regression parameter vector and x_{i,l} is a vector of covariates. The bivariate error terms \varepsilon_i=(\varepsilon_{i,1},\,\varepsilon_{i,2})',\;i=1,\dots,N are assumed to be i.i.d. with a bivariate density g_{\varepsilon}(e_1,\,e_2). This density is expressed as a mixture of Bayesian G-splines (normal densities with equidistant means and constant variance matrices). We distinguish two, theoretically equivalent, specifications.

Specification 1

(\varepsilon_1,\,\varepsilon_2)' \sim \sum_{j_1=-K_1}^{K_1}\sum_{j_2=-K_2}^{K_2} w_{j_1,j_2} N_2(\mu_{(j_1,j_2)},\,\mbox{diag}(\sigma_1^2,\,\sigma_2^2))

where \sigma_1^2,\,\sigma_2^2 are unknown basis variances and \mu_{(j_1,j_2)} = (\mu_{1,j_1},\,\mu_{2,j_2})' is an equidistant grid of knots symmetric around the unknown point (\gamma_1,\,\gamma_2)' and related to the unknown basis variances through the relationship

\mu_{1,j_1} = \gamma_1 + j_1\delta_1\sigma_1,\quad j_1=-K_1,\dots,K_1,

\mu_{2,j_2} = \gamma_2 + j_2\delta_2\sigma_2,\quad j_2=-K_2,\dots,K_2,

where \delta_1,\,\delta_2 are fixed constants, e.g. \delta_1=\delta_2=2/3 (which has a justification of being close to cubic B-splines).

Specification 2

(\varepsilon_1,\,\varepsilon_2)' \sim (\alpha_1,\,\alpha_2)'+ \bold{S}\,(V_1,\,V_2)'

where (\alpha_1,\,\alpha_2)' is an unknown intercept term and \bold{S} \mbox{ is a diagonal matrix with } \tau_1 \mbox{ and }\tau_2 \mbox{ on a diagonal,} i.e. \tau_1,\,\tau_2 are unknown scale parameters. (V_1,\,V_2)') is then standardized bivariate error term which is distributed according to the bivariate normal mixture, i.e.

(V_1,\,V_2)'\sim \sum_{j_1=-K_1}^{K_1}\sum_{j_2=-K_2}^{K_2} w_{j_1,j_2} N_2(\mu_{(j_1,j_2)},\,\mbox{diag}(\sigma_1^2, \sigma_2^2))

where \mu_{(j_1,j_2)} = (\mu_{1,j_1},\,\mu_{2,j_2})' is an equidistant grid of fixed knots (means), usually symmetric about the fixed point (\gamma_1,\,\gamma_2)'=(0, 0)' and \sigma_1^2,\,\sigma_2^2 are fixed basis variances. Reasonable values for the numbers of grid points K_1 and K_2 are K_1=K_2=15 with the distance between the two knots equal to \delta=0.3 and for the basis variances \sigma_1^2\sigma_2^2=0.2^2.

Personally, I found Specification 2 performing better. In the paper Komárek and Lesaffre (2006) only Specification 2 is described.

The mixture weights w_{j_1,j_2},\;j_1=-K_1,\dots, K_1,\;j_2=-K_2,\dots, K_2 are not estimated directly. To avoid the constraints 0 < w_{j_1,j_2} < 1 and \sum_{j_1=-K_1}^{K_1}\sum_{j_2=-K_2}^{K_2}w_{j_1,j_2} = 1 transformed weights a_{j_1,j_2},\;j_1=-K_1,\dots, K_1,\;j_2=-K_2,\dots, K_2 related to the original weights by the logistic transformation:

a_{j_1,j_2} = \frac{\exp(w_{j_1,j_2})}{\sum_{m_1}\sum_{m_2}\exp(w_{m_1,m_2})}

are estimated instead.

A Bayesian model is set up for all unknown parameters. For more details I refer to Komárek and Lesaffre (2006) and to Komárek (2006).

If there are doubly-censored data the model of the same type as above can be specified for both the onset time and the time-to-event.

Usage

bayesBisurvreg(formula, formula2, data = parent.frame(),
   na.action = na.fail, onlyX = FALSE,
   nsimul = list(niter = 10, nthin = 1, nburn = 0, nwrite = 10),
   prior, prior.beta, init = list(iter = 0),
   mcmc.par = list(type.update.a = "slice", k.overrelax.a = 1,
                   k.overrelax.sigma = 1, k.overrelax.scale = 1),
   prior2, prior.beta2, init2,
   mcmc.par2 = list(type.update.a = "slice", k.overrelax.a = 1,
                    k.overrelax.sigma = 1, k.overrelax.scale = 1),
   store = list(a = FALSE, a2 = FALSE, y = FALSE, y2 = FALSE,
                r = FALSE, r2 = FALSE),
   dir)

Arguments

`formula`	model formula for the regression. In the case of doubly-censored data, this is the model formula for the onset time. Data are assumed to be sorted according to subjects and within subjects according to the types of the events that determine the bivariate survival distribution, i.e. the response vector must be `t_{1,1}, t_{1,2}, t_{2,1}, t_{2,2}, t_{3,1}, t_{3,2}, \dots, t_{n,1}, t_{n,2}`. The rows of the design matrix with covariates must be sorted analogically. The left-hand side of the formula must be an object created using `Surv`.
`formula2`	model formula for the regression of the time-to-event in the case of doubly-censored data. Ignored otherwise. The same remark as for `formula` concerning the sort order applies here.
`data`	optional data frame in which to interpret the variables occuring in the formulas.
`na.action`	the user is discouraged from changing the default value `na.fail`.
`onlyX`	if `TRUE` no MCMC sampling is performed and only the design matrix (matrices) are returned. This can be useful to set up correctly priors for regression parameters in the presence of `factor` covariates.
`nsimul`	a list giving the number of iterations of the MCMC and other parameters of the simulation. niter total number of sampled values after discarding thinned ones, burn-up included; nthin thinning interval; nburn number of sampled values in a burn-up period after discarding thinned values. This value should be smaller than `niter`. If not, `nburn` is set to `niter - 1`. It can be set to zero; nwrite an interval at which information about the number of performed iterations is print on the screen and during the burn-up period an interval with which the sampled values are writen to files;
`prior`	a list specifying the prior distribution of the G-spline defining the distribution of the error term in the regression model given by `formula`. See `prior` argument of `bayesHistogram` function for more detail. In this list also ‘Specification’ as described above is specified.
`prior.beta`	prior specification for the regression parameters, in the case of doubly censored data for the regression parameters of the onset time. I.e. it is related to `formula`. This should be a list with the following components: mean.prior a vector specifying a prior mean for each `beta` parameter in the model. var.prior a vector specifying a prior variance for each `beta` parameter. It is recommended to run the function bayesBisurvreg first with its argument `onlyX` set to `TRUE` to find out how the betas are sorted. They must correspond to a design matrix X taken from `formula`.
`init`	an optional list with initial values for the MCMC related to the model given by `formula`. The list can have the following components: iter the number of the iteration to which the initial values correspond, usually zero. beta a vector of initial values for the regression parameters. It must be sorted in the same way as are the columns in the design matrix. Use `onlyX=TRUE` if you do not know how the columns in the design matrix are created. a a matrix of size `(2K_1+1)\times(2K_2+1)` with the initial values of transformed mixture weights. lambda initial values for the Markov random fields precision parameters. According to the chosen prior for the transformed mixture weights, this is either a number or a vector of length 2. gamma a vector of length 2 of initial values for the middle knots `\gamma_1, \gamma_2` in each dimension. If ‘Specification’ is 2, this value will not be changed by the MCMC and it is recommended (for easier interpretation of the results) to set `init$gamma` to zero for all dimensions (default behavior). If ‘Specification’ is 1 `init$gamma` should be approximately equal to the mean value of the residuals in each margin. sigma a vector of length 2 of initial values of the basis standard deviations `\sigma_1, \sigma_2`. If ‘Specification’ is 2 this value will not be changed by the MCMC and it is recommended to set it approximately equal to the range of standardized data (let say 4 + 4) divided by the number of knots in each margin and multiplied by something like 2/3. If ‘Specification’ is 1 this should be approximately equal to the range of the residuals divided by the number of knots in each margin and multiplied again by something like 2/3. intercept a vector of length 2 of initial values of the intercept terms `\alpha_1, \alpha_2`. If ‘Specification’ is 1 this value is not changed by the MCMC and the initial value is always changed to zero for both dimensions. scale a vector of length 2 of initial values of the scale parameters `\tau_1, \tau_2`. If ‘Specification’ is 1 this value is not changed by the MCMC and the initial value is always changed to one for both dimensions. y a matrix with 2 columns and `N` rows with initial values of log-event-times for each cluster in rows. r a matrix with 2 columns and `N` rows with initial component labels for each bivariate residual in rows. All values in the first column must be between `-K_1` and `K_1` and all values in the second column must be between `-K_2` and `K_2`. See argument `init` of the function `bayesHistogram` for more details.
`mcmc.par`	a list specifying how some of the G-spline parameters related to `formula` are to be updated. The list can have the following components (all of them have their default values): type.update.a G-spline transformed weights `a` can be updated using one of the following algorithms: slice slice sampler of Neal (2003) ars.quantile adaptive rejection sampling of Gilks and Wild (1992) with starting abscissae being quantiles of the envelop at the previous iteration ars.mode adaptive rejection sampling of Gilks and Wild (1992) with starting abscissae being the mode plus/minus 3 times estimated standard deviation of the full conditional distribution Default is `slice`. k.overrelax.a if `type.update.a == "slice"` some updates are overrelaxed. Then every `k.overrelax.a`th iteration is not overrelaxed. Default is `k.overrelax.a = 1`, i.e. no overrelaxation k.overrelax.sigma G-spline basis standard deviations are updated using the slice sampler of Neal (2003). At the same time, overrelaxation can be used. Then every k.overrelax.sigma th update is not overrelaxed. Default is `k.overrelax.sigma = 1`, i.e. no overrelaxation k.overrelax.scale G-spline scales are updated using the slice sampler of Neal (2003). At the same time, overrelaxation can be used. Then every k.overrelax.scale th update is not overrelaxed. Default is `k.overrelax.scale = 1`, i.e. no overrelaxation
`prior2`	a list specifying the prior distribution of the G-spline defining the distribution of the error term in the regression model given by `formula2`. See `prior` argument of `bayesHistogram` function for more detail.
`prior.beta2`	prior specification for the regression parameters of time-to-event in the case of doubly censored data (related to `formula2`). This should be a list with the same structure as `prior.beta`.
`init2`	an optional list with initial values for the MCMC related to the model given by `formula2`. The list has the same structure as `init`.
`mcmc.par2`	a list specifying how some of the G-spline parameters related to `formula2` are to be updated. The list has the same structure as `mcmc.par`.
`store`	a list of logical values specifying which chains that are not stored by default are to be stored. The list can have the following components. a if `TRUE` then all the transformed mixture weights `a_{k_1,\,k_2},` `k_1=-K_1,\dots,K_1,` `k_2=-K_2,\dots,K_2,` related to the G-spline of `formula` are stored. a2 if `TRUE` and there are doubly-censored data then all the transformed mixture weights `a_{k_1,\,k_2},` `k_1=-K_1,\dots,K_1,` `k_2=-K_2,\dots,K_2,` related to the G-spline of `formula2` are stored. y if `TRUE` then augmented log-event times for all observations related to the `formula` are stored. y2 if `TRUE` then augmented log-event times for all observations related to `formula2` are stored. r if `TRUE` then labels of mixture components for residuals related to `formula` are stored. r2 if `TRUE` then labels of mixture components for residuals related to `formula2` are stored.
`dir`	a string that specifies a directory where all sampled values are to be stored.

Value

A list of class bayesBisurvreg containing an information concerning the initial values and prior choices.

Files created

Additionally, the following files with sampled values are stored in a directory specified by dir argument of this function (some of them are created only on request, see store parameter of this function).

Headers are written to all files created by default and to files asked by the user via the argument store. During the burn-in, only every nsimul$nwrite value is written. After the burn-in, all sampled values are written in files created by default and to files asked by the user via the argument store. In the files for which the corresponding store component is FALSE, every nsimul$nwrite value is written during the whole MCMC (this might be useful to restart the MCMC from some specific point).

The following files are created:

iteration.sim

one column labeled iteration with indeces of MCMC iterations to which the stored sampled values correspond.

mixmoment.sim

columns labeled k, Mean.1, Mean.2, D.1.1, D.2.1, D.2.2, where

k = number of mixture components that had probability numerically higher than zero;

Mean.1 = \mbox{E}(\varepsilon_{i,1});

Mean.2 = \mbox{E}(\varepsilon_{i,2});

D.1.1 = \mbox{var}(\varepsilon_{i,1});

D.2.1 = \mbox{cov}(\varepsilon_{i,1},\,\varepsilon_{i,2});

D.2.2 = \mbox{var}(\varepsilon_{i,2});

all related to the distribution of the error term from the model given by formula.

mixmoment_2.sim

in the case of doubly-censored data, the same structure as mixmoment.sim, however related to the model given by formula2.

mweight.sim

sampled mixture weights w_{k_1,\,k_2} of mixture components that had probabilities numerically higher than zero. Related to the model given by formula.

mweight_2.sim

in the case of doubly-censored data, the same structure as mweight.sim, however related to the model given by formula2.

mmean.sim

indeces k_1,\;k_2, k_1 \in\{-K_1, \dots, K_1\}, k_2 \in\{-K_2, \dots, K_2\} of mixture components that had probabilities numerically higher than zero. It corresponds to the weights in mweight.sim. Related to the model given by formula.

mmean_2.sim

in the case of doubly-censored data, the same structure as mmean.sim, however related to the model given by formula2.

gspline.sim

characteristics of the sampled G-spline (distribution of (\varepsilon_{i,1},\,\varepsilon_{i,2})') related to the model given by formula. This file together with mixmoment.sim, mweight.sim and mmean.sim can be used to reconstruct the G-spline in each MCMC iteration.

The file has columns labeled gamma1, gamma2, sigma1, sigma2, delta1, delta2, intercept1, intercept2, scale1, scale2. The meaning of the values in these columns is the following:

gamma1 = the middle knot \gamma_1 in the first dimension. If ‘Specification’ is 2, this column usually contains zeros;

gamma2 = the middle knot \gamma_2 in the second dimension. If ‘Specification’ is 2, this column usually contains zeros;

sigma1 = basis standard deviation \sigma_1 of the G-spline in the first dimension. This column contains a fixed value if ‘Specification’ is 2;

sigma2 = basis standard deviation \sigma_2 of the G-spline in the second dimension. This column contains a fixed value if ‘Specification’ is 2;

delta1 = distance delta_1 between the two knots of the G-spline in the first dimension. This column contains a fixed value if ‘Specification’ is 2;

delta2 = distance \delta_2 between the two knots of the G-spline in the second dimension. This column contains a fixed value if ‘Specification’ is 2;

intercept1 = the intercept term \alpha_1 of the G-spline in the first dimension. If ‘Specification’ is 1, this column usually contains zeros;

intercept2 = the intercept term \alpha_2 of the G-spline in the second dimension. If ‘Specification’ is 1, this column usually contains zeros;

scale1 = the scale parameter \tau_1 of the G-spline in the first dimension. If ‘Specification’ is 1, this column usually contains ones;

scale2 = the scale parameter \tau_2 of the G-spline in the second dimension. ‘Specification’ is 1, this column usually contains ones.

gspline_2.sim

in the case of doubly-censored data, the same structure as gspline.sim, however related to the model given by formula2.

mlogweight.sim

fully created only if store$a = TRUE. The file contains the transformed weights a_{k_1,\,k_2}, k_1=-K_1,\dots,K_1, k_2=-K_2,\dots,K_2 of all mixture components, i.e. also of components that had numerically zero probabilities. This file is related to the model given by formula.

mlogweight_2.sim

fully created only if store$a2 = TRUE and in the case of doubly-censored data, the same structure as mlogweight.sim, however related to the model given by formula2.

r.sim

fully created only if store$r = TRUE. The file contains the labels of the mixture components into which the residuals are intrinsically assigned. Instead of double indeces (k_1,\,k_2), values from 1 to (2\,K_1+1)\times (2\,K_2+1) are stored here. Function vecr2matr can be used to transform it back to double indeces.

r_2.sim

fully created only if store$r2 = TRUE and in the case of doubly-censored data, the same structure as r.sim, however related to the model given by formula2.

lambda.sim

either one column labeled lambda or two columns labeled lambda1 and lambda2. These are the values of the smoothing parameter(s) \lambda (hyperparameters of the prior distribution of the transformed mixture weights a_{k_1,\,k_2}). This file is related to the model given by formula.

lambda_2.sim

in the case of doubly-censored data, the same structure as lambda.sim, however related to the model given by formula2.

beta.sim

sampled values of the regression parameters \beta related to the model given by formula. The columns are labeled according to the colnames of the design matrix.

beta_2.sim

in the case of doubly-censored data, the same structure as beta.sim, however related to the model given by formula2.

Y.sim

fully created only if store$y = TRUE. It contains sampled (augmented) log-event times for all observations in the data set.

Y_2.sim

fully created only if store$y2 = TRUE and in the case of doubly-censored data, the same structure as Y.sim, however related to the model given by formula2.

logposter.sim

columns labeled loglik, penalty or penalty1 and penalty2, logprw. This file is related to the model given by formula. The columns have the following meaning.

loglik = % -N\Bigl\{\log(2\pi) + \log(\sigma_1) + \log(\sigma_2)\Bigr\}- 0.5\sum_{i=1}^N\Bigl\{ (\sigma_1^2\,\tau_1^2)^{-1}\; (y_{i,1} - x_{i,1}'\beta - \alpha_1 - \tau_1\mu_{1,\,r_{i,1}})^2 + (\sigma_2^2\,\tau_2^2)^{-1}\; (y_{i,2} - x_{i,2}'\beta - \alpha_2 - \tau_2\mu_{2,\,r_{i,2}})^2 \Bigr\}

where y_{i,l} denotes (augmented) (i,l)th true log-event time. In other words, loglik is equal to the conditional log-density \sum_{i=1}^N\,\log\Bigl\{p\bigl((y_{i,1},\,y_{i,2})\;\big|\;r_{i},\,\beta,\,\mbox{G-spline}\bigr)\Bigr\};

penalty1: If prior$neighbor.system = "uniCAR": the penalty term for the first dimension not multiplied by lambda1;

penalty2: If prior$neighbor.system = "uniCAR": the penalty term for the second dimension not multiplied by lambda2;

penalty: If prior$neighbor.system is different from "uniCAR": the penalty term not multiplied by lambda;

logprw = -2\,N\,\log\bigl\{\sum_{k_1}\sum_{k_2}a_{k_1,\,k_2}\bigr\} + \sum_{k_1}\sum_{k_2}N_{k_1,\,k_2}\,a_{k_1,\,k_2}, where N_{k_1,\,k_2} is the number of residuals assigned intrinsincally to the (k_1,\,k_2)th mixture component.

In other words, logprw is equal to the conditional log-density \sum_{i=1}^N \log\bigl\{p(r_i\;|\;\mbox{G-spline weights})\bigr\}.

logposter_2.sim

in the case of doubly-censored data, the same structure as lambda.sim, however related to the model given by formula2.

Author(s)

Arnošt Komárek arnost.komarek@mff.cuni.cz

References

Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41, 337 - 348.

Komárek, A. (2006). Accelerated Failure Time Models for Multivariate Interval-Censored Data with Flexible Distributional Assumptions. PhD. Thesis, Katholieke Universiteit Leuven, Faculteit Wetenschappen.

Komárek, A. and Lesaffre, E. (2006). Bayesian semi-parametric accelerated failure time model for paired doubly interval-censored data. Statistical Modelling, 6, 3 - 22.

Neal, R. M. (2003). Slice sampling (with Discussion). The Annals of Statistics, 31, 705 - 767.

Examples

## See the description of R commands for
## the population averaged AFT model
## with the Signal Tandmobiel data,
## analysis described in Komarek and Lesaffre (2006),
##
## R commands available in the documentation
## directory of this package as
## - see ex-tandmobPA.R and
##   https://www2.karlin.mff.cuni.cz/ komarek/software/bayesSurv/ex-tandmobPA.pdf
##

[Package bayesSurv version 3.7 Index]