coxed-package {coxed} | R Documentation |
Duration-Based Quantities of Interest and Simulation Methods for the Cox Proportional Hazards Model
Description
The Cox proportional hazards model (implemented in R with coxph
in the survival
package or with cph
in the rms
package) is one of the most frequently used estimators
in duration (survival) analysis. Because it is estimated using only the observed durations' rank ordering, typical
quantities of interest used to communicate results of the Cox model come from the hazard function (e.g., hazard ratios
or percentage changes in the hazard rate). These quantities are substantively vague and difficult for many audiences
of research to understand. The coxed
package introduces a suite of methods to address these problems.
The package allows researchers to calculate duration-based quantities from Cox model results, such as the expected
duration (or survival time) given covariate values and marginal changes in duration for a specified change in a
covariate. These duration-based quantities often match better with researchers' substantive interests and are
easily understood by most readers.
In addition, no standard method exists for simulating durations directly from the Cox model's data generating
process because it does not assume a distributional form for the baseline hazard function. The coxed
package
also contains functions to simulate general duration data that does not rely on an assumption of any particular parametric
hazard function.
Duration-Based Quantities of Interest for the Cox Model
The coxed
function generates expected durations for individual
observations and/or marginal changes in expected duration given a change in a covariate
from the Cox proportional hazards model. Specifically, the methods can compute (1) the
expected duration for each observation used to fit the Cox model, given the covariates,
(2) the expected duration for a "new" observation with a covariate profile set by the
analyst, or (3) the first difference, or change, in expected duration given two new data frames.
There are two different methods of generating duration-based quantities in the package.
coxed
with type="npsf"
calculates expected durations by using the method proposed by
Cox and Oakes (1984, 107-109) for estimating the cumulative baseline hazard function. This
method is nonparametric and results in a step-function representation of the cumulative
baseline hazard. Cox and Oakes (1984, 108) show that the cumulative baseline hazard function can be estimated
after fitting a Cox model by
\hat{H}_0(t) = \sum_{\tau_j < t}\frac{d_j}{\sum_{l\in\Re(\tau_j)}\hat{\psi}(l)},
where \tau_j
represents time points earlier than t
, d_j
is a count of the
total number of failures at \tau_j
, \Re(\tau_j)
is the remaining risk set at \tau_j
,
and \hat{\psi}(l)
represents the ELP from the Cox model for observations still in the
risk set at \tau_j
. This equation is used calculate the cumulative baseline hazard at
all time points in the range of observed durations. This estimate is a stepwise function
because time points with no failures do not contribute to the cumulative hazard, so the function
is flat until the next time point with observed failures.
We extend this method to obtain expected durations by first calculating the baseline survivor function from the cumulative hazard function, using
\hat{S}_0(t) = \exp[-\hat{H}_0(t)].
Each observation's survivor function is related to the baseline survivor function by
\hat{S}_i(t) = \hat{S}_0(t)^{\hat{\psi}(i)},
where \hat{\psi}(i)
is the exponentiated linear predictor (ELP) for observation i
.
These survivor functions can be used directly to calculate expected durations for each
observation. The expected value of a non-negative random variable can be calculated by
E(X) = \int_0^{\infty} \bigg(1 - F(t)\bigg)dt,
where F(.)
is the cumulative distribution function for X
. In the case of a
duration variable t_i
, the expected duration is
E(t_i) = \int_0^T S_i(t)\,dt,
where T
is the largest possible duration and S(t)
is the individual's survivor
function. We approximate this integral with a right Riemann-sum by calculating the survivor
functions at every discrete time point from the minimum to the maximum observed durations,
and multiplying these values by the length of the interval between time points with observed failures:
E(t_i) \approx \sum_{t_j \in [0,T]} (t_j - t_{j-1})S_i(t_j).
coxed
with type="gam"
employs a generalized additive model (GAM) to map the model's estimated linear
predictor values to duration times and proceeds according to five steps. First, it uses coefficient
estimates from the Cox model, so researchers must first estimate the model just as they always have.
Then the method computes expected values of risk for each observation by matrix-multiplying the
covariates by the estimated coefficients from the model, then exponentiating the result. This creates
the exponentiated linear predictor (ELP). Then the observations are ranked from smallest to largest
according to their values of the ELP. This ranking is interpreted as the expected order of failure;
the larger the value of the ELP, the sooner the model expects that observation to fail, relative to
the other observations. The next step is to connect the model's expected risk for each observation (ELP) to duration time
(the observed durations). A gam
fits a model to data by using a series of locally-estimated polynomial
splines set by the user (see, for example, Wood, Pya, and Saefken 2016). It is a flexible means of allowing for
the possibility of nonlinear relationships between variables. coxed
with type="gam"
uses a GAM to model the observed
durations as a function of the linear predictor ranks generated in the previous step. More specifically, the method
utilizes a cubic regression spline to draw a smoothed line summarizing the bivariate relationship between
the observed durations and the ranks. The GAM fit can be used directly to compute expected durations, given the covariates, for each observation
in the data.
See Kropko and Harden (2018) for further details about generating expected durations and marginal changes in expected
duration from the Cox model. The coxed
function can also generate these quantities from data with time-varying
covariates (see coxed.npsf.tvc
and coxed.gam.tvc
).
Simulating duration data for the Cox model
The sim.survdata
function generates simulated duration data. It can accept a user-supplied
hazard function, or else it uses the flexible-hazard method described in Harden and Kropko (2018) to generate
a hazard that does not necessarily conform to any parametric hazard function. It can generate data with time-varying
covariates or coefficients. For time-varying covariates type="tvc"
it employs the permutational algorithm by Sylvestre and Abrahamowicz (2008).
For time-varying coefficients with type="tvbeta"
, the first beta coefficient that is either supplied by the user or generated by
the function is multiplied by the natural log of the failure time under consideration.
The flexible-hazard method employed when hazard.fun
is NULL
generates a unique baseline hazard by fitting a curve to
randomly-drawn points. This produces a wide variety
of shapes for the baseline hazard, including those that are unimodal, multimodal, monotonically increasing or decreasing, and many other
shapes. The method then generates a density function based on each baseline hazard and draws durations from it in a way that circumvents
the need to calculate the inverse cumulative baseline hazard. Because the shape of the baseline hazard can vary considerably, this approach
matches the Cox model’s inherent flexibility and better corresponds to the assumed data generating process (DGP) of the Cox model. Moreover,
repeating this process over many iterations in a simulation produces simulated samples of data that better reflect the considerable
heterogeneity in data used by applied researchers. This increases the generalizability of the simulation results. See Harden and Kropko (2018)
for more detail.
When generating a marginal effect, first the user specifies a covariate by typing its column number in the X
matrix into the covariate
argument, then specifies the high and low values at which to fix this covariate. The function calculates the differences in expected duration for each
observation when fixing the covariate to the high and low values. If compare
is median
, the function reports the median of these differences,
and if compare
is mean
, the function reports the median of these differences, but any function may be employed that takes a vector as input and
outputs a scalar.
If censor.cond
is FALSE
then a proportion of the observations specified by censor
is randomly and uniformly selected to be right-censored.
If censor.cond
is TRUE
then censoring depends on the covariates as follows: new coefficients are drawn from normal distributions with mean 0 and
standard deviation of 0.1, and these new coefficients are used to create a new linear predictor using the X
matrix. The observations with the largest
(100 x censor
) percent of the linear predictors are designated as right-censored.
Finally, link[coxed]{survsim.plot}
is useful for visualizing the baseline functions, including hazard, that result from
link[coxed]{sim.survdata}
for a particular draw of simulated duration data. The function uses ggplot
to create line plots for the baseline failure PDF, the baseline failure CDF, the baseline survivor function,
and the baseline hazard function over time. The baseline functions and time are attributes of the
sim.survdata
output.
Author(s)
Jonathan Kropko <jkropko@virginia.edu> and Jeffrey J. Harden <jharden2@nd.edu>
References
Harden, J. J. and Kropko, J. (2018) Simulating Duration Data for the Cox Model. Political Science Research and Methods https://doi.org/10.1017/psrm.2018.19
Kropko, J. and Harden, J. J. (2018) Beyond the Hazard Ratio: Generating Expected Durations from the Cox Proportional Hazards Model. British Journal of Political Science https://doi.org/10.1017/S000712341700045X
Box-Steffensmeier, J. M. (1996) A Dynamic Analysis of The Role of War Chests in Campaign Strategy. American Journal of Political Science 40 352-371
Hyman, J. M. (1983) Accurate monotonicity preserving cubic interpolation. SIAM J. Sci. Stat. Comput. 4, 645–654. https://doi.org/10.1137/0904045
Martin, L. W and Vanberg, G. (2003) Wasting Time? The Impact of Ideology and Size on Delay in Coalition Formation. British Journal of Political Science 33 323-344 https://doi.org/10.1017/S0007123403000140
Sylvestre M.-P., Abrahamowicz M. (2008) Comparison of algorithms to generate event times conditional on time-dependent covariates. Statistics in Medicine 27(14):2618–34. https://doi.org/10.1002/sim.3092
Wood, S.N., N. Pya and B. Saefken (2016) Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association 111, 1548-1575 http://dx.doi.org/10.1080/01621459.2016.1180986
Examples
## See the examples for coxed, sim.survdata, and survsim.plot