R: Data Long Competing Risks Transformation

dataLongCompRisks {discSurv}

R Documentation

Data Long Competing Risks Transformation

Description

Transforms short data format to long format for discrete survival modelling in the case of competing risks with right censoring. It is assumed that the covariates are not time varying.

Usage

dataLongCompRisks(
  dataShort,
  timeColumn,
  eventColumns,
  eventColumnsAsFactor = FALSE,
  timeAsFactor = FALSE,
  aggTimeFormat = FALSE,
  lastTheoInt = NULL,
  responseAsFactor = FALSE
)

Arguments

`dataShort`	Original data in short format ("class data.frame").
`timeColumn`	Character giving the column name of the observed times ("character vector"). It is required that the observed times are discrete ("integer vector").
`eventColumns`	Character vector giving the column names of the event indicators (excluding censoring column)("character vector"). It is required that all events are binary encoded. If the sum of all event indicators is zero, then this is interpreted as a censored observation. Alternatively a column name of a factor representing competing events can be given. In this case the argument eventColumnsAsFactor has to be set TRUE and the first level is assumed to represent censoring.
`eventColumnsAsFactor`	Should the argument eventColumns be interpreted as column name of a factor variable ("logical vector")? Default is FALSE.
`timeAsFactor`	Should the time intervals be coded as factor ("logical vector")? Default is FALSE. In the default settings the discrete time variable are treated as quantitative.
`aggTimeFormat`	Instead of the usual long format, should every observation have all time intervals ("logical vector")? Default is standard long format. In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index.
`lastTheoInt`	Gives the number of the last theoretic interval ("integer vector"). Only used, if aggTimeFormat is set to TRUE.
`responseAsFactor`	Should the response columns be given as factor ("logical vector")? Default is FALSE.

Details

It is assumed, that only one event happens at a specific time point (competing risks). Either the observation is censored or one of the possible events takes place.

In contrast to continuous survival (see e. g. Surv) the start and stop time notation is not used here. In discrete time survival analysis the only relevant information is to use the stop time. Start time does not matter, because all discrete intervals need to be included in the long data set format to ensure consistent estimation. It is assumed that the supplied data set dataShort contains all repeated measurements of each cluster (e. g. persons). For further information see example Start-stop notation.

Value

Original data set in long format with additional columns

obj Gives identification number of objects (row index in short format) (integer)
timeInt Gives number of discrete time intervals (factor)
responses Columns with dimension count of events + 1 (censoring)
- e0 No event (observation censored in specific interval)
- e1 Indicator of first event, 1 if event takes place and 0 otherwise
- ... ...
- ek Indicator of last k-th event, 1 if event takes place and zero otherwise
If argument responseAsFactor=TRUE, then responses will be coded as factor in one column.

Author(s)

Thomas Welchowski welchow@imbie.meb.uni-bonn.de

References

Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.

Steele F, Goldstein H, Browne W (2004). “A general multilevel multistate competing risks model for event history data, with an application to a study of contraceptive use dynamics.” Statistical Modelling, 4, 145-159.

Narendranathan W, Stewart MB (1993). “Modelling the Probability of Leaving Unemployment: Competing Risks Models with Flexible Base-Line Hazards.” Journal of the Royal Statistical Society Series C, 42, 63-83.

Examples


# Example with unemployment data
library(Ecdat)
data(UnempDur)

# Select subsample
SubUnempDur <- UnempDur [1:100, ]

# Convert competing risk data to long format
SubUnempDurLong <- dataLongCompRisks (dataShort = SubUnempDur, timeColumn = "spell", 
eventColumns = c("censor1", "censor2", "censor3", "censor4"))
head(SubUnempDurLong, 20)

# Fit multinomial logit model with VGAM package
# with one coefficient per response
library(VGAM)
multLogitVGM <- vgam(cbind(e0, e1, e2, e3, e4) ~ timeInt + ui + age + logwage,
                    family = multinomial(refLevel = 1), 
                    data = SubUnempDurLong)
coef(multLogitVGM)

# Alternative: Use nnet
# Convert response to factor
rawResponseMat <- SubUnempDurLong[, c("e0", "e1", "e2", "e3", "e4")]
NewFactor <- factor(unname(apply(rawResponseMat, 1, function(x) which(x == 1))), 
                    labels = colnames(rawResponseMat))

# Include recoded response in data
SubUnempDurLong <- cbind(SubUnempDurLong, NewResp = NewFactor)

# Construct formula of mlogit model
mlogitFormula <- formula(NewResp ~ timeInt + ui + age + logwage)

# Fit multinomial logit model
# with one coefficient per response
library(nnet)
multLogitNNET <- multinom(formula = mlogitFormula, data = SubUnempDurLong)
coef(multLogitNNET)

###########################################################
# Simulation
# Cause specific competing risks in case of right-censoring
# Discrete subdistribution hazards model

# Simulate covariates as multivariate normal distribution
library(mvnfast)
set.seed(1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 4), sigma = diag(4))

# Specification of two discrete cause specific hazards with four intervals
# Event 1
theoInterval <- 4
betaCoef_event1 <- seq(-1, 1, length.out = 5)[-3]
timeInt_event1 <- seq(0.1, -0.1, length.out = theoInterval-1)
linPred_event1 <- c(X %*% betaCoef_event1)
# Event 2
betaCoef_event2 <- seq(-0.5, 0.5, length.out = 5)[-3]
timeInt_event2 <- seq(-0.1, 0.1, length.out = theoInterval-1)
linPred_event2 <- c(X %*% betaCoef_event2)
# Discrete cause specific hazards in last theoretical interval
theoHaz_event1 <- 0.5
theoHaz_event2 <- 0.5

haz_event1_X <- cbind(sapply(1:length(timeInt_event1), 
               function(x) exp(linPred_event1 + timeInt_event1[x]) / 
               (1 + exp(linPred_event1 + timeInt_event1[x]) + 
               exp(linPred_event2 + timeInt_event2[x])) ), theoHaz_event1)

haz_event2_X <- cbind(sapply(1:length(timeInt_event2), 
               function(x) exp(linPred_event2 + timeInt_event2[x]) / 
               (1 + exp(linPred_event1 + timeInt_event1[x]) + 
               exp(linPred_event2 + timeInt_event2[x]) ) ), theoHaz_event2)
allCauseHaz_X <- haz_event1_X + haz_event2_X


pT_X <- t(sapply(1:dim(allCauseHaz_X)[1], function(i) estMargProb(allCauseHaz_X[i, ]) ))


pR_T_X_event1 <- haz_event1_X / (haz_event1_X + haz_event2_X)


survT <- sapply(1:dim(pT_X)[1], function(i) sample(x = 1:(length(timeInt_event1) + 1), 
                                                  size = 1, prob = pT_X[i, ]) )
censT <- sample(x = 1:(length(timeInt_event1)+1), size = dim(pT_X)[1], 
               prob = rep(1/(length(timeInt_event1) + 1), (length(timeInt_event1) + 1)), 
               replace = TRUE)


obsT <- ifelse(survT <= censT, survT, censT)
obsEvent <- rep(0, length(obsT))
obsEvent <- sapply(1:length(obsT), 
                  function(i) if(survT[i] <= censT[i]){
                    return(sample(x = c(1, 2), size=1, 
                    prob = c(pR_T_X_event1[i, obsT[i]  ], 
                    1 - pR_T_X_event1[i, obsT[i]  ]) ))
                  } else{
                    
                    return(0)
                  }
)


# Recode last interval to censored
lastInterval <- obsT == theoInterval
obsT[lastInterval] <- theoInterval - 1
obsEvent[lastInterval] <- 0
obsT <- factor(obsT)
obsEvent <- factor(obsEvent)

datShort <- data.frame(event = factor(obsEvent), time = obsT, X)
datLong <- dataLongCompRisks(dataShort = datShort, timeColumn = "time", 
                            eventColumns = "event", responseAsFactor = TRUE, 
                            eventColumnsAsFactor = TRUE, timeAsFactor = TRUE)


# Estimate discrete cause specific hazard model
library(VGAM)
estModel <- vglm(formula=responses ~ timeInt + X1 + X2 + X3 + X4, data=datLong, 
                family = multinomial(refLevel = 1))


# Mean squared errors per event
coefModels <- coef(estModel)
mean((coefModels[seq(7, length(coefModels), 2)] - betaCoef_event1)^2) # Event 1
mean((coefModels[seq(8, length(coefModels), 2)] - betaCoef_event2)^2) # Event 2
# -> Estimated coefficients are near true coefficients for each event type

[Package discSurv version 2.0.0 Index]