dataLong {discSurv} | R Documentation |
Data Long Transformation
Description
Transforms right-censored data from short format into long format for discrete survival analysis. The data are assumed to contain no time-varying covariates, e.g. no follow-up visits are allowed. Covariates are assumed to stay constant over the time periods for which no information is available.
Usage
dataLong(
dataShort,
timeColumn,
eventColumn,
timeAsFactor = FALSE,
remLastInt = FALSE,
aggTimeFormat = FALSE,
lastTheoInt = NULL
)
Arguments
dataShort |
Original data in short format ("class data.frame"). |
timeColumn |
Character giving the column name of the observed times. It is required that the observed times are discrete ("integer vector"). |
eventColumn |
Column name of the event indicator ("character vector"). It is required that this is a binary variable with 1=="event" and 0=="censored". |
timeAsFactor |
Should the time intervals be coded as factor ("logical vector")? Default is FALSE. With the default setting, the column is treated as a quantitative variable ("numeric vector"). |
remLastInt |
Should the last theoretical interval be removed in long format ("logical vector")? The default (FALSE) is no deletion. This only matters if the short format data includes the last theoretical interval [a_q, Inf). Every observation that reaches this interval has its event there, so the discrete hazard is always one and these observations have to be excluded from estimation. |
aggTimeFormat |
Instead of the usual long format, should every observation have all time intervals ("logical vector")? Default is standard long format (FALSE). In the case of nonlinear risk score models, the time effect has to be integrated out before these can be applied to the C-index. |
lastTheoInt |
Gives the number of the last theoretical interval ("integer vector"). Only used if argument aggTimeFormat is set to TRUE. |
Details
If the data has continuous survival times, the response may be transformed
to discrete intervals using function contToDisc. If the data set has
time varying covariates the function dataLongTimeDep should be used instead.
In the case of competing risks and no time varying covariates see function
dataLongCompRisks.
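As a rough illustration of the discretization step mentioned above, the following sketch groups continuous times into intervals with base R's cut. It is not the implementation of contToDisc, and the times and interval borders are made-up values.

```r
# Hypothetical continuous survival times and interval borders (assumed values)
contTime <- c(0.4, 2.7, 5.1, 9.9)
limits <- c(0, 1, 3, 6, Inf)

# Assign each time to a left-closed interval [a_(t-1), a_t);
# the final interval [6, Inf) plays the role of the last theoretical interval
discTime <- as.integer(cut(contTime, breaks = limits, right = FALSE))
discTime
```

The resulting integer vector has the form required by argument timeColumn.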
Value
Original data.frame with three additional columns:
- obj Index of persons as integer vector
- timeInt Index of time intervals (numeric vector by default, factor if timeAsFactor is set to TRUE)
- y Response in long format as binary vector. 1=="event happens in period timeInt" and zero otherwise. If argument responseAsFactor is set to TRUE, then responses will be coded as factor in one column.
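The structure of the three additional columns can be reproduced by hand on a toy data set. The sketch below uses only base R and hypothetical data; it mimics the long format described here and is not the code used inside dataLong (for instance, the column order may differ).

```r
# Hypothetical short format data: one row per person (assumed values)
short <- data.frame(time = c(3L, 1L, 2L),
                    event = c(1, 0, 1),
                    age = c(50, 61, 47))

# Repeat each person's row once per interval in which the person is at risk
idx <- rep(seq_len(nrow(short)), times = short$time)
long <- short[idx, ]
rownames(long) <- NULL

long$obj <- idx                        # index of persons
long$timeInt <- sequence(short$time)   # interval index 1, ..., time within person
# y equals 1 only in the last observed interval of a person with an event
long$y <- as.numeric(long$timeInt == long$time & long$event == 1)
long
```

Person 2 is censored in the first interval and therefore contributes a single row with y equal to zero.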
Author(s)
Thomas Welchowski welchow@imbie.meb.uni-bonn.de
Matthias Schmid matthias.schmid@imbie.uni-bonn.de
References
Tutz G, Schmid M (2016). Modeling discrete time-to-event data. Springer Series in Statistics.
Fahrmeir L (2005). “Discrete Survival-Time Models.” In Encyclopedia of Biostatistics, chapter Survival Analysis. John Wiley & Sons.
Thompson Jr. WA (1977). “On the Treatment of Grouped Observations in Life Studies.” Biometrics, 33, 463-470.
See Also
contToDisc, dataLongTimeDep, dataLongCompRisks
Examples
# Example unemployment data
library(discSurv)
library(Ecdat)
data(UnempDur)
# Select subsample
subUnempDur <- UnempDur[1:100, ]
head(subUnempDur)
# Convert to long format
UnempLong <- dataLong(dataShort = subUnempDur, timeColumn = "spell",
  eventColumn = "censor1")
head(UnempLong, 20)
# Does the number of events in y match the event indicator of each person?
splitUnempLong <- split(UnempLong, UnempLong$obj)
all(sapply(splitUnempLong, function (x) sum(x$y))==subUnempDur$censor1) # TRUE
# Second example: Acute Myelogenous Leukemia survival data
library(survival)
head(leukemia)
leukLong <- dataLong(dataShort = leukemia, timeColumn = "time",
  eventColumn = "status", timeAsFactor = TRUE)
head(leukLong, 30)
# Estimate discrete survival model
estGlm <- glm(formula = y ~ timeInt + x, data=leukLong, family = binomial())
summary(estGlm)
# Estimate survival curves for non-maintained chemotherapy
newDataNonMaintained <- data.frame(timeInt = factor(1:161), x = rep("Nonmaintained"))
predHazNonMain <- predict(estGlm, newdata = newDataNonMaintained, type = "response")
predSurvNonMain <- cumprod(1-predHazNonMain)
# Estimate survival curves for maintained chemotherapy
newDataMaintained <- data.frame(timeInt = factor(1:161), x = rep("Maintained"))
predHazMain <- predict(estGlm, newdata = newDataMaintained, type = "response")
predSurvMain <- cumprod(1-predHazMain)
# Compare survival curves
plot(x = 1:50, y = predSurvMain[1:50], xlab = "Time", ylab = "S(t)", las = 1,
  type = "l", main = "Effect of maintained chemotherapy on survival of leukemia patients")
lines(x = 1:161, y = predSurvNonMain, col = "red")
legend("topright", legend = c("Maintained chemotherapy", "Non-maintained chemotherapy"),
  col = c("black", "red"), lty = rep(1, 2))
# Maintained therapy clearly has a positive effect on survival over the time range
##############################################
# Simulation
# Single event in case of right-censoring
# Simulate multivariate normal distribution
library(discSurv)
library(mvnfast)
set.seed(-1980)
X <- mvnfast::rmvn(n = 1000, mu = rep(0, 10), sigma = diag(10))
# Specification of discrete hazards with 11 theoretical intervals
betaCoef <- seq(-1, 1, length.out = 11)[-6]
timeInt <- seq(-1, 1, length.out = 10)
linPred <- c(X %*% betaCoef)
hazTimeX <- cbind(sapply(seq_along(timeInt),
  function(x) exp(linPred + timeInt[x]) / (1 + exp(linPred + timeInt[x]))), 1)
# Simulate discrete survival and censoring times in 10 observed intervals
discT <- rep(NA, dim(hazTimeX)[1])
discC <- rep(NA, dim(hazTimeX)[1])
for (i in seq_len(nrow(hazTimeX))) {
  discT[i] <- sample(1:11, size = 1, prob = estMargProb(haz = hazTimeX[i, ]))
  discC[i] <- sample(1:11, size = 1, prob = rep(1/11, 11))
}
# Calculate observed times, event indicator and specify short data format
eventInd <- discT <= discC
obsT <- ifelse(eventInd, discT, discC)
eventInd[obsT == 11] <- 0
obsT[obsT == 11] <- 10
simDatShort <- data.frame(obsT = obsT, event = as.numeric(eventInd), X)
# Convert data to discrete data long format
simDatLong <- dataLong(dataShort = simDatShort, timeColumn = "obsT", eventColumn = "event",
  timeAsFactor = TRUE)
# Estimate discrete-time continuation ratio model
formSpec <- as.formula(paste("y ~ timeInt + ",
  paste(paste("X", 1:10, sep = ""), collapse = " + "), sep = ""))
modelFit <- glm(formula = formSpec, data = simDatLong, family = binomial(link = "logit"))
summary(modelFit)
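# Compare estimated to true covariate coefficients
# (the covariate effects X1, ..., X10 were simulated with betaCoef, not timeInt)
coefModel <- coef(modelFit)
MSE_covariates <- mean((coefModel[paste0("X", 1:10)] - betaCoef)^2)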
MSE_covariates
# -> Estimated coefficients are near true coefficients
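
The survival curves in the leukemia example and the simulated event times in the simulation both rest on the relation between discrete hazards, the survival function and the marginal event probabilities. The following base-R sketch works through that arithmetic with made-up hazard values; it is an illustration only and does not call any package function.

```r
# Hypothetical discrete hazards for four intervals (assumed values);
# the last theoretical interval has hazard one
haz <- c(0.1, 0.2, 0.5, 1)

# Survival function S(t) = P(T > t) = prod_{s <= t} (1 - haz_s)
surv <- cumprod(1 - haz)

# Marginal probabilities P(T = t) = haz_t * S(t - 1), with S(0) = 1
margProb <- haz * c(1, head(surv, -1))

surv
margProb
sum(margProb)
```

Because the hazard in the last interval equals one, the marginal probabilities sum to one, which is why such a vector can be passed as prob to sample.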