lexpand {popEpi} | R Documentation |
Split case-level observations
Description
Given subject-level data, data is split by calendar time (per), age, and follow-up time (fot, from 0 to the end of follow-up) into subject-time-interval rows according to given breaks and additionally processed if requested.
Usage
lexpand(
data,
birth = NULL,
entry = NULL,
exit = NULL,
event = NULL,
status = status != 0,
entry.status = NULL,
breaks = list(fot = c(0, Inf)),
id = NULL,
overlapping = TRUE,
aggre = NULL,
aggre.type = c("unique", "cartesian"),
drop = TRUE,
pophaz = NULL,
pp = TRUE,
subset = NULL,
merge = TRUE,
verbose = FALSE,
...
)
Arguments
data |
dataset of e.g. cancer cases as rows |
birth |
birth time in date format or fractional years; string, symbol or expression |
entry |
entry time in date format or fractional years; string, symbol or expression |
exit |
exit from follow-up time in date format or fractional years; string, symbol or expression |
event |
advanced: time of possible event differing from |
status |
variable indicating type of event at |
entry.status |
input in the same way as |
breaks |
a named list of vectors of time breaks;
e.g. |
id |
optional; an id variable; e.g. |
overlapping |
advanced, logical; if |
aggre |
e.g. |
aggre.type |
one of |
drop |
logical; if |
pophaz |
a dataset of population hazards to merge with split data; see Details |
pp |
logical; if |
subset |
a logical vector or any logical condition; data is subsetted before splitting accordingly |
merge |
logical; if |
verbose |
logical; if |
... |
e.g. |
Details
Basics
lexpand splits a given data set (with e.g. cancer diagnoses as rows) into subintervals of time over calendar time, age, and follow-up time with given time breaks using splitMulti.
The dataset must contain appropriate Date / IDate / date format or other numeric variables that can be used as the time variables.
You may take a look at a simulated cohort sire as an example of the minimum required information for processing data with lexpand.
Many arguments can be supplied as a character string naming the appropriate variable (e.g. "sex"), as a symbol (e.g. sex) or as an expression (e.g. factor(sex, 0:1, c("m", "f"))) for flexibility.
Breaks
You should define all breaks as left inclusive and right exclusive time points (e.g. [a,b)) for 1-3 time dimensions so that the last member of a breaks vector is a meaningful "final upper limit", e.g. per = c(2002,2007,2012) to create a last subinterval of the form [2007,2012).
All breaks are explicit, i.e. if drop = TRUE, any data beyond the outermost break points are dropped. If one wants to have unspecified upper / lower limits on one time scale, use Inf: e.g. breaks = list(fot = 0:5, age = c(0,45,Inf)).
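The left-inclusive, right-exclusive convention can be illustrated in base R without popEpi. The following is a minimal sketch: the values in per_times are made up for illustration, and the in_window check only mimics the idea of dropping data outside the outermost breaks. It is not lexpand's actual implementation.

```r
# Breaks from the text: two subintervals, [2002,2007) and [2007,2012)
per_breaks <- c(2002, 2007, 2012)

# Hypothetical calendar time points in fractional years (illustration only)
per_times <- c(2001.5, 2002, 2006.99, 2007, 2012)

# right = FALSE makes the intervals left-inclusive and right-exclusive
per_cat <- cut(per_times, breaks = per_breaks, right = FALSE)

# Points outside the outermost breaks become NA; flag the ones inside
in_window <- !is.na(per_cat)

# With Inf as the last break, no upper limit is imposed on the age scale
age_cat <- cut(c(10, 45, 101), breaks = c(0, 45, Inf), right = FALSE)
```

Note that 2012 itself falls outside the window, because the last break is an exclusive upper limit.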
Breaks for per can also be given in Date/IDate/date format, whereupon they are converted to fractional years before being used in splitting.
The age time scale can additionally be automatically split into common age grouping schemes by naming the scheme with an appropriate character string:
- "18of5": age groups 0-4, 5-9, 10-14, ..., 75-79, 80-84, 85+
- "20of5": age groups 0-4, 5-9, 10-14, ..., 85-89, 90-94, 95+
- "101of1": age groups 0, 1, 2, ..., 98, 99, 100+
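As a rough guide, the named schemes appear to correspond to the explicit break vectors below. This is an inference from the age groups listed above, not a copy of popEpi's internal definitions; the name age_breaks is hypothetical.

```r
# Assumed equivalents of the named schemes, inferred from the listed age groups
age_breaks <- list(
  "18of5"  = c(seq(0, 85, by = 5), Inf),  # 0-4, ..., 80-84, 85+
  "20of5"  = c(seq(0, 95, by = 5), Inf),  # 0-4, ..., 90-94, 95+
  "101of1" = c(0:100, Inf)                # 0, 1, ..., 99, 100+
)

# Number of intervals defined by each break vector
n_groups <- sapply(age_breaks, function(b) length(b) - 1L)
```

The interval counts match the numbers in the scheme names (18, 20 and 101).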
Time variables
If any of the given time variables (birth, entry, exit, event) are in any kind of date format, they are first coerced to fractional years before splitting using get.yrs (with year.length = "actual").
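To give a feel for what a fractional year is, here is a minimal base R sketch. The helper frac_year is hypothetical and is not how get.yrs is implemented; it assumes that the fraction is the share of the actual calendar year (365 or 366 days) elapsed at the start of the given date.

```r
# Hypothetical helper: date -> fractional year using the actual year length
frac_year <- function(d) {
  d <- as.Date(d)
  y <- as.integer(format(d, "%Y"))
  year_start <- as.Date(paste0(y, "-01-01"))
  next_year_start <- as.Date(paste0(y + 1L, "-01-01"))
  # days elapsed since 1 January divided by the number of days in that year
  y + as.numeric(d - year_start) / as.numeric(next_year_start - year_start)
}

frac_year("2008-01-01")  # start of the year
frac_year("2008-07-02")  # 183 of 366 days elapsed in a leap year
```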
Sometimes in e.g. SIR/SMR calculation one may want the event time to differ from the time of exit from follow-up, if the subject is still considered to be at risk of the event. If event is specified, the transition to status is moved from exit to event using cutLexis. See Examples.
The status variable
The statuses in the expanded output (lex.Cst and lex.Xst) are determined by using either only status or both status and entry.status. If entry.status = NULL, the status at entry is guessed according to the type of variable supplied via status: for numeric variables it will be zero, for factors the first level (levels(status)[1]) and otherwise the first unique value in alphabetical order (sort(unique(status))[1]).
Using numeric or factor status variables is strongly recommended. Logical expressions are also allowed (e.g. status = my_status != 0L) and are converted to integer internally.
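The guessing rule for the entry status can be written out as a small base R function. This is only a sketch of the rule as described above; guess_entry_status is a hypothetical name and not part of popEpi.

```r
# Hypothetical helper mirroring the described rule for entry.status = NULL
guess_entry_status <- function(status) {
  if (is.numeric(status)) {
    return(0)                      # numeric: entry status is zero
  }
  if (is.factor(status)) {
    return(levels(status)[1])      # factor: first level
  }
  sort(unique(status))[1]          # otherwise: first value in sorted order
}

guess_entry_status(c(2, 1, 0))
guess_entry_status(factor(c("dead", "alive"), levels = c("alive", "dead")))
guess_entry_status(c("b", "a"))
```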
Merging population hazard information
To enable computing relative/net survivals with survtab and relpois, lexpand merges appropriate population hazard data (pophaz) to the expanded data before dropping rows outside the specified time window (if drop = TRUE). pophaz must, for this reason, contain at a minimum the variables named agegroup, year, and haz. pophaz may contain additional variables to specify different population hazard levels in different strata; e.g. popmort includes sex. All the strata-defining variables must be present in the supplied data. lexpand will automatically detect variables with common names in the two datasets and merge using them.
Currently year must be an integer variable specifying the appropriate year. agegroup must currently also specify one-year age groups, e.g. popmort specifies 101 age groups of length 1 year. In both year and agegroup variables the values are interpreted as the lower bounds of intervals (and passed on to a cut call). The mandatory variable haz must specify the appropriate average rate at the person-year level; e.g. haz = -log(survProb) where survProb is a one-year conditional survival probability will be the correct hazard specification.
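The relation haz = -log(survProb) can be checked numerically. The survival probabilities below are made up for illustration; the check assumes a constant hazard within each one-year interval.

```r
# Made-up one-year conditional survival probabilities (illustration only)
surv_prob <- c(0.999, 0.98, 0.85)

# Average hazard per person-year implied by each survival probability
haz <- -log(surv_prob)

# Under a constant hazard over one year, exp(-haz * 1) gives back the survival
recovered <- exp(-haz * 1)
```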
The corresponding pophaz population hazard value is merged by using the mid points of the records after splitting as reference values. E.g. if age=89.9 at the start of a 1-year interval, then the reference age value is 90.4 for merging. This way we get a "typical" population hazard level for each record.
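The mid-point example above can be reproduced in base R. This is a sketch of the idea only: the variable names are hypothetical, and the lookup assumes one-year age groups identified by their lower bounds 0 to 100, as described for popmort.

```r
# Record starting at age 89.9 and lasting one year
age_start <- 89.9
interval_length <- 1

# Reference value used for merging: the mid-point of the record
age_mid <- age_start + interval_length / 2

# One-year age groups identified by their lower bounds
agegroup_lower <- 0:100

# Lower bound of the age group that contains the mid-point
agegroup <- agegroup_lower[findInterval(age_mid, agegroup_lower)]
```

Here the record is matched to age group 90 rather than 89, even though it starts at age 89.9.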
Computing Pohar-Perme weights
If pp = TRUE, Pohar-Perme weights (the inverse of cumulative population survival) are computed. This will create the new pp variable in the expanded data. pp is a reserved name and lexpand throws an error if a variable with that name exists in data.
When a survival interval contains one or several rows per subject (e.g. due to splitting by the per scale), pp is cumulated from the beginning of the first record in a survival interval for each subject to the mid-point of the remaining time within that survival interval, and that value is given for every other record that a given person has within the same survival interval.
E.g. with 5 rows of duration 1/5 within a survival interval [0,1), pp is determined for all records by a cumulative population survival from 0 to 0.5. The existing accuracy is used, so that the weight is cumulated first up to the end of the second row and then over the remaining distance to the mid-point (first to 0.4, then to 0.5). This ensures that more accurately merged population hazards are fully used.
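The five-row example can be worked through numerically. The sketch below is one reading of the description, not popEpi's code: it assumes that the population hazard is constant within each row, that the subject enters at the start of the survival interval, and that the weight is the inverse of the population survival cumulated up to the mid-point 0.5. The hazard values are made up.

```r
# Five rows of duration 1/5 within the survival interval [0,1)
starts <- seq(0, 0.8, by = 0.2)
stops  <- starts + 0.2

# Made-up population hazards merged to each row (illustration only)
haz <- c(0.010, 0.012, 0.015, 0.020, 0.025)

# Mid-point of the survival interval, up to which the weight is cumulated
mid <- 0.5

# Time each row contributes before the mid-point: 0.2, 0.2, 0.1, 0, 0
exposure <- pmax(pmin(stops, mid) - starts, 0)

# Cumulative population hazard and survival up to the mid-point
cum_haz  <- sum(haz * exposure)
pop_surv <- exp(-cum_haz)

# Pohar-Perme weight: inverse of the cumulative population survival;
# the same value would be given to all five rows
pp <- 1 / pop_surv
```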
Event not at end of follow-up & overlapping time lines
event may be used if the event indicated by status should occur at a time differing from exit. If event is defined, cutLexis is used on the data set after coercing it to the Lexis format and before splitting. Note that some values of event are allowed to be NA as with cutLexis to accommodate observations without an event occurring.
Additionally, setting overlapping = FALSE ensures that (irrespective of using event) each subject defined by id only has one continuous time line instead of possibly overlapping time lines if there are multiple rows in data by id.
Aggregating
Certain analyses such as SIR/SMR calculations require tables of events and person-years by the unique combinations (interactions) of several variables. For this, aggre can be specified as a list of such variables (preferably factor variables, but this is not mandatory) and any arbitrary functions of the variables at one's disposal. E.g.
aggre = list(sex, agegr = cut(dg_age, 0:100))
would tabulate events and person-years by sex and an ad-hoc age group variable. Every ad-hoc-created variable should be named.
fot, per, and age are special reserved variables which, when present in the aggre list, are output as categories of the corresponding time scale variables by using e.g.
cut(fot, breaks$fot, right=FALSE).
This only works if the corresponding breaks are defined in breaks or via "...". E.g.
aggre = list(sex, fot.int = fot)
with
breaks = list(fot=0:5).
The output variable fot.int in the above example will have the lower limits of the appropriate intervals as values.
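The categorisation and the resulting lower limits can be mimicked in base R. The follow-up times below are made up, and the use of findInterval to extract lower limits is an illustration of the outcome, not necessarily what lexpand does internally.

```r
fot_breaks <- 0:5

# Made-up follow-up times in years (illustration only)
fot <- c(0.3, 1, 4.99)

# Categories as produced by the cut call mentioned above
fot_cat <- cut(fot, fot_breaks, right = FALSE)

# Lower limit of the interval each follow-up time falls into
fot_int <- fot_breaks[findInterval(fot, fot_breaks)]
```

A follow-up time of exactly 1 lands in [1,2), so its lower limit is 1.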
aggre as a named list will output numbers of events and person-years with the given new names as categorizing variable names, e.g.
aggre = list(follow_up = fot, gender = sex, agegroup = age).
The output table has person-years (pyrs) and event counts (e.g. from0to1) as columns. Event counts are the numbers of transitions (lex.Cst != lex.Xst) or the lex.Xst value at a subject's last record (subject possibly defined by id).
If aggre.type = "unique" (alias "non-empty"), the above results are computed for existing combinations of expressions given in aggre, but also for non-existing combinations if aggre.type = "cartesian" (alias "full"). E.g. if a factor variable has levels "a", "b", "c" but the data is limited to only have levels "a", "b" present (more than zero rows have these level values), the former setting only computes results for "a", "b", and the latter also for "c" and any combination with other variables or expressions given in aggre. In essence, "cartesian" forces also combinations of variables used in aggre that have no match in data to be shown in the result.
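The difference between the two aggregation types can be imitated in base R with the "a", "b", "c" example. The data frame d and its person-year values are made up; aggregate and xtabs stand in for the two behaviours and are not what lexpand uses internally.

```r
# Factor with three levels, but only "a" and "b" occur in the data
d <- data.frame(
  grp  = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  pyrs = c(1.5, 2.0, 0.5)
)

# "unique"-like result: only combinations that exist in the data
uniq <- aggregate(pyrs ~ grp, data = d, FUN = sum)

# "cartesian"-like result: every level appears, with zero person-years for "c"
full <- as.data.frame(xtabs(pyrs ~ grp, data = d), responseName = "pyrs")
```

The first table has two rows and the second has three, the extra row being level "c" with zero person-years.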
If aggre is not NULL and pophaz has been supplied, lexpand also aggregates the expected counts of events, which appear in the output data under the reserved name d.exp. Additionally, having pp = TRUE causes lexpand to also compute various Pohar-Perme weighted figures necessary for computing Pohar-Perme net survivals with survtab_ag. This can be slow, so consider what is really needed. The Pohar-Perme weighted figures have the suffix .pp.
Value
If aggre = NULL, returns a data.table or data.frame (depending on options("popEpi.datatable"); see ?popEpi) object expanded to accommodate split observations with time scales as fractional years and pophaz merged in if given. Population hazard levels are stored in the new variable pop.haz, and Pohar-Perme weights in the new variable pp if requested.
If aggre is defined, returns a long-format data.table/data.frame with the variable pyrs (person-years), and variables for the counts of transitions in state or state at end of follow-up formatted fromXtoY, where X and Y are the states transitioned from and to, respectively. The data may also have the columns d.exp for expected numbers of cases and various Pohar-Perme weighted figures as identified by the suffix .pp; see Details.
Author(s)
Joonas Miettinen
See Also
Other splitting functions: splitLexisDT(), splitMulti()
Other aggregation functions: aggre(), as.aggre(), setaggre(), summary.aggre()
Examples
## prepare data for e.g. 5-year cohort survival calculation
x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)),
birth = bi_date, entry = dg_date, exit = ex_date,
status = status != 0, pophaz=popmort)
## prepare data for e.g. 5-year "period analysis" for 2008-2012
BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01"))
x <- lexpand(sire, breaks = BL,
birth = bi_date, entry = dg_date, exit = ex_date,
pophaz=popmort, status = status != 0)
## aggregating
BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01"))
ag <- lexpand(sire, breaks = BL, status = status != 0,
birth = bi_date, entry = dg_date, exit = ex_date,
aggre=list(sex, period = per, surv.int = fot))
## aggregating even more
ag <- lexpand(sire, breaks = BL, status = status != 0,
birth = bi_date, entry = dg_date, exit = ex_date,
aggre=list(sex, period = per, surv.int = fot),
pophaz = popmort, pp = TRUE)
## using "..."
x <- lexpand(sire, fot=0:5, status = status != 0,
birth = bi_date, entry = dg_date, exit = ex_date,
pophaz=popmort)
x <- lexpand(sire, fot=0:5, status = status != 0,
birth = bi_date, entry = dg_date, exit = ex_date,
aggre=list(sex, surv.int = fot))
## using the "event" argument: it just places the transition to given "status"
## at the "event" time instead of at the end, if possible using cutLexis
x <- lexpand(sire, status = status, event = dg_date,
birth = bi_date, entry = dg_date, exit = ex_date)
## aggregating with custom "event" time
## (the transition to status is moved to the "event" time)
x <- lexpand(sire, status = status, event = dg_date,
birth = bi_date, entry = dg_date, exit = ex_date,
per = 1970:2014, age = c(0:100,Inf),
aggre = list(sex, year = per, agegroup = age))