R: Split case-level observations

lexpand {popEpi}

R Documentation

Split case-level observations

Description

Given subject-level data, data is split by calendar time (per), age, and follow-up time (fot, from 0 to the end of follow-up) into subject-time-interval rows according to given breaks and additionally processed if requested.

Usage

lexpand(
  data,
  birth = NULL,
  entry = NULL,
  exit = NULL,
  event = NULL,
  status = status != 0,
  entry.status = NULL,
  breaks = list(fot = c(0, Inf)),
  id = NULL,
  overlapping = TRUE,
  aggre = NULL,
  aggre.type = c("unique", "cartesian"),
  drop = TRUE,
  pophaz = NULL,
  pp = TRUE,
  subset = NULL,
  merge = TRUE,
  verbose = FALSE,
  ...
)

Arguments

`data`	dataset of e.g. cancer cases as rows
`birth`	birth time in date format or fractional years; string, symbol or expression
`entry`	entry time in date format or fractional years; string, symbol or expression
`exit`	exit from follow-up time in date format or fractional years; string, symbol or expression
`event`	advanced: time of possible event differing from `exit`; typically only used in certain SIR/SMR calculations - see Details; string, symbol or expression
`status`	variable indicating type of event at `exit` or `event`; e.g. `status = status != 0`; expression or quoted variable name
`entry.status`	input in the same way as `status`; status at `entry`; see Details
`breaks`	a named list of vectors of time breaks; e.g. `breaks = list(fot=0:5, age=c(0,45,65,Inf))`; see Details
`id`	optional; an id variable; e.g. `id = my_id`; string, symbol or expression
`overlapping`	advanced, logical; if `FALSE` AND if `data` contains multiple rows per subject, ensures that the timelines of `id`-specific rows do not overlap; this ensures e.g. that person-years are only computed once per subject in a multi-state paradigm
`aggre`	e.g. `aggre = list(sex, fot)`; a list of unquoted variables and/or expressions thereof, which are interpreted as factors; data events and person-years will be aggregated by the unique combinations of these; see Details
`aggre.type`	one of `c("unique","cartesian")`; can be abbreviated; see Details
`drop`	logical; if `TRUE`, drops all resulting rows after splitting that reside outside the time window as defined by the given breaks (all time scales)
`pophaz`	a dataset of population hazards to merge with split data; see Details
`pp`	logical; if `TRUE`, computes Pohar-Perme weights using `pophaz`; adds variable with reserved name `pp`; see Details for computing method
`subset`	a logical vector or any logical condition; data is subsetted before splitting accordingly
`merge`	logical; if `TRUE`, retains all original variables from the data
`verbose`	logical; if `TRUE`, the function is chatty and returns some messages along the way
`...`	e.g. `fot = 0:5`; instead of specifying a `breaks` list, correctly named breaks vectors can be given for `fot`, `age`, and `per`; these override any breaks in the `breaks` list; see Examples

Details

Basics

lexpand splits a given data set (with e.g. cancer diagnoses as rows) to subintervals of time over calendar time, age, and follow-up time with given time breaks using splitMulti.

The dataset must contain appropriate Date / IDate / date format or other numeric variables that can be used as the time variables.

You may take a look at a simulated cohort sire as an example of the minimum required information for processing data with lexpand.

Many arguments can be supplied as a character string naming the appropriate variable (e.g. "sex"), as a symbol (e.g. sex) or as an expression (e.g. factor(sex, 0:1, c("m", "f"))) for flexibility.

Breaks

You should define all breaks as left inclusive and right exclusive time points (e.g.[a,b) ) for 1-3 time dimensions so that the last member of a breaks vector is a meaningful "final upper limit", e.g. per = c(2002,2007,2012) to create a last subinterval of the form [2007,2012).

All breaks are explicit, i.e. if drop = TRUE, any data beyond the outermost breaks points are dropped. If one wants to have unspecified upper / lower limits on one time scale, use Inf: e.g. breaks = list(fot = 0:5, age = c(0,45,Inf)). Breaks for per can also be given in Date/IDate/date format, whereupon they are converted to fractional years before used in splitting.

The age time scale can additionally be automatically split into common age grouping schemes by naming the scheme with an appropriate character string:

"18of5": age groups 0-4, 5-9, 10-14, ..., 75-79, 80-84, 85+
"20of5": age groups 0-4, 5-9, 10-14, ..., 85-89, 90-94, 95+
"101of1": age groups 0, 1, 2, ..., 98, 99, 100+

Time variables

If any of the given time variables (birth, entry, exit, event) is in any kind of date format, they are first coerced to fractional years before splitting using get.yrs (with year.length = "actual").

Sometimes in e.g. SIR/SMR calculation one may want the event time to differ from the time of exit from follow-up, if the subject is still considered to be at risk of the event. If event is specified, the transition to status is moved to event from exit using cutLexis. See Examples.

The status variable

The statuses in the expanded output (lex.Cst and lex.Xst) are determined by using either only status or both status and entry.status. If entry.status = NULL, the status at entry is guessed according to the type of variable supplied via status: For numeric variables it will be zero, for factors the first level (levels(status)[1]) and otherwise the first unique value in alphabetical order (sort(unique(status))[1]).

Using numeric or factor status variables is strongly recommended. Logical expressions are also allowed (e.g. status = my_status != 0L) and are converted to integer internally.

Merging population hazard information

To enable computing relative/net survivals with survtab and relpois, lexpand merges an appropriate population hazard data (pophaz) to the expanded data before dropping rows outside the specified time window (if drop = TRUE). pophaz must, for this reason, contain at a minimum the variables named agegroup, year, and haz. pophaz may contain additional variables to specify different population hazard levels in different strata; e.g. popmort includes sex. All the strata-defining variables must be present in the supplied data. lexpand will automatically detect variables with common names in the two datasets and merge using them.

Currently year must be an integer variable specifying the appropriate year. agegroup must currently also specify one-year age groups, e.g. popmort specifies 101 age groups of length 1 year. In both year and agegroup variables the values are interpreted as the lower bounds of intervals (and passed on to a cut call). The mandatory variable haz must specify the appropriate average rate at the person-year level; e.g. haz = -log(survProb) where survProb is a one-year conditional survival probability will be the correct hazard specification.

The corresponding pophaz population hazard value is merged by using the mid points of the records after splitting as reference values. E.g. if age=89.9 at the start of a 1-year interval, then the reference age value is 90.4 for merging. This way we get a "typical" population hazard level for each record.

Computing Pohar-Perme weights

If pp = TRUE, Pohar-Perme weights (the inverse of cumulative population survival) are computed. This will create the new pp variable in the expanded data. pp is a reserved name and lexpand throws exception if a variable with that name exists in data.

When a survival interval contains one or several rows per subject (e.g. due to splitting by the per scale), pp is cumulated from the beginning of the first record in a survival interval for each subject to the mid-point of the remaining time within that survival interval, and that value is given for every other record that a given person has within the same survival interval.

E.g. with 5 rows of duration 1/5 within a survival interval [0,1)], pp is determined for all records by a cumulative population survival from 0 to 0.5. The existing accuracy is used, so that the weight is cumulated first up to the end of the second row and then over the remaining distance to the mid-point (first to 0.4, then to 0.5). This ensures that more accurately merged population hazards are fully used.

Event not at end of follow-up & overlapping time lines

event may be used if the event indicated by status should occur at a time differing from exit. If event is defined, cutLexis is used on the data set after coercing it to the Lexis format and before splitting. Note that some values of event are allowed to be NA as with cutLexis to accommodate observations without an event occurring.

Additionally, setting overlapping = FALSE ensures that (irrespective of using event) the each subject defined by id only has one continuous time line instead of possibly overlapping time lines if there are multiple rows in data by id.

Aggregating

Certain analyses such as SIR/SMR calculations require tables of events and person-years by the unique combinations (interactions) of several variables. For this, aggre can be specified as a list of such variables (preferably factor variables but not mandatory) and any arbitrary functions of the variables at one's disposal. E.g.

aggre = list(sex, agegr = cut(dg_age, 0:100))

would tabulate events and person-years by sex and an ad-hoc age group variable. Every ad-hoc-created variable should be named.

fot, per, and age are special reserved variables which, when present in the aggre list, are output as categories of the corresponding time scale variables by using e.g.

cut(fot, breaks$fot, right=FALSE).

This only works if the corresponding breaks are defined in breaks or via "...". E.g.

aggre = list(sex, fot.int = fot) with

breaks = list(fot=0:5).

The output variable fot.int in the above example will have the lower limits of the appropriate intervals as values.

aggre as a named list will output numbers of events and person-years with the given new names as categorizing variable names, e.g. aggre = list(follow_up = fot, gender = sex, agegroup = age).

The output table has person-years (pyrs) and event counts (e.g. from0to1) as columns. Event counts are the numbers of transitions (lex.Cst != lex.Xst) or the lex.Xst value at a subject's last record (subject possibly defined by id).

If aggre.type = "unique" (alias "non-empty"), the above results are computed for existing combinations of expressions given in aggre, but also for non-existing combinations if aggre.type = "cartesian" (alias "full"). E.g. if a factor variable has levels "a", "b", "c" but the data is limited to only have levels "a", "b" present (more than zero rows have these level values), the former setting only computes results for "a", "b", and the latter also for "c" and any combination with other variables or expression given in aggre. In essence, "cartesian" forces also combinations of variables used in aggre that have no match in data to be shown in the result.

If aggre is not NULL and pophaz has been supplied, lexpand also aggregates the expected counts of events, which appears in the output data by the reserved name d.exp. Additionally, having pp = TRUE causes lexpand to also compute various Pohar-Perme weighted figures necessary for computing Pohar-Perme net survivals with survtab_ag. This can be slow, so consider what is really needed. The Pohar-Perme weighted figures have the suffix .pp.

Value

If aggre = NULL, returns a data.table or data.frame (depending on options("popEpi.datatable"); see ?popEpi) object expanded to accommodate split observations with time scales as fractional years and pophaz merged in if given. Population hazard levels in new variable pop.haz, and Pohar-Perme weights as new variable pp if requested.

If aggre is defined, returns a long-format data.table/data.frame with the variable pyrs (person-years), and variables for the counts of transitions in state or state at end of follow-up formatted fromXtoY, where X and Y are the states transitioned from and to, respectively. The data may also have the columns d.exp for expected numbers of cases and various Pohar-Perme weighted figures as identified by the suffix .pp; see Details.

Author(s)

Joonas Miettinen

Examples


## prepare data for e.g. 5-year cohort survival calculation
x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)), 
             birth = bi_date, entry = dg_date, exit = ex_date,
             status =  status != 0, pophaz=popmort)

## prepare data for e.g. 5-year "period analysis" for 2008-2012
BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01"))
x <- lexpand(sire, breaks = BL, 
             birth = bi_date, entry = dg_date, exit = ex_date,
             pophaz=popmort, status =  status != 0)

## aggregating
BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01"))
ag <- lexpand(sire, breaks = BL, status = status != 0, 
             birth = bi_date, entry = dg_date, exit = ex_date,
              aggre=list(sex, period = per, surv.int = fot))

## aggregating even more
ag <- lexpand(sire, breaks = BL, status = status != 0, 
              birth = bi_date, entry = dg_date, exit = ex_date,
              aggre=list(sex, period = per, surv.int = fot),
              pophaz = popmort, pp = TRUE)

## using "..."
x <- lexpand(sire, fot=0:5, status =  status != 0,
             birth = bi_date, entry = dg_date, exit = ex_date,
             pophaz=popmort) 

x <- lexpand(sire, fot=0:5, status =  status != 0, 
             birth = bi_date, entry = dg_date, exit = ex_date,
             aggre=list(sex, surv.int = fot))
             
## using the "event" argument: it just places the transition to given "status"
## at the "event" time instead of at the end, if possible using cutLexis
x <- lexpand(sire, status = status, event = dg_date,
             birth = bi_date, entry = dg_date, exit = ex_date,) 

## aggregating with custom "event" time
## (the transition to status is moved to the "event" time)
x <- lexpand(sire, status = status, event = dg_date, 
             birth = bi_date, entry = dg_date, exit = ex_date,
             per = 1970:2014, age = c(0:100,Inf),
             aggre = list(sex, year = per, agegroup = age))

[Package popEpi version 0.4.12 Index]