R: Episodic calculation of time-since-event data

time_episodes {timeplyr}

R Documentation

Description

This function assigns episodes to events based on a pre-defined threshold of a chosen time unit.

Usage

time_episodes(
  data,
  time,
  time_by = NULL,
  window = 1,
  roll_episode = TRUE,
  switch_on_boundary = TRUE,
  fill = 0,
  .add = FALSE,
  event = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  .by = NULL
)

Arguments

`data`	A data frame.
`time`	Date or datetime variable to use for the episode calculation. Supply the variable using `tidyselect` notation.
`time_by`	Time units used to calculate episode flags. If `time_by` is `NULL`, a heuristic tries to estimate the highest-order time unit associated with the time variable. If specified, `time_by` must be one of the following three: (1) a string specifying either the unit or the number and unit, e.g. `time_by = "days"` or `time_by = "2 weeks"`; (2) a named list of length one, with the unit as the name and the number as the value, e.g. `list("days" = 7)` (for the vectorized time functions, you can supply multiple values, e.g. `list("days" = 1:10)`); (3) a numeric vector. If `time_by` is a numeric vector and `time` is not a date/datetime, then arithmetic is used, e.g. `time_by = 1`.
`window`	Single number defining the episode threshold. When `roll_episode = TRUE`, events with `t_elapsed >= window` since the last event are defined as a new episode. When `roll_episode = FALSE`, events with `t_elapsed >= window` since the first event of the corresponding episode are defined as a new episode. By default, `window = 1`, which starts a new episode at every event that occurs at least one `time_by` unit after the previous one.
`roll_episode`	Logical. Should episodes be calculated using a rolling or fixed window? If `TRUE` (the default), an amount of time must have passed (`>= window`) since the last event, with each new event effectively resetting the time at which you start counting. If `FALSE`, the elapsed time is fixed and new episodes are defined based on how much cumulative time has passed since the first event of each episode.
`switch_on_boundary`	When an exact amount of time (specified in `time_by`) has passed, should there be an increment in ID? The default is `TRUE`. For example, if `time_by = "days"` and `switch_on_boundary = FALSE`, `> 1` day must have passed, otherwise `>= 1` day must have passed.
`fill`	Value used to fill the first time-elapsed value. Only applicable when `roll_episode = TRUE`. Default is `0`.
`.add`	Should episodic variables be added to the data? If `FALSE` (the default), then only the relevant variables are returned. If `TRUE`, the episodic variables are added to the original data. In both cases, the order of the data is unchanged.
`event`	(Optional) List that encodes which rows are events, and which aren't. By default `time_episodes()` assumes every observation (row) is an event but this need not be the case. `event` must be a named list of length 1 where the values of the list element represent the event. For example, if your events were coded as `0` and `1` in a variable named "evt" where `1` represents the event, you would supply `event = list(evt = 1)`.
`time_type`	Time type, either "auto", "duration" or "period". With larger data, it is recommended to use `time_type = "duration"` for speed and efficiency.
`.by`	(Optional). A selection of columns to group by for this operation. Columns are specified using `tidyselect`.

Details

time_episodes() calculates the time elapsed (rolling or fixed) between successive events, and flags these events as episodes or not based on how much time has passed.

A typical application of episodic analysis is tracking disease infections over time.

In this example, a positive test result represents an event and a new infection represents a new episode.

It is assumed that a positive result occurring after a pre-determined amount of time represents a new episode of infection.
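
The difference between a rolling and a fixed window can be sketched in base R. The helper `episode_ids()` below is hypothetical and is not part of timeplyr; it assumes the dates are already sorted, belong to a single group, and that time is measured in days. It illustrates the logic described in the Arguments section, not the package's implementation.

```r
# Hypothetical helper (not part of timeplyr): assign episode IDs to sorted dates.
# `on_boundary` mimics the documented meaning of `switch_on_boundary`.
episode_ids <- function(dates, window, rolling = TRUE, on_boundary = TRUE) {
  days <- as.numeric(dates)          # days since 1970-01-01
  n <- length(days)
  ids <- integer(n)
  if (n == 0L) return(ids)
  ids[1] <- 1L
  ref <- days[1]                     # time point that elapsed time is measured from
  for (i in seq_len(n)[-1]) {
    elapsed <- days[i] - ref
    new_episode <- if (on_boundary) elapsed >= window else elapsed > window
    if (new_episode) {
      ids[i] <- ids[i - 1L] + 1L
      ref <- days[i]                 # a new episode always resets the reference
    } else {
      ids[i] <- ids[i - 1L]
      if (rolling) ref <- days[i]    # rolling: every event resets the reference
    }
  }
  ids
}

# Positive test results on days 0, 5, 10, 15 and 30
positives <- as.Date("2023-01-01") + c(0, 5, 10, 15, 30)

episode_ids(positives, window = 7, rolling = TRUE)   # 1 1 1 1 2
episode_ids(positives, window = 7, rolling = FALSE)  # 1 1 2 2 3
```

With a rolling window the first four positives are chained into one episode because each is only 5 days after the previous one. With a fixed window the third positive is 10 days after the first event of its episode, so it starts a new episode.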

For simple time-since-event analysis, where episodes are not of interest, use time_elapsed() instead.

To find implicit gaps in time, set window to 1 and switch_on_boundary to FALSE. Any event classified as a new episode in this scenario is an event following a gap in time.
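
The gap-finding idea can be illustrated with a small base R sketch. It assumes daily data for a single group and does not call timeplyr; the toy dates are made up for illustration. An event counts as following a gap when strictly more than one day has elapsed since the previous event, which mirrors window = 1 with switch_on_boundary = FALSE.

```r
# Daily observations with two implicit gaps (days 3-4 and days 7-8 are missing)
dates <- as.Date("2023-03-01") + c(0, 1, 2, 5, 6, 9)

# Days elapsed since the previous event; the first value is filled with 0
elapsed <- c(0, diff(as.numeric(dates)))

# Strictly more than 1 day elapsed means the event follows a gap
after_gap <- elapsed > 1

format(dates[after_gap])  # "2023-03-06" "2023-03-10"
```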

The data are always sorted before calculation and then sorted back to the input order.

Four key variables are calculated:

ep_id - An integer variable signifying which episode each event belongs to. Non-events are assigned NA. ep_id is an increasing integer starting at 1. In the infections scenario, 1 denotes positives within the first episode of infection, 2 denotes positives within the second episode of infection, and so on.
ep_id_new - An integer variable signifying the first instance of each new episode. This is an increasing integer where 0 signifies within-episode observations and >= 1 signifies the first instance of the respective episode.
t_elapsed - The time elapsed since the last event. When roll_episode = FALSE, this becomes the time elapsed since the first event of the current episode. Time units are specified in the time_by argument.
ep_start - Start date/datetime of the episode.
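
As a rough illustration of how these variables relate to each other, the following base R sketch derives them for a single sorted group using a rolling 7-day window. It is an assumption-laden approximation of the definitions above (every row is an event, time is measured in days, the first elapsed value is filled with 0), not timeplyr's actual code.

```r
dates  <- as.Date("2023-01-01") + c(0, 2, 12, 13, 40)
window <- 7

# Time elapsed (in days) since the previous event; first value filled with 0
t_elapsed <- c(0, diff(as.numeric(dates)))

# The first event always opens episode 1; later events open a new
# episode when at least `window` days have elapsed
is_new <- c(TRUE, t_elapsed[-1] >= window)

ep_id     <- cumsum(is_new)                # 1 1 2 2 3
ep_id_new <- ifelse(is_new, ep_id, 0L)     # 1 0 2 0 3
ep_start  <- dates[match(ep_id, ep_id)]    # first date within each episode

data.frame(dates, t_elapsed, ep_id, ep_id_new, ep_start)
```

Filtering on ep_id_new > 1 in such a table keeps only the rows that open a second or later episode, which is the pattern used in the Examples section.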

data.table and collapse are used for speed and efficiency.

Value

A data.frame with rows in the same order as the input data.

Examples

library(timeplyr)
library(dplyr)
library(nycflights13)
library(lubridate)
library(ggplot2)

# Say we want to flag origin-destination pairs
# that haven't seen departures or arrivals for a week

events <- flights %>%
  mutate(date = as_date(time_hour)) %>%
  group_by(origin, dest) %>%
  time_episodes(date, time_by = "week", window = 1)

# The pooled average time between flights of a specific origin and destination
# is ~ 5.2 hours
# This average is a weighted average of average time between events
# Weighted by the frequency of origin-destination groups (pairs)

# It can be calculated like so:
# flights %>%
#   arrange(origin, dest, time_hour) %>%
#   group_by(origin, dest) %>%
#   mutate(time_diff = time_diff(lag(time_hour), time_hour, "hours")) %>%
#   summarise(n = n(),
#             mean = mean(time_diff, na.rm = TRUE)) %>%
#   ungroup() %>%
#   summarise(pooled_mean = weighted.mean(mean, n, na.rm = TRUE))

events

episodes <- events %>%
  filter(ep_id_new > 1)
nrow(fdistinct(episodes, origin, dest)) # 55 origin-destinations

# As expected, summer months saw the fewest
# dry periods
episodes %>%
  ungroup() %>%
  time_by(ep_start, time_by = "week",
          .name = "ep_start") %>%
  count() %>%
  ggplot(aes(x = ep_start, y = n)) +
  geom_bar(stat = "identity")

[Package timeplyr version 0.8.1 Index]