time_episodes {timeplyr}    R Documentation

Episodic calculation of time-since-event data

Description

This function assigns episodes to events based on a pre-defined threshold of a chosen time unit.

Usage

time_episodes(
  data,
  time,
  time_by = NULL,
  window = 1,
  roll_episode = TRUE,
  switch_on_boundary = TRUE,
  fill = 0,
  .add = FALSE,
  event = NULL,
  time_type = getOption("timeplyr.time_type", "auto"),
  .by = NULL
)

Arguments

data

A data frame.

time

Date or datetime variable to use for the episode calculation. Supply the variable using tidyselect notation.

time_by

Time units used to calculate episode flags. If time_by is NULL, a heuristic will try to estimate the highest order time unit associated with the time variable. If specified, time_by must be one of the following three:

  • string, specifying either the unit or the number and unit, e.g. time_by = "days" or time_by = "2 weeks"

  • named list of length one, the unit being the name, and the number the value of the list, e.g. list("days" = 7). For the vectorized time functions, you can supply multiple values, e.g. list("days" = 1:10).

  • Numeric vector. If time_by is a numeric vector and time is not a date/datetime, then arithmetic is used, e.g. time_by = 1.
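The three accepted forms can be illustrated with a small normaliser in base R. This is only a sketch: parse_time_by() is a hypothetical helper written for illustration, not a timeplyr function, and the package's own parsing may differ.

```r
# Hypothetical helper (not part of timeplyr): normalise the three
# documented time_by forms into a list with a unit name and a number.
parse_time_by <- function(time_by) {
  if (is.character(time_by)) {
    # "days" -> 1 days; "2 weeks" -> 2 weeks
    parts <- strsplit(trimws(time_by), "\\s+")[[1]]
    if (length(parts) == 1) {
      list(unit = parts, num = 1)
    } else {
      list(unit = parts[2], num = as.numeric(parts[1]))
    }
  } else if (is.list(time_by)) {
    # list("days" = 7) -> 7 days
    stopifnot(length(time_by) == 1, !is.null(names(time_by)))
    list(unit = names(time_by), num = time_by[[1]])
  } else if (is.numeric(time_by)) {
    # Plain number: no time unit, ordinary arithmetic applies
    list(unit = "numeric", num = time_by)
  } else {
    stop("Unsupported time_by")
  }
}

str(parse_time_by("days"))
str(parse_time_by("2 weeks"))
str(parse_time_by(list("days" = 7)))
str(parse_time_by(1))
```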

window

Single number defining the episode threshold. When roll_episode = TRUE, events with a t_elapsed >= window since the last event are defined as a new episode.
When roll_episode = FALSE, events with a t_elapsed >= window since the first event of the corresponding episode are defined as a new episode.
By default, window = 1, which assigns every event to a new episode.

roll_episode

Logical. Should episodes be calculated using a rolling or fixed window? If TRUE (the default), an amount of time must have passed (>= window) since the last event, with each new event effectively resetting the time at which you start counting.
If FALSE, the elapsed time is fixed and new episodes are defined based on how much cumulative time has passed since the first event of each episode.
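The difference between the two modes can be sketched in base R. The functions rolling_ids() and fixed_ids() below are hypothetical illustrations of the rule described above, assuming sorted numeric event times; they are not the package's implementation.

```r
# Illustrative event times (e.g. days since some origin), already sorted
times <- c(0, 3, 6, 9, 20)
window <- 7

# Rolling window: compare each event with the previous event
rolling_ids <- function(times, window) {
  elapsed <- c(0, diff(times))      # first elapsed value filled with 0
  cumsum(elapsed >= window) + 1L
}

# Fixed window: compare each event with the first event of its episode
fixed_ids <- function(times, window) {
  ids <- integer(length(times))
  start <- times[1]
  id <- 1L
  for (i in seq_along(times)) {
    if (times[i] - start >= window) {
      id <- id + 1L
      start <- times[i]             # this event starts the new episode
    }
    ids[i] <- id
  }
  ids
}

rolling_ids(times, window)  # 1 1 1 1 2
fixed_ids(times, window)    # 1 1 1 2 3
```

In the rolling case the events at 0, 3, 6 and 9 chain together because no single gap reaches 7. In the fixed case the event at 9 is at least 7 units from the episode start at 0, so it opens a new episode.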

switch_on_boundary

When an exact amount of time (specified in time_by) has passed, should there be an increment in ID?
The default is TRUE.
For example, if time_by = "days" and switch_on_boundary = FALSE, > 1 day must have passed, otherwise >= 1 day must have passed.
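A minimal base R sketch of the boundary rule, assuming daily data and a window of 1 day. The variable names are illustrative only.

```r
dates <- as.Date(c("2024-01-01", "2024-01-02", "2024-01-04"))
elapsed <- c(0, as.numeric(diff(dates)))  # days elapsed: 0, 1, 2

# switch_on_boundary = TRUE: exactly 1 day is enough
new_on_boundary <- elapsed >= 1    # FALSE TRUE TRUE

# switch_on_boundary = FALSE: strictly more than 1 day is needed
new_off_boundary <- elapsed > 1    # FALSE FALSE TRUE
```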

fill

Value used to fill the first time-elapsed value. Only applicable when roll_episode = TRUE.
Default is 0.

.add

Should episodic variables be added to the data?
If FALSE (the default), then only the relevant variables are returned.
If TRUE, the episodic variables are added to the original data. In both cases, the order of the data is unchanged.

event

(Optional) List that encodes which rows are events, and which aren't. By default time_episodes() assumes every observation (row) is an event but this need not be the case.
event must be a named list of length 1 where the values of the list element represent the event. For example, if your events were coded as 0 and 1 in a variable named "evt" where 1 represents the event, you would supply event = list(evt = 1).
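As a usage sketch, the call below follows the signature documented on this page. The data frame and its column names are made up for illustration, and the call is guarded so that the snippet still runs when timeplyr is not installed.

```r
# Illustrative data: evt == 1 marks a positive test (the event)
tests <- data.frame(
  date = as.Date("2024-01-01") + c(0, 2, 5, 20, 21),
  evt  = c(1, 0, 1, 1, 0)
)

if (requireNamespace("timeplyr", quietly = TRUE)) {
  # Only rows where evt == 1 are treated as events
  res <- timeplyr::time_episodes(
    tests, date,
    time_by = "days", window = 7,
    event = list(evt = 1)
  )
  print(res)
}
```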

time_type

Time type, either "auto", "duration" or "period". With larger data, it is recommended to use time_type = "duration" for speed and efficiency.

.by

(Optional). A selection of columns to group by for this operation. Columns are specified using tidyselect.

Details

time_episodes() calculates the time elapsed (rolling or fixed) between successive events, and flags these events as episodes or not based on how much time has passed.

One example of episodic analysis is tracking disease infections over time.

In this example, a positive test result represents an event and a new infection represents a new episode.

It is assumed that after a pre-determined amount of time, a positive result represents a new episode of infection.

To perform simple time-since-event analysis, which means one is not interested in episodes, simply use time_elapsed() instead.

To find implicit missing gaps in time, set window to 1 and switch_on_boundary to FALSE. Any event classified as an episode in this scenario is an event following a gap in time.
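The gap-finding idea can be sketched in base R, assuming daily data. This illustrates the rule only and is not the package's implementation.

```r
# Daily observations with two implicit gaps
dates <- as.Date("2024-01-01") + c(0, 1, 2, 5, 6, 10)
elapsed <- c(0, as.numeric(diff(dates)))   # 0 1 1 3 1 4

# window = 1 with switch_on_boundary = FALSE: more than 1 day must pass
follows_gap <- elapsed > 1

format(dates[follows_gap])  # "2024-01-06" "2024-01-11"
```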

The data are always sorted before calculation and then sorted back to the input order.
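The sort-then-restore pattern can be sketched in base R as follows. The example assumes numeric times and a rolling window of 7; how the package does this internally is not shown here.

```r
times <- c(9, 0, 20, 3, 6)          # unsorted input
o <- order(times)                   # permutation that sorts the input
sorted <- times[o]                  # 0 3 6 9 20

# Rolling episode IDs computed on the sorted times
ids_sorted <- cumsum(c(0, diff(sorted)) >= 7) + 1L   # 1 1 1 1 2

# Write the results back to the original positions
ids <- integer(length(times))
ids[o] <- ids_sorted
ids                                 # 1 1 2 1 1
```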

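Four key variables will be calculated:

  • ep_id - An integer identifying which episode each event belongs to, increasing from 1.

  • ep_id_new - An integer flagging the first event of each new episode; within-episode events are 0.

  • t_elapsed - The time elapsed since the last event or, when roll_episode = FALSE, since the first event of the current episode.

  • ep_start - The start date or datetime of the episode.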

data.table and collapse are used for speed and efficiency.

Value

A data.frame in the same order as it was given.

See Also

time_elapsed, time_seq_id

Examples

library(timeplyr)
library(dplyr)
library(nycflights13)
library(lubridate)
library(ggplot2)

# Say we want to flag origin-destination pairs
# that haven't seen departures or arrivals for a week

events <- flights %>%
  mutate(date = as_date(time_hour)) %>%
  group_by(origin, dest) %>%
  time_episodes(date, time_by = "week", window = 1)

# The pooled average time between flights of a specific origin and destination
# is ~ 5.2 hours
# This average is a weighted average of average time between events
# Weighted by the frequency of origin-destination groups (pairs)

# It can be calculated like so:
# flights %>%
#   arrange(origin, dest, time_hour) %>%
#   group_by(origin, dest) %>%
#   mutate(time_diff = time_diff(lag(time_hour), time_hour, "hours")) %>%
#   summarise(n = n(),
#             mean = mean(time_diff, na.rm = TRUE)) %>%
#   ungroup() %>%
#   summarise(pooled_mean = weighted.mean(mean, n, na.rm = TRUE))

events

episodes <- events %>%
  filter(ep_id_new > 1)
nrow(fdistinct(episodes, origin, dest)) # 55 origin-destinations

# As expected, summer months saw the fewest
# dry periods
episodes %>%
  ungroup() %>%
  time_by(ep_start, time_by = "week",
          .name = "ep_start") %>%
  count() %>%
  ggplot(aes(x = ep_start, y = n)) +
  geom_bar(stat = "identity")


[Package timeplyr version 0.8.1 Index]