after_join {funneljoin} | R Documentation |
Join tables based on one event happening after another
Description
Join two tables based on observations in one table happening after
observations in the other. Each table must have a user_id column,
which must always match for two observations to be joined,
and a time column, which must be greater in y
than in x
for
the two to be joined.
Supports all types of dplyr joins (inner, left, anti, etc.) and requires a
type argument to specify which observations in a funnel get kept
(see details for supported types).
Usage
after_join(
x,
y,
by_time,
by_user,
mode = "inner",
type = "first-first",
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_inner_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_left_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_right_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_full_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_anti_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
after_semi_join(
x,
y,
by_time,
by_user,
type,
max_gap = NULL,
min_gap = NULL,
gap_col = FALSE,
suffix = c(".x", ".y")
)
Arguments
x |
A tbl representing the first event to occur in the funnel. |
y |
A tbl representing an event to occur in the funnel. |
by_time |
A character vector to specify the time columns in x and y. This would typically be a datetime or a date column. These columns are used to filter for time y being after time x. |
by_user |
A character vector to specify the user or identity columns in x and y. |
mode |
The method used to join: "inner", "full", "anti", "semi",
"right", "left". Each also has its own function, such as
|
type |
The type of funnel used to distinguish between event pairs, such as "first-first", "last-first", or "any-firstafter". See details for more. |
max_gap |
Optional: the maximum gap allowed between events. Can be a
integer representing the number of seconds or a difftime object, such as
|
min_gap |
Optional: the maximum gap allowed between events. Can be a
integer representing the number of seconds or a difftime object, such as
|
gap_col |
Whether to include a numeric column, |
suffix |
If there are non-joined duplicate variables in x and y, these suffixes will be added to the output to disambiguate them. Should be a character vector of length 2. |
Details
type
can be any combination of first
, last
, any
, lastbefore
, firstwithin
with first
, last
, any
, firstafter
. Some common ones you may use include:
- first-first
Take the earliest x and y for each user before joining. For example, you want the first time someone entered an experiment, followed by the first time someone ever registered. If they registered, entered the experiment, and registered again, you do not want to include that person.
- first-firstafter
Take the first x, then the first y after that. For example, you want when someone first entered an experiment and the first course they started afterwards. You don't care if they started courses before entering the experiment.
- lastbefore-firstafter
First x that's followed by a y before the next x. For example, in last click paid ad attribution, you want the last time someone clicked an ad before the first subscription they did afterward.
- any-firstafter
Take all Xs followed by the first Y after it. For example, you want all the times someone visited a homepage and their first product page they visited afterwards.
- any-any
Take all Xs followed by all Ys. For example, you want all the times someone visited a homepage and all the product pages they saw afterward.
Examples
library(dplyr)
landed <- tribble(
~user_id, ~timestamp,
1, "2018-07-01",
2, "2018-07-01",
2, "2018-07-01",
3, "2018-07-02",
4, "2018-07-01",
4, "2018-07-04",
5, "2018-07-10",
5, "2018-07-12",
6, "2018-07-07",
6, "2018-07-08"
) %>%
mutate(timestamp = as.Date(timestamp))
registered <- tribble(
~user_id, ~timestamp,
1, "2018-07-02",
3, "2018-07-02",
4, "2018-06-10",
4, "2018-07-02",
5, "2018-07-11",
6, "2018-07-10",
6, "2018-07-11",
7, "2018-07-07"
) %>%
mutate(timestamp = as.Date(timestamp))
after_inner_join(landed, registered, by_user = "user_id",
by_time = "timestamp", type = "first-first")
# You can use different methods of joining:
after_left_join(landed, registered, by_user = "user_id",
by_time = "timestamp", type = "first-first")
after_anti_join(landed, registered, by_user = "user_id",
by_time = "timestamp", type = "any-any")