link {mpathsenser} | R Documentation |
Link y to the time scale of x
Description
One of the key tasks in analysing mobile sensing data is being able to link it to other data.
For example, when analysing physical activity data, it could be of interest to know how much
time a participant spent exercising before or after an ESM beep to evaluate their stress level.
link() allows you to map two data frames to each other that are on different time scales,
based on a pre-specified offset before and/or after. This function assumes that both x and y
have a column called time containing DateTimeClasses.
Usage
link(
  x,
  y,
  by = NULL,
  time,
  end_time = NULL,
  y_time,
  offset_before = 0,
  offset_after = 0,
  add_before = FALSE,
  add_after = FALSE,
  name = "data",
  split = by
)
Arguments
x , y |
A pair of data frames or data frame extensions (e.g. a tibble). Both |
by |
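A character vector indicating the variable(s) to match by, typically the participant
IDs. If NULL, the default, the join uses all variables that x and y have in common. To join by
different variables on x and y, use a named vector. To join by multiple variables, use a vector
with length > 1. To perform a cross-join (when x and y have no variables in common), use
by = character(). |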
time |
The name of the column containing the timestamps in x. |
end_time |
Optionally, the name of the column containing the end time in x. |
y_time |
The name of the column containing the timestamps in y. |
offset_before |
The time before each measurement in x within which data of y is matched. |
offset_after |
The time after each measurement in x within which data of y is matched. |
add_before |
Logical value. Do you want to add the last measurement before the start of each interval? |
add_after |
Logical value. Do you want to add the first measurement after the end of each interval? |
name |
The name of the column containing the nested y data. |
split |
An optional grouping variable to split the computation by. When working with large
data sets, the computation can grow so large it no longer fits in your computer's working
memory (after which it will probably fall back on the swap file, which is very slow). Splitting
the computation trades some computational efficiency for a large decrease in RAM usage. This
argument defaults to by. |
Details
y is matched to the time scale of x by means of time windows. These time windows are defined as
the period between x - offset_before and x + offset_after. Note that either offset_before or
offset_after can be 0, but not both. The "interval" of a measurement is therefore the time
window associated with that measurement of x, together with the data of y that falls within
this period. For example, an offset_before of minutes(30) means to match all data of y that
occurred in the 30 minutes before each measurement in x. An offset_after of 900 (i.e. 15
minutes) means to match all data of y that occurred in the 15 minutes after each measurement
in x. When both offset_before and offset_after are specified, all data of y is matched in an
interval of 30 minutes before and 15 minutes after each measurement of x, thus combining the
two arguments.
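The window logic described above can be sketched in base R for a single measurement of x. This is only an illustration under stated assumptions, not link()'s actual implementation: the variable names are hypothetical, offsets are given in seconds, and both window bounds are assumed to be inclusive.

```r
# Hypothetical illustration of the time-window logic; not link()'s internals.
x_time <- as.POSIXct("2021-11-14 13:00:00", tz = "UTC")
y_time <- seq(
  as.POSIXct("2021-11-14 12:00:00", tz = "UTC"),
  by = "10 min",
  length.out = 13
)

offset_before <- 30 * 60 # 30 minutes, in seconds
offset_after <- 15 * 60 # 15 minutes, in seconds

# The window runs from x - offset_before to x + offset_after
window_start <- x_time - offset_before
window_end <- x_time + offset_after

# Assumption: both bounds are treated as inclusive in this sketch
in_window <- y_time >= window_start & y_time <= window_end
matched <- y_time[in_window]
matched
```

Here the window runs from 12:30 to 13:15, so only the y timestamps in that range are kept.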
The arguments add_before and add_after let you decide whether you want to add the last
measurement before the interval and/or the first measurement after the interval respectively.
This could be useful when you want to know which type of event occurred right before or after
the interval of the measurement. For example, at offset_before = "30 minutes", the data may
indicate that a participant was running 20 minutes before a measurement in x. However, with
just that information there is no way of knowing what the participant was doing the first 10
minutes of the interval. The same principle applies to after the interval. When add_before is
set to TRUE, the last measurement of y occurring before the interval of x is added to the
output data as the first row, having the time of x - offset_before (i.e. the start of the
interval). When add_after is set to TRUE, the first measurement of y occurring after the
interval of x is added to the output data as the last row, having the time of x + offset_after
(i.e. the end of the interval). This way, it is easier to calculate the difference to other
measurements of y later (within the same interval). Additionally, an extra column
(original_time) is added in the nested data column, which is the original time of the y
measurement for the added rows and NULL for every other observation. This may be useful to
check if the added measurement isn't too distant (in time) from the others. Note that multiple
rows may be added if there were multiple measurements in y at exactly the same time. Also, if
there already is a row with a timestamp exactly equal to the start of the interval (for
add_before = TRUE) or to the end of the interval (add_after = TRUE), no extra row is added.
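As a rough sketch of what add_before = TRUE does for a single interval, the base R snippet below moves the last measurement before the window to the window start and keeps its real timestamp in original_time. The data and variable names are hypothetical, the code is not taken from link() itself, and missing original_time values are represented as NA here.

```r
# Hypothetical illustration of add_before = TRUE; not link()'s internals.
y <- data.frame(
  time = as.POSIXct(
    c("2021-11-14 12:10:00", "2021-11-14 12:40:00", "2021-11-14 12:55:00"),
    tz = "UTC"
  ),
  activity = c("running", "walking", "still")
)
window_start <- as.POSIXct("2021-11-14 12:30:00", tz = "UTC")
window_end <- as.POSIXct("2021-11-14 13:00:00", tz = "UTC")

# Rows inside the interval carry no original_time (NA in this sketch)
inside <- y[y$time >= window_start & y$time <= window_end, ]
inside$original_time <- as.POSIXct(NA, tz = "UTC")

# Last measurement(s) before the interval, moved to the interval start
before <- y[y$time < window_start, ]
nested <- inside
if (nrow(before) > 0) {
  last_before <- before[before$time == max(before$time), ]
  last_before$original_time <- last_before$time
  last_before$time <- window_start
  nested <- rbind(last_before, inside)
}
nested
```

The first row now says the participant was running at the start of the interval, while original_time shows that this was actually recorded 20 minutes earlier.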
Value
A tibble with the data of x and a new column data (or the column name given in name)
containing the matched data of y according to offset_before and offset_after.
Warning
Note that setting add_before and add_after to TRUE can each add a row to each nested tibble of
the data column. Thus, if you are only interested in the total count (e.g. the number of total
screen changes), remember to set these arguments to FALSE or make sure to keep only the rows
that do not have an original_time. Simply subtracting 1 or 2 does not work, as not all
measurements in x may have a measurement in y before or after (and thus no row is added).
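A minimal sketch of such a count for one nested data frame is shown below. The data are made up, and a missing original_time is assumed to appear as NA, so rows added by add_before or add_after are the ones with a non-missing original_time.

```r
# Hypothetical nested data for one interval; the first row was added by add_before
nested <- data.frame(
  time = as.POSIXct(
    c("2021-11-14 12:30:00", "2021-11-14 12:40:00", "2021-11-14 12:55:00"),
    tz = "UTC"
  ),
  original_time = as.POSIXct(
    c("2021-11-14 12:10:00", NA, NA),
    tz = "UTC"
  )
)

# Count only measurements that truly fall inside the interval
n_observed <- sum(is.na(nested$original_time))
n_observed
```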
Examples
# Define some data
x <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 13:00:00"), by = "1 hour", length.out = 3), 2),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)

# Define some data that we want to link to x
y <- data.frame(
  time = rep(seq.POSIXt(as.POSIXct("2021-11-14 12:50:00"), by = "5 min", length.out = 30), 2),
  participant_id = c(rep("12345", 30), rep("23456", 30)),
  x = rep(1:30, 2)
)

# Now link y within 30 minutes before each row in x
# until the measurement itself:
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)

# We can also link y to a period both before and after
# each measurement in x.
# Also note that time, end_time and y_time accept both
# unquoted names as well as character names.
link(
  x = x,
  y = y,
  by = "participant_id",
  time = "time",
  y_time = "time",
  offset_before = "15 minutes",
  offset_after = "15 minutes"
)

# It can be important to also know the measurements
# just preceding the interval or just after the interval.
# This adds an extra column called 'original_time' in the
# nested data, containing the original timestamp. The
# actual timestamp is set to the start time (add_before)
# or the end time (add_after) of the interval.
link(
  x = x,
  y = y,
  by = "participant_id",
  time = time,
  y_time = time,
  offset_before = "15 minutes",
  offset_after = "15 minutes",
  add_before = TRUE,
  add_after = TRUE
)

# If participant_id is not important to you
# (i.e. the measurements are interchangeable),
# you can ignore it by leaving by empty.
# However, in this case we'll receive a warning
# since x and y have no other columns in common
# (except time, of course). Thus, we can perform
# a cross-join:
link(
  x = x,
  y = y,
  by = character(),
  time = time,
  y_time = time,
  offset_before = "30 minutes"
)

# Alternatively, we can specify custom intervals.
# That is, we can create variable intervals
# without using fixed offsets.
x <- data.frame(
  start_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 12:40:00",
      "2021-11-14 13:30:00",
      "2021-11-14 15:00:00"
    )),
    times = 2
  ),
  end_time = rep(
    x = as.POSIXct(c(
      "2021-11-14 13:20:00",
      "2021-11-14 14:10:00",
      "2021-11-14 15:30:00"
    )),
    times = 2
  ),
  participant_id = c(rep("12345", 3), rep("23456", 3)),
  item_one = rep(c(40, 50, 60), 2)
)
link(
  x = x,
  y = y,
  by = "participant_id",
  time = start_time,
  end_time = end_time,
  y_time = time,
  add_before = TRUE,
  add_after = TRUE
)