vec_locate_matches {vctrs} | R Documentation |
Locate observations matching specified conditions
Description
vec_locate_matches() is a more flexible version of vec_match() used to identify locations where each value of needles matches one or multiple values in haystack. Unlike vec_match(), vec_locate_matches() returns all matches by default, and can match on binary conditions other than equality, such as >, >=, <, and <=.
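As a minimal sketch of the difference described above, the following compares the two functions on small made-up vectors. The inputs and the variable names (`first_only`, `all_matches`, `strictly_greater`) are illustrative assumptions, not part of the package.

```r
library(vctrs)

# Hypothetical inputs: "b" appears twice in the haystack, "z" never does.
needles <- c("a", "b", "z")
haystack <- c("b", "a", "b")

# vec_match() reports only the first haystack location for each needle.
first_only <- vec_match(needles, haystack)

# vec_locate_matches() reports every match, one row per needle/haystack pair.
# The unmatched needle "z" is kept, with NA in the haystack column.
all_matches <- vec_locate_matches(needles, haystack)

# A condition other than equality: haystack locations strictly below 2.
strictly_greater <- vec_locate_matches(2, c(1, 2, 3), condition = ">")

first_only
all_matches
strictly_greater
```

The second needle occupies two rows in `all_matches` because it matches two haystack locations.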
Usage
vec_locate_matches(
  needles,
  haystack,
  ...,
  condition = "==",
  filter = "none",
  incomplete = "compare",
  no_match = NA_integer_,
  remaining = "drop",
  multiple = "all",
  relationship = "none",
  nan_distinct = FALSE,
  chr_proxy_collate = NULL,
  needles_arg = "needles",
  haystack_arg = "haystack",
  error_call = current_env()
)
Arguments
needles , haystack |
Vectors used for matching.
Prior to comparison, |
... |
These dots are for future extensions and must be empty. |
condition |
Condition controlling how
|
filter |
Filter to be applied to the matched results.
Filters don't have any effect on "==" conditions.
A filter can return multiple haystack matches for a particular needle if the maximum or minimum haystack value is duplicated in haystack. |
incomplete |
Handling of missing and incomplete values in needles. |
no_match |
Handling of needles without a match. |
remaining |
Handling of
|
multiple |
Handling of needles with multiple matches. |
relationship |
Handling of the expected relationship between needles and haystack. |
nan_distinct |
A single logical specifying whether or not NaN should be considered distinct from NA. |
chr_proxy_collate |
A function generating an alternate representation of character vectors to use for collation, often used for locale-aware ordering.
For data frames, Common transformation functions include: |
needles_arg , haystack_arg |
Argument tags for needles and haystack. |
error_call |
The execution environment of a currently
running function, e.g. |
Details
vec_match() is identical to (but often slightly faster than):
vec_locate_matches(
  needles,
  haystack,
  condition = "==",
  multiple = "first",
  nan_distinct = TRUE
)
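The equivalence stated above can be checked directly. This is a small sketch on made-up inputs that include both NA and NaN; the variable names `via_match` and `via_locate` are hypothetical.

```r
library(vctrs)

# Hypothetical inputs mixing ordinary values, NA, and NaN.
needles <- c(1, NA, NaN, 5)
haystack <- c(NaN, 1, NA, 1)

via_match <- vec_match(needles, haystack)

via_locate <- vec_locate_matches(
  needles,
  haystack,
  condition = "==",
  multiple = "first",
  nan_distinct = TRUE
)

# With `multiple = "first"` there is exactly one row per needle, so the
# `haystack` column lines up with the output of vec_match().
identical(via_match, via_locate$haystack)
```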
vec_locate_matches() is extremely similar to a SQL join between needles and haystack, with the default being most similar to a left join.
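To illustrate the left join analogy, here is a minimal sketch that joins two small made-up tables by hand. The tables `orders` and `customers` and their columns are assumptions invented for this example.

```r
library(vctrs)

# Hypothetical tables: "ann" appears twice in `customers`, "cat" never does.
orders <- data_frame(id = 1:3, customer = c("ann", "bob", "cat"))
customers <- data_frame(
  customer = c("bob", "ann", "ann"),
  city = c("Oslo", "Rome", "Lima")
)

loc <- vec_locate_matches(orders$customer, customers$customer)

# Slicing both sides by the returned locations aligns them row by row.
# An NA location slices to a missing value, so unmatched orders are kept,
# which is the behavior of a left join.
joined <- data_frame(
  id = vec_slice(orders$id, loc$needles),
  customer = vec_slice(orders$customer, loc$needles),
  city = vec_slice(customers$city, loc$haystack)
)

joined
```

Every row of `orders` survives: the order for "ann" is duplicated once per matching customer row, and the order for "cat" gets a missing `city`.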
Be very careful when specifying match conditions. If a condition is misspecified, it is very easy to accidentally generate a very large number of matches, up to one for every pairing of a needle with a haystack value.
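The size of the problem is easy to see on a small made-up input. In this sketch (the vector `x` and the size `n` are arbitrary assumptions), switching from "==" to ">=" on the same data grows the result from `n` rows to `n * (n + 1) / 2` rows.

```r
library(vctrs)

# Hypothetical input: the integers 1 to n matched against themselves.
n <- 200L
x <- seq_len(n)

# Equality: each needle matches exactly one haystack value.
n_equal <- nrow(vec_locate_matches(x, x, condition = "=="))

# Inequality: needle `i` matches every haystack value from 1 to `i`,
# so the total is 1 + 2 + ... + n.
n_at_least <- nrow(vec_locate_matches(x, x, condition = ">="))

n_equal
n_at_least
n * (n + 1) / 2
```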
Value
A two column data frame containing the locations of the matches.
- needles is an integer vector containing the location of the needle currently being matched.
- haystack is an integer vector containing the location of the corresponding match in the haystack for the current needle.
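A short sketch of inspecting the returned structure, using made-up inputs (the name `loc` is hypothetical):

```r
library(vctrs)

# Hypothetical inputs: 10 has no match, 20 matches haystack locations 1 and 3.
loc <- vec_locate_matches(c(10, 20), c(20, 30, 20))

is.data.frame(loc)
names(loc)

# Both columns hold integer locations, not the matched values themselves.
is.integer(loc$needles)
is.integer(loc$haystack)

loc
```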
Dependencies of vec_locate_matches()
Examples
x <- c(1, 2, NA, 3, NaN)
y <- c(2, 1, 4, NA, 1, 2, NaN)
# By default, for each value of `x`, all matching locations in `y` are
# returned
matches <- vec_locate_matches(x, y)
matches
# The result can be used to slice the inputs to align them
data_frame(
  x = vec_slice(x, matches$needles),
  y = vec_slice(y, matches$haystack)
)
# If multiple matches are present, control which is returned with `multiple`
vec_locate_matches(x, y, multiple = "first")
vec_locate_matches(x, y, multiple = "last")
vec_locate_matches(x, y, multiple = "any")
# Use `relationship` to add constraints and error on multiple matches if
# they aren't expected
try(vec_locate_matches(x, y, relationship = "one-to-one"))
# In this case, the `NA` in `y` matches two rows in `x`
try(vec_locate_matches(x, y, relationship = "one-to-many"))
# By default, `NA` is treated as being identical to `NaN`.
# Using `nan_distinct = TRUE` treats `NA` and `NaN` as different values, so
# `NA` can only match `NA`, and `NaN` can only match `NaN`.
vec_locate_matches(x, y, nan_distinct = TRUE)
# If you never want missing values to match, set `incomplete = NA` to return
# `NA` in the `haystack` column anytime there was an incomplete value
# in `needles`.
vec_locate_matches(x, y, incomplete = NA)
# Using `incomplete = NA` allows us to enforce the one-to-many relationship
# that we couldn't before
vec_locate_matches(x, y, relationship = "one-to-many", incomplete = NA)
# `no_match` allows you to specify the returned value for a needle with
# zero matches. Note that this is different from an incomplete value,
# so specifying `no_match` allows you to differentiate between incomplete
# values and unmatched values.
vec_locate_matches(x, y, incomplete = NA, no_match = 0L)
# If you want to require that every `needle` has at least 1 match, set
# `no_match` to `"error"`:
try(vec_locate_matches(x, y, incomplete = NA, no_match = "error"))
# By default, `vec_locate_matches()` detects equality between `needles` and
# `haystack`. Using `condition`, you can detect where an inequality holds
# true instead. For example, to find every location where `x[[i]] >= y`:
matches <- vec_locate_matches(x, y, condition = ">=")
data_frame(
  x = vec_slice(x, matches$needles),
  y = vec_slice(y, matches$haystack)
)
# You can limit which matches are returned with a `filter`. For example,
# with the above example you can filter the matches returned by `x[[i]] >= y`
# down to only the ones containing the maximum `y` value of those matches.
matches <- vec_locate_matches(x, y, condition = ">=", filter = "max")
# Here, the matches for the `3` needle value have been filtered down to
# only include the maximum haystack value of those matches, `2`. This is
# often referred to as a rolling join.
data_frame(
  x = vec_slice(x, matches$needles),
  y = vec_slice(y, matches$haystack)
)
# In the very rare case that you need to generate locations for a
# cross match, where every value of `x` is forced to match every
# value of `y` regardless of what the actual values are, you can
# replace `x` and `y` with integer vectors of the same size that contain
# a single value and match on those instead.
x_proxy <- vec_rep(1L, vec_size(x))
y_proxy <- vec_rep(1L, vec_size(y))
nrow(vec_locate_matches(x_proxy, y_proxy))
vec_size(x) * vec_size(y)
# By default, missing values will match other missing values when using
# `==`, `>=`, or `<=` conditions, but not when using `>` or `<` conditions.
# This is similar to how `vec_compare(x, y, na_equal = TRUE)` works.
x <- c(1, NA)
y <- c(NA, 2)
vec_locate_matches(x, y, condition = "<=")
vec_locate_matches(x, y, condition = "<")
# You can force missing values to match regardless of the `condition`
# by using `incomplete = "match"`
vec_locate_matches(x, y, condition = "<", incomplete = "match")
# You can also use data frames for `needles` and `haystack`. The
# `condition` will be recycled to the number of columns in `needles`, or
# you can specify varying conditions per column. In this example, we take
# a vector of date `values` and find all locations where each value is
# between lower and upper bounds specified by the `haystack`.
values <- as.Date("2019-01-01") + 0:9
needles <- data_frame(lower = values, upper = values)
set.seed(123)
lower <- as.Date("2019-01-01") + sample(10, 10, replace = TRUE)
upper <- lower + sample(3, 10, replace = TRUE)
haystack <- data_frame(lower = lower, upper = upper)
# (values >= lower) & (values <= upper)
matches <- vec_locate_matches(needles, haystack, condition = c(">=", "<="))
data_frame(
  lower = vec_slice(lower, matches$haystack),
  value = vec_slice(values, matches$needles),
  upper = vec_slice(upper, matches$haystack)
)