euclidean_anti_join {zoomerjoin} | R Documentation |
Fuzzy joins for Euclidean distance using Locality Sensitive Hashing
Description
Fuzzy joins for Euclidean distance using Locality Sensitive Hashing
Usage
euclidean_anti_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)

euclidean_inner_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)

euclidean_left_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)

euclidean_right_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)

euclidean_full_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)
Arguments
a , b |
The two dataframes to join. |
by |
A named vector indicating which columns to join on. The format should
be the same as in dplyr joins. |
threshold |
The distance threshold below which units should be considered a match. Note that, unlike in the Jaccard joins, this value is a distance and not a similarity, so a lower threshold requires matched units to be more similar. |
n_bands |
The number of bands used in the locality sensitive hashing algorithm
(default is 30). Use this in conjunction with the band_width argument. |
band_width |
The length of each band used in the locality sensitive hashing algorithm
(default is 5). Use this in conjunction with the n_bands argument. |
r |
Hyperparameter used to govern the sensitivity of the locality
sensitive hash. Corresponds to the width of the hash bucket in the LSH
algorithm. Increasing values of r widen the buckets, so more pairs of points hash to the same bucket. |
progress |
Set to TRUE to report progress of the algorithm. |
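
To show how r, n_bands and band_width interact, here is a minimal base R sketch of a banded hash in the style of the p-stable scheme cited in the References. It is an illustration under stated assumptions, not zoomerjoin's internal code: the helpers make_hash, make_bands, band_keys and candidate_prob are hypothetical names introduced here.

```r
# Illustrative sketch only; zoomerjoin's internals may differ.
set.seed(1)

# One Gaussian (2-stable) hash: project x onto a random direction, shift by
# a random offset in [0, r), and cut the line into buckets of width r.
make_hash <- function(dim, r) {
  a <- rnorm(dim)      # random projection vector
  b <- runif(1, 0, r)  # random offset
  function(x) floor((sum(a * x) + b) / r)
}

# A band concatenates band_width hashes into a single key. Two points become
# a candidate pair if they share the key of at least one of the n_bands bands.
make_bands <- function(dim, n_bands, band_width, r) {
  lapply(seq_len(n_bands), function(i) {
    hashes <- lapply(seq_len(band_width), function(j) make_hash(dim, r))
    function(x) {
      paste(vapply(hashes, function(h) h(x), numeric(1)), collapse = "_")
    }
  })
}

bands <- make_bands(dim = 2, n_bands = 30, band_width = 5, r = 0.5)
band_keys <- function(x) vapply(bands, function(f) f(x), character(1))

x    <- c(0.10, 0.10)
near <- x + 1e-7     # almost identical point
far  <- c(25, -40)   # distant point

shares_band_near <- any(band_keys(x) == band_keys(near))
shares_band_far  <- any(band_keys(x) == band_keys(far))

# Chance that a pair becomes a candidate, given a per-hash collision
# probability p: all hashes of at least one band must agree.
candidate_prob <- function(p, n_bands, band_width) {
  1 - (1 - p^band_width)^n_bands
}
```

In this sketch, raising n_bands gives a pair more chances to collide, while raising band_width makes each band stricter. A wider bucket r raises the per-hash collision probability p for every pair.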
Value
A tibble fuzzily-joined on the basis of the variables in by. Tries to
adhere to the same standards as the dplyr joins, and uses the same
logical joining patterns (i.e. an inner join keeps only observations
that have a match in both datasets).
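
The joining patterns described above can be sketched with a brute-force comparison in base R. This only illustrates the semantics of the threshold and of the inner and anti joins: it compares every pair of rows, which is the work the hashing is meant to avoid. The data frames and the id_a and id_b columns are made up for the example.

```r
# Brute-force reference for illustration only (quadratic in the row counts).
a <- data.frame(V1 = c(0, 1, 5), V2 = c(0, 1, 5), id_a = 1:3)
b <- data.frame(V1 = c(0.1, 1.0), V2 = c(0.0, 1.2), id_b = 1:2)
threshold <- 0.5

# Every pair of row indices, and the Euclidean distance for each pair
pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
d <- sqrt((a$V1[pairs$i] - b$V1[pairs$j])^2 +
          (a$V2[pairs$i] - b$V2[pairs$j])^2)

# A pair matches when its distance is below the threshold
matches <- pairs[d < threshold, ]

# Inner join: one row per matching pair
inner <- data.frame(id_a = a$id_a[matches$i], id_b = b$id_b[matches$j])

# Anti join: rows of a that have no match in b
anti <- a[!(seq_len(nrow(a)) %in% matches$i), ]
```

Here rows 1 and 2 of a each match the row of b with the same index, so inner has two rows, and anti contains only the row with id_a equal to 3.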
References
Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions." SCG '04: Proceedings of the Twentieth Annual Symposium on Computational Geometry (2004): 253-262.
Examples
n <- 10
# Build two matrices that have close values
X_1 <- matrix(c(seq(0, 1, 1 / (n - 1)), seq(0, 1, 1 / (n - 1))), nrow = n)
X_2 <- X_1 + .0000001
X_1 <- as.data.frame(X_1)
X_2 <- as.data.frame(X_2)
X_1$id_1 <- 1:n
X_2$id_2 <- 1:n
# only keep observations that have a match
euclidean_inner_join(X_1, X_2, by = c("V1", "V2"), threshold = .00005)
# keep all observations from X_1, regardless of whether they have a match
euclidean_left_join(X_1, X_2, by = c("V1", "V2"), threshold = .00005)