euclidean_right_join {zoomerjoin}    R Documentation

Spatial Right Join Using LSH

Description

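Performs a right join of two data frames on a pair of coordinate columns, matching rows whose Euclidean distance falls below a threshold. Candidate matches are found with locality-sensitive hashing (LSH).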

Usage

euclidean_right_join(
  a,
  b,
  by = NULL,
  threshold = 1,
  n_bands = 30,
  band_width = 5,
  r = 0.5,
  progress = FALSE
)

Arguments

a

the first dataframe you wish to join.

b

the second dataframe you wish to join.

by

a named vector indicating which columns to join on. The format is the same as in dplyr: by = c("column_name_in_df_a" = "column_name_in_df_b"), but two columns must be specified in each dataset (an x column and a y column). Specifications made with dplyr::join_by() are also accepted.

threshold

the Euclidean distance threshold below which units are considered a match (default is 1).

n_bands

the number of bands used in the LSH algorithm (default is 30). Use this in conjunction with the band_width to determine the performance of the hashing.

band_width

the length of each band used in the hashing algorithm (default is 5). Use this in conjunction with n_bands to determine the performance of the hashing.

r

the r hyperparameter used to govern the sensitivity of the locality sensitive hash, as described in

progress

set to TRUE to print progress
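
How n_bands and band_width trade off against each other can be sketched with the textbook LSH banding formula. The sketch assumes that a single hash collides with probability p for a given pair of points and that a pair becomes a candidate when all hashes agree in at least one band. candidate_prob is a hypothetical helper for illustration, not part of zoomerjoin, and the package internals may differ.

```r
# Hypothetical helper (not a zoomerjoin function): probability that a pair
# becomes a candidate match under the standard banding scheme, where p is
# the assumed per-hash collision probability for that pair.
candidate_prob <- function(p, n_bands = 30, band_width = 5) {
  # p^band_width: all hashes in one band agree
  # 1 - (1 - p^band_width)^n_bands: at least one of the bands agrees
  1 - (1 - p^band_width)^n_bands
}

p <- c(0.2, 0.5, 0.8, 0.95)

# Defaults from the Usage section
print(candidate_prob(p))

# More bands: more pairs are compared (fewer misses, more work)
print(candidate_prob(p, n_bands = 60))

# Wider bands: fewer pairs are compared (less work, more misses)
print(candidate_prob(p, band_width = 10))
```

Under these assumptions, raising n_bands increases the chance that true matches are found, while raising band_width reduces the number of spurious comparisons.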

Value

a tibble fuzzily joined on the basis of the variables in by. The function tries to adhere to the same standards as the dplyr joins and uses the same logical joining patterns (e.g. an inner join keeps only observations present in both datasets).
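
The joining pattern can be illustrated with a brute-force version in base R. This is a sketch only: it assumes right-join semantics matching dplyr (every row of b is kept, and unmatched rows of b get NA in the columns of a), and it compares all pairs instead of using LSH. brute_right_join and the column names are hypothetical.

```r
# Hypothetical brute-force reference (not zoomerjoin's implementation):
# compares every row of a with every row of b.
brute_right_join <- function(a, b, cols_a, cols_b, threshold) {
  A <- as.matrix(a[, cols_a])
  B <- as.matrix(b[, cols_b])
  # All (i, j) row pairs and their Euclidean distances
  pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
  d <- sqrt(rowSums((A[pairs$i, , drop = FALSE] - B[pairs$j, , drop = FALSE])^2))
  matched <- pairs[d < threshold, ]
  # Rows of b without any match are kept, padded with NA on the a side
  unmatched_j <- setdiff(seq_len(nrow(b)), matched$j)
  left <- a[c(matched$i, rep(NA_integer_, length(unmatched_j))), , drop = FALSE]
  right <- b[c(matched$j, unmatched_j), , drop = FALSE]
  rownames(left) <- NULL
  rownames(right) <- NULL
  cbind(left, right)
}

a <- data.frame(x_a = c(0, 1, 2), y_a = c(0, 1, 2), id_a = 1:3)
b <- data.frame(x_b = c(0.001, 1.001, 10), y_b = c(0, 1, 10), id_b = 1:3)

res <- brute_right_join(a, b, c("x_a", "y_a"), c("x_b", "y_b"), threshold = 0.01)
print(res)
```

Here the first two rows of b each match one row of a, while the third row of b has no neighbour within the threshold and appears with NA in the columns from a.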

References

Datar, Mayur, Nicole Immorlica, Piotr Indyk, and Vahab Mirrokni. "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions." SCG '04: Proceedings of the Twentieth Annual Symposium on Computational Geometry (2004): 253-262.
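
A single hash from the scheme in the reference above can be sketched in base R as h(x) = floor((a . x + b) / r), with a drawn from a Gaussian and b drawn uniformly from [0, r). The sketch assumes that the r argument of this function plays the role of the bucket width in that formula. pstable_hash is a hypothetical helper, not a zoomerjoin function.

```r
set.seed(1)

# Hypothetical helper (not zoomerjoin's code): one p-stable (Gaussian) hash.
# X is a numeric matrix with one point per row.
pstable_hash <- function(X, a, b, r) {
  # Project each point onto the random direction a, shift by b,
  # and cut the line into buckets of width r
  floor((X %*% a + b) / r)
}

d <- 2                 # two coordinates: x and y
r <- 0.5               # bucket width (assumed meaning of the r argument)
a <- rnorm(d)          # random projection direction
b <- runif(1, 0, r)    # random offset in [0, r)

X <- rbind(
  c(0.10, 0.10),
  c(0.10 + 1e-7, 0.10 + 1e-7),  # very close to the first point
  c(5, 5)                       # far from the first two points
)

h <- pstable_hash(X, a, b, r)
print(h)
```

Points that are close together usually fall into the same bucket, and a larger r makes buckets wider, so more distant pairs collide.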

Examples

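n <- 10

# Two-column matrix of n evenly spaced points on the diagonal of the unit square
X_1 <- matrix(c(seq(0, 1, 1 / (n - 1)), seq(0, 1, 1 / (n - 1))), nrow = n)
# Shift every coordinate slightly so each point in X_2 has one near neighbour in X_1
X_2 <- X_1 + .0000001
X_1 <- as.data.frame(X_1)
X_2 <- as.data.frame(X_2)

# Add identifier columns to tell the two datasets apart
X_1$id_1 <- 1:n
X_2$id_2 <- 1:n

# Join on the coordinate columns V1 and V2
euclidean_right_join(X_1, X_2, by = c("V1", "V2"), threshold = .00005)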



[Package zoomerjoin version 0.1.4 Index]