hamming_inner_join {zoomerjoin} | R Documentation |
Fuzzy joins for Hamming distance using Locality Sensitive Hashing
Description
Find similar rows between two tables using the hamming distance. The hamming distance is equal to the number characters two strings differ by, or is equal to infinity if two strings are of different lengths
Usage
hamming_inner_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 8,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_anti_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_left_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_right_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
hamming_full_join(
a,
b,
by = NULL,
n_bands = 100,
band_width = 100,
threshold = 2,
progress = FALSE,
clean = FALSE,
similarity_column = NULL
)
Arguments
a , b |
The two dataframes to join. |
by |
A named vector indicating which columns to join on. Format should
be the same as dplyr: |
n_bands |
The number of bands used in the locality sensitive hashing
algorithm (default is 100). Use this in conjunction with the
|
band_width |
The length of each band used in the minihashing algorithm
(default is 8). Use this in conjunction with the |
threshold |
The Hamming distance threshold below which two strings should be considered a match. A distance of zero corresponds to complete equality between strings, while a distance of 'x' between two strings means that 'x' substitutions must be made to transform one string into the other. |
progress |
Set to |
clean |
Should the strings that you fuzzy join on be cleaned (coerced to
lower-case, stripped of punctuation and spaces)? Default is |
similarity_column |
An optional character vector. If provided, the data frame will contain a column with this name giving the Hamming distance between the two fields. Extra column will not be present if anti-joining. |
Value
A tibble fuzzily-joined on the basis of the variables in by.
Tries
to adhere to the same standards as the dplyr-joins, and uses the same
logical joining patterns (i.e. inner-join joins and keeps only observations
in both datasets).
Examples
# load baby names data
# install.packages("babynames")
library(babynames)
baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_mispelled <- data.frame(
name_mispelled = gsub("[aeiouy]", "x", baby_names$name)
)
# Run the join and only keep rows that have a match:
hamming_inner_join(
baby_names,
baby_names_mispelled,
by = c("name" = "name_mispelled"),
threshold = 3,
n_bands = 150,
band_width = 10,
clean = FALSE # default
)
# Run the join and keep all rows from the first dataset, regardless of whether
# they have a match:
hamming_left_join(
baby_names,
baby_names_mispelled,
by = c("name" = "name_mispelled"),
threshold = 3,
n_bands = 150,
band_width = 10,
)