R: Fuzzy anti-join using minihashing

jaccard_anti_join {zoomerjoin}

R Documentation

Fuzzy anti-join using minihashing

Description

Fuzzy anti-join using minihashing

Usage

jaccard_anti_join(
  a,
  b,
  by = NULL,
  block_by = NULL,
  n_gram_width = 2,
  n_bands = 50,
  band_width = 8,
  threshold = 0.7,
  progress = FALSE,
  clean = FALSE,
  similarity_column = NULL
)

Arguments

`a`	the first dataframe you wish to join.
`b`	the second dataframe you wish to join.
`by`	a named vector indicating which columns to join on. Format should be the same as dplyr: `by = c("column_name_in_df_a" = "column_name_in_df_b")`, but two columns must be specified in each dataset (x column and y column). Specification made with `dplyr::join_by()` are also accepted.
`block_by`	a named vector indicating which column to block on, such that rows that disagree on this field cannot be considered a match. Format should be the same as dplyr: `by = c("column_name_in_df_a" = "column_name_in_df_b")`
`n_gram_width`	the length of the n_grams used in calculating the jaccard similarity. For best performance, I set this large enough that the chance any string has a specific n_gram is low (i.e. `n_gram_width` = 2 or 3 when matching on first names, 5 or 6 when matching on entire sentences).
`n_bands`	the number of bands used in the minihash algorithm (default is 40). Use this in conjunction with the `band_width` to determine the performance of the hashing. The default settings are for a (.2,.8,.001,.999)-sensitive hash i.e. that pairs with a similarity of less than .2 have a >.1% chance of being compared, while pairs with a similarity of greater than .8 have a >99.9% chance of being compared.
`band_width`	the length of each band used in the minihashing algorithm (default is 8) Use this in conjunction with the `n_bands` to determine the performance of the hashing. The default settings are for a (.2,.8,.001,.999)-sensitive hash i.e. that pairs with a similarity of less than .2 have a >.1% chance of being compared, while pairs with a similarity of greater than .8 have a >99.9% chance of being compared.
`threshold`	the jaccard similarity threshold above which two strings should be considered a match (default is .95). The similarity is euqal to 1 the jaccard distance between the two strings, so 1 implies the strings are identical, while a similarity of zero implies the strings are completely dissimilar.
`progress`	set to `TRUE` to print progress
`clean`	should the strings that you fuzzy join on be cleaned (coerced to lower-case, stripped of punctuation and spaces)? Default is FALSE
`similarity_column`	an optional character vector. If provided, the data frame will contain a column with this name giving the jaccard similarity between the two fields. Extra column will not be present if anti-joining.

Value

a tibble fuzzily-joined on the basis of the variables in by. Tries to adhere to the same standards as the dplyr-joins, and uses the same logical joining patterns (i.e. inner-join joins and keeps only observations in both datasets).

Examples

# load baby names data
#install.packages("babynames")
library(babynames)

baby_names <- data.frame(name = tolower(unique(babynames$name))[1:500])
baby_names_sans_vowels <- data.frame(
                name_wo_vowels =gsub("[aeiouy]","", baby_names$name)
   )
# Check the probability two pairs of strings with
# similarity .8 will be matched with a band width of 30
# and 30 bands using the `jaccard_probability()` function:
jaccard_probability(.8,30,8)
# Run the join:
joined_names <- jaccard_anti_join(
              baby_names,
              baby_names_sans_vowels,
              by = c("name"= "name_wo_vowels"),
              threshold = .8,
              n_bands = 20,
              band_width = 6,
              n_gram_width = 1,
              clean = FALSE # default
              )
joined_names

[Package zoomerjoin version 0.1.4 Index]