stringdist_join {fuzzyjoin} | R Documentation |
Join two tables based on fuzzy string matching of their columns
Description
Join two tables based on fuzzy string matching of their columns. This is useful, for example, in matching free-form inputs in a survey or online form, where it can catch misspellings and small personal changes.
Usage
stringdist_join(
x,
y,
by = NULL,
max_dist = 2,
method = c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw",
"soundex"),
mode = "inner",
ignore_case = FALSE,
distance_col = NULL,
...
)
stringdist_inner_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_left_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_right_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_full_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_semi_join(x, y, by = NULL, distance_col = NULL, ...)
stringdist_anti_join(x, y, by = NULL, distance_col = NULL, ...)
Arguments
x |
A tbl |
y |
A tbl |
by |
Columns by which to join the two tables |
max_dist |
Maximum distance to use for joining |
method |
Method for computing string distance; see stringdist-metrics in the stringdist package |
mode |
One of "inner", "left", "right", "full", "semi", or "anti" |
ignore_case |
Whether to ignore case when matching (default FALSE, i.e. matching is case sensitive) |
distance_col |
If given, will add a column with this name containing the string distance between the two matched values |
... |
Arguments passed on to stringdist |
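The following is a minimal sketch of how max_dist, ignore_case, and distance_col interact. The tables x and y and the column name "dist" are hypothetical, invented here for illustration; they are not part of the package.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical toy tables (not from the package)
x <- tibble(name = c("apple", "Banana", "cherry"))
y <- tibble(name = c("aple", "banana", "grape"))

matches <- stringdist_inner_join(
  x, y,
  by = "name",
  max_dist = 1,           # keep pairs at most one edit apart
  ignore_case = TRUE,     # compare "Banana" and "banana" as equal
  distance_col = "dist"   # record the computed distance in a new column
)

# Shared column names are suffixed (name.x, name.y); "dist" holds the distance
matches
```

Here "apple"/"aple" should match at distance 1 and "Banana"/"banana" at distance 0, while "cherry" and "grape" have no partner within one edit.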
Details
If method = "soundex", max_dist is automatically set to 0.5, since soundex returns either a 0 (match) or a 1 (no match).
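As a rough illustration of the soundex behaviour described above, the sketch below first calls stringdist directly and then performs a join. The example names and the tables people and lookup are assumptions chosen for illustration only.

```r
library(dplyr)
library(stringdist)
library(fuzzyjoin)

# Soundex distance is binary: 0 when the phonetic codes agree, 1 otherwise
same_code <- stringdist("Robert", "Rupert", method = "soundex")
diff_code <- stringdist("Robert", "Rubin", method = "soundex")

# Hypothetical tables; only "Robert" shares a Soundex code with "Rupert"
people <- tibble(name = c("Robert", "Rubin"))
lookup <- tibble(name = "Rupert")

# max_dist need not be supplied: with method = "soundex" it is set to 0.5,
# so only pairs with distance 0 are kept
phonetic <- stringdist_inner_join(people, lookup, by = "name", method = "soundex")
phonetic
```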
Examples
library(dplyr)
library(ggplot2)
data(diamonds)
# tibble() replaces the deprecated data_frame()
d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)
# no matches when they are inner-joined:
diamonds %>%
inner_join(d, by = c(cut = "approximate_name"))
# but we can match when they're fuzzy joined
diamonds %>%
stringdist_inner_join(d, by = c(cut = "approximate_name"))
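The other join variants can be used to audit a fuzzy match. The sketch below uses stringdist_anti_join to list approximate names that have no counterpart within a stricter threshold. The cuts table is an assumption: a small character lookup standing in for the distinct values of diamonds$cut, so that the example does not depend on ggplot2.

```r
library(dplyr)
library(fuzzyjoin)

d <- tibble(approximate_name = c("Idea", "Premiums", "Premioom",
                                 "VeryGood", "VeryGood", "Faiir"),
            type = 1:6)

# Assumed stand-in for distinct(diamonds, cut), stored as character
cuts <- tibble(cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"))

# Rows of d with no cut within one edit; max_dist is passed through `...`
unmatched <- d %>%
  stringdist_anti_join(cuts, by = c(approximate_name = "cut"), max_dist = 1)

unmatched
```

With max_dist = 1, "Premioom" is expected to be the only unmatched row, since it is two edits away from "Premium"; at the default max_dist = 2 every row of d finds a match.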
[Package fuzzyjoin version 0.1.6 Index]