fuzzy_match {fedmatch} | R Documentation |
Use string distances to match on names
Description
Use the stringdist package to perform a fuzzy match on two datasets.
Usage
fuzzy_match(
data1,
data2,
by = NULL,
by.x = NULL,
by.y = NULL,
suffixes,
unique_key_1,
unique_key_2,
fuzzy_settings = list(method = "jw", p = 0.1, maxDist = 0.05, matchNA = FALSE,
nthread = getOption("sd_num_thread"))
)
Arguments
data1 |
data.frame. First to-merge dataset. |
data2 |
data.frame. Second to-merge dataset. |
by |
character string. Variables to merge on (common across data1 and data2). See |
by.x |
character string. Variable to merge on in data1. See |
by.y |
character string. Variable to merge on in data2. See |
suffixes |
character vector of length 2. Suffixes to add to like-named variables after the merge. See |
unique_key_1 |
character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields) |
unique_key_2 |
character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields) |
fuzzy_settings |
list of arguments to pass to the fuzzy matching function. See |
Details
stringdist's amatch computes string distances between every pair of strings in two vectors, then picks the closest string pair for each observation in the dataset. fuzzy_match uses this to perform a string distance-based match between two datasets. This process can take quite a long time; for quicker matches, try adjusting the nthread argument in fuzzy_settings.
The default fuzzy_settings are sensible starting points for company name matching,
but adjusting these can greatly change how the match performs.
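The matching step described above can be sketched directly with stringdist. This is a minimal illustration, not the internals of fuzzy_match: the company names are made up, and stringdistmatrix is used here only to make the pairwise distances visible.

```r
library(stringdist)

# Hypothetical company names, invented for this sketch.
names_1 <- c("Walmart Inc", "Microsoft Corp", "Tesla Motors")
names_2 <- c("Walmart Inc.", "Microsoft Corp.", "Apple Inc")

# Jaro-Winkler distances between every pair of strings
# (rows follow names_1, columns follow names_2).
dist_mat <- stringdistmatrix(names_1, names_2, method = "jw", p = 0.1)

# Closest entry of names_2 for each element of names_1, using the same
# values as the default fuzzy_settings. Elements with no candidate
# within maxDist get NA.
idx <- amatch(names_1, names_2,
              method = "jw", p = 0.1, maxDist = 0.05, matchNA = FALSE)

# Pair each name with its match (NA where nothing was close enough).
matched <- data.frame(name_1 = names_1, name_2 = names_2[idx])
print(matched)
```

Raising maxDist admits more distant pairs and lowering it makes the match stricter, so it is worth inspecting the distance matrix on a sample of your own data before settling on a threshold.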
Value
a data.table, the resultant merged data set, including all columns from both data sets.
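A call might look like the sketch below. The two tables and their column names (id_a, name_a, id_b, name_b) are hypothetical, and nthread = 2 is an arbitrary choice; the argument names follow the Usage section above.

```r
library(data.table)
library(fedmatch)

# Hypothetical toy data: one key column and one name column per table.
firms_a <- data.table(id_a = 1:3,
                      name_a = c("Walmart Inc", "Microsoft Corp", "Tesla Motors"))
firms_b <- data.table(id_b = 1:3,
                      name_b = c("Walmart Inc.", "Microsoft Corp.", "Apple Inc"))

# Fuzzy match on the name columns, keeping the documented defaults
# except for nthread, which is set explicitly here.
result <- fuzzy_match(
  data1 = firms_a,
  data2 = firms_b,
  by.x = "name_a",
  by.y = "name_b",
  suffixes = c("_1", "_2"),
  unique_key_1 = "id_a",
  unique_key_2 = "id_b",
  fuzzy_settings = list(method = "jw", p = 0.1, maxDist = 0.05,
                        matchNA = FALSE, nthread = 2)
)

print(result)
```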