R: Use string distances to match on names

fuzzy_match {fedmatch}

R Documentation

Use string distances to match on names

Description

Use the stringdist package to perform a fuzzy match on two datasets.

Usage

fuzzy_match(
  data1,
  data2,
  by = NULL,
  by.x = NULL,
  by.y = NULL,
  suffixes,
  unique_key_1,
  unique_key_2,
  fuzzy_settings = list(method = "jw", p = 0.1, maxDist = 0.05, matchNA = FALSE,
    nthread = getOption("sd_num_thread"))
)

Arguments

`data1`	data.frame. First to-merge dataset.
`data2`	data.frame. Second to-merge dataset.
`by`	character string. Variables to merge on (common across data1 and data2). See `merge`.
`by.x`	character string. Variable to merge on in data1. See `merge`.
`by.y`	character string. Variable to merge on in data2. See `merge`.
`suffixes`	character vector of length 2. Suffixes to add to like-named variables after the merge. See `merge`.
`unique_key_1`	character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields)
`unique_key_2`	character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields)
`fuzzy_settings`	list of arguments to pass to the fuzzy matching function. See `amatch` in the stringdist package.
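
To illustrate how the entries of `fuzzy_settings` behave, the sketch below calls `stringdist::amatch` directly with the default values from the Usage section and then with a looser `maxDist`. The company names are hypothetical, and forwarding the list with `do.call` is an assumption made for illustration, not a description of how fedmatch handles the list internally.

```r
library(stringdist)

# Hypothetical names for illustration only (not from the fedmatch package).
x <- c("Acme Corp", "Globex Corporation")
candidates <- c("Acme Corp.", "Globex Corp")

# Default settings from the Usage section; nthread is left at amatch()'s own default.
settings <- list(method = "jw", p = 0.1, maxDist = 0.05, matchNA = FALSE)

# Assumption: each list entry is passed to amatch() as a named argument.
strict <- do.call(amatch, c(list(x = x, table = candidates), settings))

# A larger maxDist admits pairs that are further apart.
settings$maxDist <- 0.2
loose <- do.call(amatch, c(list(x = x, table = candidates), settings))

print(strict)  # second name has no candidate within maxDist = 0.05, so it is NA
print(loose)   # both names find a candidate once maxDist = 0.2
```

With the stricter threshold only the near-identical pair is matched; raising `maxDist` trades precision for coverage, which is why the threshold is usually the first setting worth tuning.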

Details

amatch, from the stringdist package, computes the string distance between every pair of strings in two vectors, then picks the closest match for each observation. fuzzy_match uses this to perform a string-distance-based match between two datasets. Because every pair is compared, the match can take a long time on large datasets; for quicker matches, try adjusting the nthread argument in fuzzy_settings. The default fuzzy_settings are sensible starting points for company name matching, but changing them can greatly change how the match performs.
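
The pairwise-distance step described above can be sketched with stringdist alone. This is a minimal illustration on hypothetical names, not the fedmatch implementation: it builds the full distance matrix with `stringdistmatrix`, takes the closest column per row, and compares that with what `amatch` returns. The `maxDist = 0.3` value is chosen only to make the example readable.

```r
library(stringdist)

# Hypothetical company names, not taken from the fedmatch package.
names1 <- c("Walmart", "Bershire Hataway", "Apple Computer")
names2 <- c("Apple Inc", "Berkshire Hathaway", "Walmart Stores")

# Jaro-Winkler distance for every pair: rows follow names1, columns follow names2.
dist_mat <- stringdistmatrix(names1, names2, method = "jw", p = 0.1)

# Index of the closest names2 entry for each names1 entry.
closest <- apply(dist_mat, 1, which.min)

# amatch() performs the same lookup, but returns NA when even the
# closest candidate is further away than maxDist.
matched <- amatch(names1, names2, method = "jw", p = 0.1, maxDist = 0.3)

# Side-by-side view of each name, its match (if any), and the best distance.
result <- data.frame(
  name1 = names1,
  name2 = names2[matched],
  dist  = dist_mat[cbind(seq_along(names1), closest)]
)
print(result)
```

The distance matrix has one cell per pair, so its size grows with the product of the two vector lengths; this is the cost that makes large matches slow.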

Value

A data.table: the merged dataset, including all columns from both datasets.

[Package fedmatch version 2.0.6 Index]