tier_match {fedmatch} | R Documentation |
Perform an iterative match by tier
Description
Constructs a tier_match by running merge_plus
with different parameters sequentially
on the same data. Allows for sequential removal of observations after each tier.
Usage
tier_match(
data1,
data2,
by = NULL,
by.x = NULL,
by.y = NULL,
suffixes = c("_1", "_2"),
check_merge = TRUE,
unique_key_1,
unique_key_2,
tiers = list(),
takeout = "both",
match_type = "exact",
clean = FALSE,
clean_settings = build_clean_settings(),
score_settings = NULL,
filter = NULL,
filter.args = list(),
evaluate = match_evaluate,
evaluate.args = list(),
allow.cartesian = TRUE,
fuzzy_settings = build_fuzzy_settings(),
multivar_settings = build_multivar_settings(),
verbose = FALSE
)
Arguments
data1 |
data.frame. First to-merge dataset. |
data2 |
data.frame. Second to-merge dataset. |
by |
character string. Variables to merge on (common across data 1 and data 2). See |
by.x |
character string. Variable to merge on in data1. See |
by.y |
character string. Variable to merge on in data2. See |
suffixes |
see |
check_merge |
logical. Checks that your unique_keys are indeed unique, and prevents merge from running if merge would result in data.frames larger than 5 million rows |
unique_key_1 |
character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields) |
unique_key_2 |
character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields) |
tiers |
list(). tier is a list of lists, where each list holds the parameters for creating that tier. All arguments to tier_match listed after this argument can either be supplied directly to tier_match, or indirectly via tiers. |
takeout |
character vector, either 'data1', 'data2', 'both', or 'neither'. Removes observations after each tier from the selected dataset. |
match_type |
string. If 'exact', match is exact, if 'fuzzy', match is fuzzy. |
clean |
Boolean, T/F, whether or not to clean strings prior to the match. |
clean_settings |
list. Settings for string cleaning. See |
score_settings |
list. Settings for post-hoc matchscoring. See |
filter |
function or numeric. Filters a merged data1-data2 dataset. If a function, should take in a data.frame (data1 and data2 merged by name1 and name2) and spit out a trimmed version of the data.frame (fewer rows). Think of this function as applying other conditions to matches, other than a match by name. The first argument of filter should be the data.frame. If numeric, will drop all observations with a matchscore lower than or equal to filter. |
filter.args |
list. Arguments passed to filter, if a function |
evaluate |
Function to evaluate merge_plus output. see |
evaluate.args |
list. Arguments passed to function specified by evaluate |
allow.cartesian |
whether or not to allow many-many matches, see data.table::merge() |
fuzzy_settings |
additional arguments for amatch, to be used if match_type = 'fuzzy'. Suggested defaults provided. (see amatch, method='jw') |
multivar_settings |
list of settings to go to the multivar match if match_type
== 'multivar'. See |
verbose |
boolean, whether or not to print tier names and time to match each tier as the matching happens. |
Details
See the tier match vignette to get a clear understanding of the tier_match syntax.
Value
list with matches, data1 and data2 minus matches, and match evaluation
See Also
merge_plus clean_strings