multivar_match {fedmatch} | R Documentation |
Matching by computing multivar_scores based on several variables
Description
multivar_match
computes a multivar_score between each pair of observations between
datasets x and y using several variables, then executes a merge by picking the
highest multivar_score pair for each observation in x.
Usage
multivar_match(
data1,
data2,
by = NULL,
by.x = NULL,
by.y = NULL,
unique_key_1,
unique_key_2,
logit = NULL,
missing = FALSE,
wgts = NULL,
compare_type = "diff",
blocks = NULL,
blocks.x = NULL,
blocks.y = NULL,
nthread = 1,
top = 1,
threshold = NULL,
suffixes = c("_1", "_2")
)
Arguments
data1 |
data.frame. First to-merge dataset. |
data2 |
data.frame. Second to-merge dataset. |
by |
character string. Variables to merge on (common across data 1 and data 2). See |
by.x |
character string. Variable to merge on in data1. See |
by.y |
character string. Variable to merge on in data2. See |
unique_key_1 |
character vector. Primary key of data1 that uniquely identifies each row (can be multiple fields) |
unique_key_2 |
character vector. Primary key of data2 that uniquely identifies each row (can be multiple fields) |
logit |
a glm or lm model as a result from a logit regression on a verified dataset. See details. |
missing |
boolean T/F, whether or not to treat missing (NA) observations as its own binary column for each column in by. See details. |
wgts |
rather than a lm model, you can supply weights to calculate multivar_score. Can be weights from |
compare_type |
a vector with the same length as "by" that describes how to compare the variables. Options are "in", "indicator", "substr", "difference", "ratio", "stringdist", and "wgt_jaccard_dist". See the Multivar Matching Vignette for details. |
blocks |
variable present in both data sets to "block" on before computing scores. multivar_scores will only be computed for observations that share a block. See details. |
blocks.x |
name of blocking variables in x. cannot supply both blocks and blocks.x |
blocks.y |
name of blocking variables in y. cannot supply both blocks and blocks.y |
nthread |
integer. Number of cores to use when computing all combinations. See |
top |
integer. Number of matches to return for each observation. |
threshold |
numeric. Minimum score for a match to be included in the result. |
suffixes |
see |
Details
The best way to understand this function is to see the vignette 'Multivar_matching'.
There are two ways of performing this match: either with or without a pre-trained logit.
To use a logit, you must have a verified set of matches. The names of the variables
in this set must match the names of the variables in the data you pass into multivar_match
.
Without a pre-trained logit, you must have a set of weights for each variable that you
want in the comparison. These can either be made up ahead of time, or you can
use a verified set of matches and calculate_weights
.
Value
a data.table, the resultant match, including columns from both data sets.