calculate_weights {fedmatch} | R Documentation |
Calculate weights for computing matchscore
Description
Calculate weights for comparison variables based on m and u probabilities estimated from a verified dataset.
Usage
calculate_weights(
  data,
  variables,
  compare_type = "stringdist",
  suffixes = c("_1", "_2"),
  non_negative = FALSE
)
Arguments
data |
data.frame. Verified data. Should have all of the variables you want to calculate weights for from both datasets, named the same with data-specific suffixes. |
variables |
character vector of the variable names of the variables you want to calculate weights for. |
compare_type |
character vector. One of 'stringdist' (for string variables), 'ratio' or 'difference' (for numerics), 'indicator' (0-1 dummy indicating if the two are the same), 'in' (0-1 dummy indicating if data1 is IN data2), and 'substr' (numeric indicating how many digits are the same). |
suffixes |
character vector. Suffixes of the variables that indicate which dataset they are from. Default is c('_1', '_2'). |
non_negative |
logical. Should the weights be restricted to non-negative values? The default, FALSE, allows negative weights. |
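To illustrate the naming convention described for data, variables, and suffixes, here is a minimal base R sketch. The data frame verified, its values, and the helper vector expected_cols are hypothetical. They only show how variable names and suffixes are assumed to combine into column names; any other columns calculate_weights may require are omitted.

```r
# Hypothetical verified data: each variable appears once per source dataset,
# with the same base name and a dataset-specific suffix.
verified <- data.frame(
  name_1 = c("acme corp", "beta llc"),
  name_2 = c("acme corporation", "beta llc"),
  zip_1  = c("20551", "10001"),
  zip_2  = c("20551", "10002"),
  stringsAsFactors = FALSE
)

variables <- c("name", "zip")   # base names, without suffixes
suffixes  <- c("_1", "_2")      # matches the default shown in Usage

# Every variable/suffix combination should exist as a column.
expected_cols <- as.vector(outer(variables, suffixes, paste0))
all_present <- all(expected_cols %in% names(verified))
print(all_present)
```

A check like this can catch a mismatch between the suffixes argument and the actual column names before the weights are calculated.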
Details
This function uses the classic Record Linkage methodology first developed by Fellegi and Sunter.
See Record Linkage. Under this methodology, m is the probability that a variable agrees when a pair of observations is a true match, while u is the probability that it agrees when the pair is not a true match. calculate_weights computes a preliminary weight for each variable as
w = \log_2 \left( \frac{m}{u} \right),
then rescales these weights so that they sum to 1. Thus, variables with higher m and lower u probabilities get higher weights, which follows from the definitions. These weights can then be passed into the score_settings argument of merge_plus or tier_match, or into the wgts argument of multivar_match.
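The weight calculation described above can be sketched in base R. The m and u values below are made up for illustration, and dividing by the total is an assumption about how the weights are made to sum to 1. This is not the package's internal code.

```r
# Hypothetical m and u probabilities for two comparison variables.
m <- c(name = 0.8, zip = 0.6)
u <- c(name = 0.1, zip = 0.3)

# Preliminary weight for each variable: w = log2(m / u).
w_raw <- log2(m / u)

# Rescale so the weights sum to 1 (assumed here to be division by the total).
w <- w_raw / sum(w_raw)

print(round(w, 2))
```

Here name receives a weight of about 0.75 and zip about 0.25, because name has both the higher m and the lower u.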
Value
list with m probabilities, u probabilities, w weights, and settings. The settings element is the list required as input for score_settings in merge_plus, built from the calculated weights.