calculate_weights {fedmatch} | R Documentation |
Calculate weights for computing matchscore
Description
Calculate weights for comparison variables based on m and u
probabilities estimated from a verified dataset.
Usage
calculate_weights(
data,
variables,
compare_type = "stringdist",
suffixes = c("_1", "_2"),
non_negative = FALSE
)
Arguments
data |
data.frame. Verified data. Should have all of the variables you want to calculate weights for from both datasets, named the same with data-specific suffixes. |
variables |
character vector. Names of the variables you want to calculate weights for. |
compare_type |
character vector. One of 'stringdist' (for string variables), 'ratio' or 'difference' (for numerics), 'indicator' (0-1 dummy indicating whether the two values are the same), 'in' (0-1 dummy indicating whether data1 is in data2), and 'substr' (numeric indicating how many digits are the same). |
suffixes |
character vector. Suffixes of the variables that indicate which dataset they come from. Default is c('_1','_2'), as shown in Usage. |
non_negative |
logical. Whether to restrict the weights to be non-negative. The default, FALSE, allows negative weights. |
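The layout that data is described as needing can be sketched in base R as follows. The column names, the example values, and the helper missing_suffixed_columns are all hypothetical and are not part of fedmatch; the sketch only illustrates the "same name plus data-specific suffix" convention.

```r
# Hypothetical verified pairs: each row is a confirmed match between two
# datasets, with every variable present once per suffix.
verified <- data.frame(
  name_1 = c("Acme Corp", "Globex Inc", "Initech"),
  name_2 = c("ACME Corporation", "Globex", "Initech LLC"),
  city_1 = c("Boston", "Chicago", "Austin"),
  city_2 = c("Boston", "Chicago", "Dallas"),
  stringsAsFactors = FALSE
)

vars <- c("name", "city")
suffixes <- c("_1", "_2")

# Hypothetical helper (not a fedmatch function): return the suffixed column
# names that are expected for `variables` but absent from `data`.
missing_suffixed_columns <- function(data, variables, suffixes) {
  expected <- as.vector(outer(variables, suffixes, paste0))
  setdiff(expected, names(data))
}

# An empty result means every variable has a column for both suffixes.
missing_cols <- missing_suffixed_columns(verified, vars, suffixes)
```

Going by the signature in Usage, a frame shaped like this would presumably be passed as calculate_weights(data = verified, variables = vars, suffixes = suffixes), with compare_type chosen to suit the variables.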
Details
This function uses the classic Record Linkage methodology first developed by Fellegi and Sunter.
See Record Linkage. m is the probability that a given link between
observations is a true match, while u is the probability that an unlinked
pair of observations is a true match. calculate_weights computes a
preliminary weight for each variable as log_2(m / u), then rescales these
weights so that they sum to 1. Variables with a higher m and a lower u
therefore receive higher weights, which follows from the definitions.
These weights can then be passed into the score_settings argument of
merge_plus or tier_match, or into the wgts argument of multivar_match.
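The weighting step described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: it assumes the preliminary weight takes the standard Fellegi-Sunter form log2(m / u), and the m and u values are made-up numbers for three hypothetical variables.

```r
# Illustrative (made-up) m and u probabilities for three hypothetical variables.
m <- c(name = 0.95, city = 0.80, zip = 0.90)
u <- c(name = 0.05, city = 0.20, zip = 0.10)

# Preliminary weight per variable: log base 2 of the m/u ratio
# (assumed form, following the usual Fellegi-Sunter weight).
preliminary <- log2(m / u)

# Rescale so that the final weights sum to 1.
w <- preliminary / sum(preliminary)
```

With these numbers, name has the largest m/u ratio and so receives the largest weight, and city the smallest. If m fell below u for a variable, its preliminary weight would be negative, which is the case the non_negative argument addresses.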
Value
A list containing the m probabilities, the u probabilities, the w weights, and settings, the list built from the calculated weights that merge_plus requires as its score_settings argument.