link_records {diyar}  R Documentation 
Deterministic and probabilistic record linkage with partial or evaluated matches.
link_records(
attribute,
blocking_attribute = NULL,
cmp_func = diyar::exact_match,
attr_threshold = 1,
probabilistic = TRUE,
m_probability = 0.95,
u_probability = NULL,
score_threshold = 1,
repeats_allowed = FALSE,
permutations_allowed = FALSE,
data_source = NULL,
ignore_same_source = TRUE,
display = "none"
)
links_wf_probabilistic(
attribute,
blocking_attribute = NULL,
cmp_func = diyar::exact_match,
attr_threshold = 1,
probabilistic = TRUE,
m_probability = 0.95,
u_probability = NULL,
score_threshold = 1,
id_1 = NULL,
id_2 = NULL,
...
)
prob_score_range(attribute, m_probability = 0.95, u_probability = NULL)
attribute 

blocking_attribute 

cmp_func 

attr_threshold 

probabilistic 

m_probability 

u_probability 

score_threshold 

repeats_allowed 

permutations_allowed 

data_source 

ignore_same_source 

display 

id_1 

id_2 

... 
Arguments passed to 
link_records()
and links_wf_probabilistic()
are functions to implement deterministic, fuzzy or probabilistic record linkage.
link_records()
compares every record-pair in one instance,
while links_wf_probabilistic()
is a wrapper function of links
and so compares batches of record-pairs in iterations.
link_records()
is more thorough in the sense that it compares every combination of record-pairs.
This makes it faster but memory intensive, particularly if there's no blocking_attribute
.
In contrast, links_wf_probabilistic()
is less memory intensive but takes longer since it does its checks in batches.
The implementation of probabilistic record linkage is based on the Fellegi and Sunter (1969) model for deciding if two records belong to the same entity.
In summary, record-pairs are created and categorised as matches and non-matches (attr_threshold
) with user-defined functions (cmp_func
).
Two probabilities (m
and u
) are then estimated for each record-pair to score the matches and non-matches.
The m
probability is the probability that matched records are actually from the same entity i.e. a true match,
while u
probability is the probability that matched records are not from the same entity i.e. a false match.
By default, u
probabilities are calculated as the frequency of each value of an attribute;
however,
they can also be supplied along with m
probabilities.
Record-pairs whose total scores are above a certain threshold (score_threshold
) are assumed to belong to the same entity.
Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).
For each record-pair, an agreement score for attribute i
is calculated as:
\log_{2}(m_{i}/u_{i})
For each record-pair, a disagreement score for attribute i
is calculated as:
\log_{2}((1-m_{i})/(1-u_{i}))
where m_{i}
and u_{i}
are the m
and u
probabilities for each value of attribute i
.
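As a rough illustration of the two formulas above, the following base R sketch scores one hypothetical record-pair. The helper names (agreement_score, disagreement_score), the probabilities, the match outcomes and the threshold are all assumptions made for this example and are not part of diyar.

```r
# Hypothetical helpers implementing the agreement and disagreement formulas
agreement_score <- function(m, u) log2(m / u)
disagreement_score <- function(m, u) log2((1 - m) / (1 - u))

# Assumed m- and u-probabilities for three attributes of one record-pair
m <- c(0.95, 0.95, 0.95)
u <- c(0.10, 0.25, 0.50)

# Assumed comparison outcome: the first two attributes match, the third does not
is_match <- c(TRUE, TRUE, FALSE)

# Per-attribute scores, then the total score for the pair
scores <- ifelse(is_match, agreement_score(m, u), disagreement_score(m, u))
total_score <- sum(scores)

# The pair is accepted if its total score reaches an assumed threshold of 1
linked <- total_score >= 1
```

Matching attributes contribute positive scores and non-matching attributes contribute negative ones, so rarer agreements (smaller u) add more weight to the total.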
Note that each probability is calculated as a combined probability for the record-pair.
For example, if the values of the record-pair have u
probabilities of 0.1
and 0.2
respectively,
then the u
probability for the pair will be 0.02
.
Missing data (NA
) are considered non-matches and assigned a u
probability of 0
.
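The default u-probabilities and the combined pair probability described above can be sketched in base R as follows. The hair_colour vector is made-up data, and treating the combined probability as a simple product is an assumption inferred from the 0.1 and 0.2 example, not a statement of how diyar computes it internally.

```r
# Hypothetical attribute with ten records
hair_colour <- c("brown", "black", "black", rep("blond", 7))

# Relative frequency of each value, used here as its u-probability
u_lookup <- prop.table(table(hair_colour))
u_values <- as.numeric(u_lookup[hair_colour])

# Combined u-probability for the pair made of records 1 and 2,
# taken as the product of the two values' probabilities (assumption)
u_pair <- u_values[1] * u_values[2]
```

Here "brown" and "black" have frequencies of 0.1 and 0.2, reproducing the combined value of 0.02 from the example in the text.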
By default, matches and non-matches for each attribute
are determined as an exact_match
with a binary outcome.
Alternatively, user-defined functions (cmp_func
) are used to create similarity scores.
Pairs with similarity scores within (attr_threshold
) are then considered matches for the corresponding attribute
.
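To make the idea of a similarity score and a threshold concrete, here is a minimal base R sketch. The comparator edit_similarity, the sample values and the cut-off of 0.75 are hypothetical. It only shows the kind of function and threshold the text describes, not how diyar evaluates attr_threshold internally.

```r
# Hypothetical comparator: similarity in [0, 1] derived from the edit distance
edit_similarity <- function(x, y) {
  d <- mapply(function(a, b) utils::adist(a, b)[1, 1], x, y, USE.NAMES = FALSE)
  1 - d / pmax(nchar(x), nchar(y))
}

# Assumed values of one attribute for two record-pairs
x <- c("smith", "brown")
y <- c("smyth", "green")

# Similarity score per pair, then a match decision using an assumed cut-off
sim <- edit_similarity(x, y)
is_match <- sim >= 0.75
```

The first pair differs by one character out of five and is treated as a match, while the second pair falls below the cut-off.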
If probabilistic
is FALSE
,
the sum of all similarity scores is compared against the score_threshold
instead of a score derived from the m
and u
probabilities.
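A small sketch of this non-probabilistic scoring, using made-up similarity scores and an assumed threshold of 2:

```r
# Assumed similarity scores for three record-pairs (rows)
# across four attributes (columns); 1 = match, 0 = non-match
sim <- rbind(
  c(1, 1, 1, 0),
  c(1, 0, 0, 0),
  c(1, 1, 0, 0)
)

# Total score per record-pair is the sum of its similarity scores
total <- rowSums(sim)

# Pairs reaching the assumed threshold are treated as links
linked <- total >= 2
```

With binary outcomes, the threshold is effectively the minimum number of attributes that must agree.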
A blocking_attribute
can be used to reduce the processing time by restricting comparisons to subsets of the dataset.
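The saving from blocking can be estimated by counting candidate record-pairs. The sketch below uses a hypothetical blocking variable for six records and assumes that only records sharing a block value are compared.

```r
# Hypothetical blocking attribute for six records
block <- c("A", "A", "A", "B", "B", "C")

# Without blocking, every record is paired with every other record
pairs_without_blocking <- choose(length(block), 2)

# With blocking, pairs are only formed within each block
pairs_with_blocking <- sum(choose(as.integer(table(block)), 2))
```

In this toy case the number of comparisons drops from 15 to 4, and the gap widens quickly as the dataset grows.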
In link_records()
, score_threshold
is a convenience argument because every combination of record-pairs is returned;
therefore, a new score_threshold
can be selected after reviewing the final scores.
However, in links_wf_probabilistic()
, the score_threshold
is more important
because a final selection is made at each iteration.
As a result, links_wf_probabilistic()
requires an acceptable score_threshold
in advance.
To help with this, prob_score_range()
can be used to return the range of scores attainable for a given set of attribute
, m
and u
probabilities.
Additionally, id_1
and id_2
can be used to link specific record-pairs, aiding the review of potential scores.
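As a simplified view of what a score range means, the lowest attainable score corresponds to disagreement on every attribute and the highest to agreement on every attribute. The sketch below assumes a single fixed u-probability per attribute, which is a simplification. prob_score_range() derives u-probabilities from the data by default, so its output will generally differ from this hand calculation.

```r
# Assumed m- and u-probabilities for two attributes
m <- c(0.95, 0.95)
u <- c(0.10, 0.25)

# Lowest score: every attribute disagrees; highest score: every attribute agrees
score_range <- c(
  min = sum(log2((1 - m) / (1 - u))),
  max = sum(log2(m / u))
)
```

A workable score_threshold would then be chosen somewhere between these two bounds.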
pid
; list
Fellegi, I. P., & Sunter, A. B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association, 64(328), 1183–1210. https://doi.org/10.1080/01621459.1969.10501049
Asher, J., Resnick, D., Brite, J., Brackbill, R., & Cone, J. (2020). An Introduction to Probabilistic Record Linkage with a Focus on Linkage Processing for WTC Registries. International Journal of Environmental Research and Public Health, 17(18), 6937. https://doi.org/10.3390/ijerph17186937
# Deterministic linkage
dfr <- missing_staff_id[c(2, 4, 5, 6)]
link_records(dfr, attr_threshold = 1, probabilistic = FALSE, score_threshold = 2)
links_wf_probabilistic(dfr, attr_threshold = 1, probabilistic = FALSE,
                       score_threshold = 2, recursive = TRUE)
# Probabilistic linkage
prob_score_range(dfr)
link_records(dfr, attr_threshold = 1, probabilistic = TRUE, score_threshold = 16)
links_wf_probabilistic(dfr, attr_threshold = 1, probabilistic = TRUE,
                       score_threshold = 16, recursive = TRUE)
# Using string comparators
# For example, matching last word in `hair_colour` and `branch_office`
last_word_wf <- function(x) tolower(gsub("^.* ", "", x))
last_word_cmp <- function(x, y) last_word_wf(x) == last_word_wf(y)
link_records(dfr, attr_threshold = 1,
             cmp_func = c(diyar::exact_match,
                          diyar::exact_match,
                          last_word_cmp,
                          last_word_cmp),
             score_threshold = 4)
links_wf_probabilistic(dfr, attr_threshold = 1,
                       cmp_func = c(diyar::exact_match,
                                    diyar::exact_match,
                                    last_word_cmp,
                                    last_word_cmp),
                       score_threshold = 4,
                       recursive = TRUE)