links {diyar} | R Documentation |
Multistage record linkage
Description
Assign records to unique groups based on an ordered set of match criteria.
Usage
links(
criteria,
sub_criteria = NULL,
sn = NULL,
strata = NULL,
data_source = NULL,
data_links = "ANY",
display = "none",
group_stats = FALSE,
expand = TRUE,
shrink = FALSE,
recursive = "none",
check_duplicates = FALSE,
tie_sort = NULL,
batched = "yes",
repeats_allowed = FALSE,
permutations_allowed = FALSE,
ignore_same_source = FALSE
)
Arguments
criteria
sub_criteria
sn
strata
data_source
data_links
display
group_stats
expand
shrink
recursive
check_duplicates
tie_sort
batched
repeats_allowed
permutations_allowed
ignore_same_source
Details
The priority of matches decreases with each subsequent stage of the linkage process. Therefore, the attributes in criteria should be in an order of decreasing relevance.
Records with missing data (NA) for a given criteria are skipped at the respective stage, while records with missing data in strata are skipped from every stage.
If a record is skipped from a stage, another attempt will be made to match the record at the next stage. If a record is still unmatched by the last stage, it is assigned a unique group ID.
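The skipping rule above can be tried on a toy input. This is a minimal sketch, assuming the diyar package is installed; the four records and their values are hypothetical and chosen only so that two records have missing data at the first stage.

```r
library(diyar)

# Toy attributes for four records (hypothetical values)
stage_1 <- c("A", "A", NA, NA)    # records 3 and 4 are skipped at stage 1
stage_2 <- c("X", "Y", "Y", "Z")  # every record is considered at stage 2

# Stage 1 is tried first, then stage 2
p <- links(criteria = list(stage_1, stage_2))

# One group identifier per record
p
```

Printing p shows which records ended up in the same group and which were left in a group of their own.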
A sub_criteria adds nested match criteria to each stage of the linkage process. If used, only records with a matching criteria and sub_criteria are linked.
In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a named element of a list, where the name is "cr" concatenated with the corresponding stage's number.
For example, 3 sub_criteria linked to criteria 1, 5 and 13 will be:
list(cr1 = sub_criteria(...), cr5 = sub_criteria(...), cr13 = sub_criteria(...))
Any unlinked sub_criteria will be ignored.
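The naming convention can also be built programmatically. The sketch below uses base R only; the list elements are placeholder strings standing in for real sub_criteria(...) objects, and the variable names are hypothetical.

```r
# Stage numbers that the sub_criteria belong to (from the example above)
stage_numbers <- c(1, 5, 13)

# Placeholders standing in for sub_criteria(...) objects (hypothetical)
sub_cri_objects <- list("for stage 1", "for stage 5", "for stage 13")

# Name each element "cr" followed by its stage number
named_sub_cri <- setNames(sub_cri_objects, paste0("cr", stage_numbers))

names(named_sub_cri)
```

Building the names from the stage numbers avoids typing them by hand, which helps keep each sub_criteria attached to the intended stage.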
Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".
If named "l", groups without records from every listed data_source will be unlinked.
If named "g", groups without records from any listed data_source will be unlinked.
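One reading of the two rules above is sketched below in base R. The helper keep_group() is hypothetical and is not part of diyar; it assumes that "l" requires a group to contain records from all listed sources, while "g" requires records from at least one of them.

```r
# Hypothetical helper, not part of diyar: would a group stay linked?
# group_sources:  data sources of the records in one group
# listed_sources: data sources listed in a data_links element
# type:           the element's name, "l" (links) or "g" (groups)
keep_group <- function(group_sources, listed_sources, type = c("l", "g")) {
  type <- match.arg(type)
  present <- listed_sources %in% group_sources
  if (type == "l") all(present) else any(present)
}

# A group with records from two sources (hypothetical labels)
group_sources <- c("DS1", "DS3")

keep_l <- keep_group(group_sources, c("DS1", "DS2"), type = "l")  # "DS2" is absent
keep_g <- keep_group(group_sources, c("DS1", "DS2"), type = "g")  # "DS1" is present
```

Under this reading, the example group would be unlinked by the "l" requirement and kept by the "g" requirement.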
See vignette("links") for more information.
Value
pid; list
See Also
links_af_probabilistic; episodes; predefined_tests; sub_criteria
Examples
data(patient_records)
dfr <- patient_records
# An exact match on surname followed by an exact match on forename
stages <- as.list(dfr[c("surname", "forename")])
p1 <- links(criteria = stages)
# An exact match on forename followed by an exact match on surname
p2 <- links(criteria = rev(stages))
# Nested matches
# Same sex OR birth year
m.cri.1 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$sex,
  operator = "or")
# Same middle name AND birth years at most 10 years apart
age_diff <- function(x, y){
  diff <- abs(as.numeric(x) - as.numeric(y))
  wgt <- diff %in% 0:10 & !is.na(diff)
  wgt
}
m.cri.2 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$middlename,
  operator = "and",
  match_funcs = c(age_diff, exact_match))
# Nested match criteria 'm.cri.1' OR 'm.cri.2'
n.cri <- sub_criteria(
  m.cri.1, m.cri.2,
  operator = "or")
# Record linkage with additional match criteria
p3 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = m.cri.1,
                      cr2 = m.cri.2))
# Record linkage with additional nested match criteria
p4 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = n.cri,
                      cr2 = n.cri))
dfr$p1 <- p1; dfr$p2 <- p2
dfr$p3 <- p3; dfr$p4 <- p4
head(dfr)