links {diyar} | R Documentation |
Match records in consecutive stages with different matching criteria. Each set of linked records is assigned a unique identifier with relevant group-level information.
links(
criteria,
sub_criteria = NULL,
sn = NULL,
strata = NULL,
data_source = NULL,
data_links = "ANY",
display = "none",
group_stats = FALSE,
expand = TRUE,
shrink = FALSE,
recursive = FALSE,
check_duplicates = FALSE,
tie_sort = NULL
)
criteria
sub_criteria
sn
strata
data_source
data_links
display
group_stats
expand
shrink
recursive
check_duplicates
tie_sort
Match priority decreases with each subsequent stage of the linkage process, i.e. earlier stages (criteria) are considered superior. Therefore, it's important to list each criteria in order of decreasing relevance.
Records with a missing criteria (NA) are skipped at that stage, while records with a missing strata (NA) are skipped from the entire linkage process.
If a record is skipped at one stage, another attempt will be made to match it at the next stage. If a record does not match any other record by the end of the linkage process (or it has a missing strata), it is assigned to a unique record-group.
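The skipping rules above can be sketched in base R. This is only an illustration of the stated rules, not diyar's internal code; the vectors and the names `eligible_1`, `eligible_2` and `never_eligible` are hypothetical.

```r
# Illustrative sketch only (base R); this is not diyar's internal code.
# Hypothetical data: two stages of criteria and one strata vector.
criteria_1 <- c("A", "A", NA, "B", NA)
criteria_2 <- c("x", NA, "x", "y", "y")
strata     <- c("s1", "s1", "s1", NA, "s1")

# A record is eligible at a stage when both its criteria value for that
# stage and its strata are not missing.
eligible_1 <- !is.na(criteria_1) & !is.na(strata)
eligible_2 <- !is.na(criteria_2) & !is.na(strata)

# Records with a missing strata are never eligible, at any stage.
never_eligible <- is.na(strata)

data.frame(criteria_1, criteria_2, strata,
           eligible_1, eligible_2, never_eligible)
```

In this sketch, record 3 is skipped at the first stage but is considered again at the second, while record 4 is never considered because its strata is missing.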
A sub_criteria can be used to request additional matching conditions for each stage of the linkage process. When used, only records with a matching criteria and sub_criteria are linked.
In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a named element of a list. Each element's name must correspond to a stage. For example, 3 sub_criteria linked to criteria 1, 5 and 13 are specified as:
list("cr1" = sub_criteria(...), "cr5" = sub_criteria(...), "cr13" = sub_criteria(...))
A sub_criteria can be nested to achieve nested conditions. A sub_criteria can be linked to different criteria, but any unlinked sub_criteria will be ignored.
By default, attributes in a sub_criteria are compared for an exact_match. However, user-defined functions are also permitted. A user-defined function must meet three requirements:
1. It must be able to compare the attributes.
2. It must have two arguments named `x` and `y`, where `y` is the value for one observation being compared against all other observations (`x`).
3. It must return a logical object, i.e. TRUE or FALSE.
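A minimal sketch of a function that meets these three requirements, written in base R only. The name `within_5_years` and the 5-year threshold are hypothetical choices for illustration.

```r
# Illustrative sketch (base R only); `within_5_years` is a hypothetical name.
# `x`: attribute values of the observations being compared against.
# `y`: the value for the one observation being compared.
within_5_years <- function(x, y) {
  abs(y - x) <= 5
}

age <- c(30, 28, 40, 25, 33)

# Compare the first observation (age 30) against all observations.
res <- within_5_years(x = age, y = age[1])
res
is.logical(res)
```

A function of this shape would be supplied through the match_funcs argument of sub_criteria(), in the same way as `f1` in the examples on this page.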
Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".
If named "l", only groups with records from every listed data_source will remain linked.
If named "g", only groups with records from at least one listed data_source will remain linked.
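The difference between "l" and "g" can be sketched in base R as follows. `stays_linked` is a hypothetical helper written only to illustrate the two rules; it is not part of diyar, and the data_source values are made up.

```r
# Illustrative sketch only (base R); `stays_linked` is a hypothetical helper,
# not part of diyar.
# `group_sources`: data_source values of the records in one group.
# `listed`: the data_source values listed in one element of data_links.
# `type`: the element's name, "l" (links) or "g" (groups).
stays_linked <- function(group_sources, listed, type = "l") {
  if (type == "l") {
    all(listed %in% group_sources)   # every listed source is present
  } else {
    any(listed %in% group_sources)   # at least one listed source is present
  }
}

group_a <- c("clinic", "lab")
group_b <- c("clinic", "clinic")

a_l <- stays_linked(group_a, listed = c("clinic", "lab"), type = "l")
b_l <- stays_linked(group_b, listed = c("clinic", "lab"), type = "l")
b_g <- stays_linked(group_b, listed = c("clinic", "lab"), type = "g")
c(a_l = a_l, b_l = b_l, b_g = b_g)
```

Under these assumptions, the corresponding argument in links would be written as data_links = list(l = c("clinic", "lab")) or data_links = list(g = c("clinic", "lab")).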
See vignette("links") for more information.
pid; list
link_records; episodes; partitions; predefined_tests; sub_criteria; schema
library(diyar)

# Exact match
attr_1 <- c(1, 1, 1, NA, NA, NA, NA, NA)
attr_2 <- c(NA, NA, 2, 2, 2, NA, NA, NA)
links(criteria = list(attr_1, attr_2))
# User-defined tests using `sub_criteria()`
# Matching `sex` and a 20-year age range
age <- c(30, 28, 40, 25, 25, 29, 27)
sex <- c("M", "M", "M", "F", "M", "M", "F")
f1 <- function(x, y) abs(y - x) %in% 0:20
links(criteria = sex,
      sub_criteria = list(cr1 = sub_criteria(age, match_funcs = f1)))
# Multistage matches
# Relevance of matches: `forename` > `surname`
data(staff_records); staff_records
links(criteria = list(staff_records$forename, staff_records$surname),
      data_source = staff_records$sex)
# Relevance of matches:
# `staff_id` > `age` (AND (`initials`, `hair_colour` OR `branch_office`))
data(missing_staff_id); missing_staff_id
links(criteria = list(missing_staff_id$staff_id, missing_staff_id$age),
      sub_criteria = list(cr2 = sub_criteria(missing_staff_id$initials,
                                             missing_staff_id$hair_colour,
                                             missing_staff_id$branch_office)),
      data_source = missing_staff_id$source_1)
# Group expansion
match_cri <- list(c(1, NA, NA, 1, NA, NA),
                  c(1, 1, 1, 2, 2, 2),
                  c(3, 3, 3, 2, 2, 2))
links(criteria = match_cri, expand = TRUE)
links(criteria = match_cri, expand = FALSE)
links(criteria = match_cri, shrink = TRUE)