R: Multistage record linkage

links {diyar}

R Documentation

Multistage record linkage

Description

Assign records to unique groups based on an ordered set of match criteria.

Usage

links(
  criteria,
  sub_criteria = NULL,
  sn = NULL,
  strata = NULL,
  data_source = NULL,
  data_links = "ANY",
  display = "none",
  group_stats = FALSE,
  expand = TRUE,
  shrink = FALSE,
  recursive = "none",
  check_duplicates = FALSE,
  tie_sort = NULL,
  batched = "yes",
  repeats_allowed = FALSE,
  permutations_allowed = FALSE,
  ignore_same_source = FALSE
)

Arguments

`criteria`	`[list|atomic]`. Ordered list of attributes to be compared. Each element of the list is a stage in the linkage process. See `Details`.
`sub_criteria`	`[list|sub_criteria]`. Nested match criteria. Each must be paired with a stage of the linkage process (`criteria`). See `sub_criteria`.
`sn`	`[integer]`. Unique record ID.
`strata`	`[atomic]`. Subsets of the dataset. Record-groups are created separately for each `strata`. See `Details`.
`data_source`	`[character]`. Source ID for each record. If provided, a list of all sources in each record-group is returned. See `pid_dataset slot`.
`data_links`	`[list|character]`. `data_source` required in each `pid`. A record-group without records from these data sources will be `unlinked`. See `Details`.
`display`	`[character]`. Display progress updates and/or generate a linkage report for the analysis. Options are: `"none"` (default), `"progress"`, `"stats"`, `"none_with_report"`, `"progress_with_report"` or `"stats_with_report"`.
`group_stats`	`[character]`. A selection of group-specific information to be returned for each record-group. Most are added to slots of the `pid` object. Options are `NULL` or any combination of `"XX"`, `"XX"` and `"XX"`.
`expand`	`[logical]`. If `TRUE`, a record-group gains new records if a match is found at the next stage of the linkage process. Not interchangeable with `shrink`.
`shrink`	`[logical]`. If `TRUE`, a record-group loses existing records if no match is found at the next stage of the linkage process. Not interchangeable with `expand`.
`recursive`	`[logical]`. If `TRUE`, within each iteration of the process, a match can spawn new matches. Ignored when `batched` is `"no"`.
`check_duplicates`	`[logical]`. If `TRUE`, within each iteration of the process, duplicate values of an attribute are not checked. Instead, the outcome of the logical test on the first instance of a value is recycled for its duplicates. Ignored when `batched` is `"no"`.
`tie_sort`	`[atomic]`. Preferential order for breaking match ties within an iteration of record linkage.
`batched`	`[character]`. Determines if record-pairs are created and compared in batches. Options are `"yes"`, `"no"` or `"semi"`.
`repeats_allowed`	`[logical]`. If `TRUE`, pairs made up of repeat records are not created and compared. Only used when `batched` is `"no"`.
`permutations_allowed`	`[logical]`. If `TRUE`, permutations of record-pairs are created and compared. Only used when `batched` is `"no"`.
`ignore_same_source`	`[logical]`. If `TRUE`, only record-pairs from different `data_source` values are created and compared.
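
The difference between `expand` and `shrink` is easiest to see on a tiny dataset. The following is a minimal sketch, assuming the diyar package is installed; the vectors `surname` and `forename` are made up for illustration and are not part of the package. Group IDs are read from the data part of the returned `pid` object.

```r
library(diyar)

# Hypothetical toy data (not from the package)
surname  <- c("Smith", "Smith", "Smith")
forename <- c("Ann",   "Ann",   "Bob")
stages   <- list(surname = surname, forename = forename)

# Default behaviour (expand = TRUE): groups formed at stage 1 can only
# gain records at stage 2
p_expand <- links(criteria = stages)

# shrink = TRUE: a record leaves its stage 1 group if it does not
# also match at stage 2
p_shrink <- links(criteria = stages, shrink = TRUE)

# Record positions in each record-group
split(seq_along(surname), p_expand@.Data)
split(seq_along(surname), p_shrink@.Data)
```

With `shrink = TRUE`, the third record shares a surname with the others but not a forename, so it should end up in a record-group of its own.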

Details

The priority of matches decreases with each subsequent stage of the linkage process. Therefore, the attributes in criteria should be listed in order of decreasing relevance.

Records with missing data (NA) for a criteria are skipped at the corresponding stage, while records with a missing strata (NA) are skipped at every stage.

If a record is skipped from a stage, another attempt will be made to match the record at the next stage. If a record is still unmatched by the last stage, it is assigned a unique group ID.
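
The skipping rules above can be tried on a small, made-up dataset. This is a minimal sketch, assuming the diyar package is installed; `surname`, `forename` and `region` are hypothetical vectors, not package data.

```r
library(diyar)

# Hypothetical toy data (not from the package)
surname  <- c("Smith", "Smith", NA,    "Jones", "Lee")
forename <- c("Ann",   "Bob",   "Cat", "Cat",   "Dan")
stages   <- list(surname = surname, forename = forename)

# Stage 1 compares surname, stage 2 compares forename
p   <- links(criteria = stages)
ids <- p@.Data

# Records 1 and 2 share a surname, so they are linked at stage 1
ids[1] == ids[2]
# Record 3 is skipped at stage 1 (missing surname) but is tried again
# at stage 2, where it shares a forename with record 4
ids[3] == ids[4]
# Record 5 matches nothing and keeps a group ID of its own
sum(ids == ids[5]) == 1

# A record with a missing strata is skipped at every stage
region <- c("N", "N", "S", NA, "S")  # hypothetical strata
p_s    <- links(criteria = stages, strata = region)
ids_s  <- p_s@.Data
# Record 4 has no strata, so it is never compared and stays on its own
sum(ids_s == ids_s[4]) == 1
```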

A sub_criteria adds nested match criteria to each stage of the linkage process. If used, only records with a matching criteria and sub_criteria are linked.

In links, each sub_criteria must be linked to a criteria. This is done by adding each sub_criteria to a named element of a list, where the name is "cr" concatenated with the corresponding stage's number. For example, 3 sub_criteria linked to criteria 1, 5 and 13 will be:

list(cr1 = sub_criteria(...), cr5 = sub_criteria(...), cr13 = sub_criteria(...))

Any unlinked sub_criteria will be ignored.
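
The "cr" naming convention can be sketched with a single stage. This assumes the diyar package is installed and uses the default match function of `sub_criteria` (an exact match); `surname` and `sex` are made-up vectors for illustration.

```r
library(diyar)

# Hypothetical toy data (not from the package)
surname <- c("Smith", "Smith", "Smith")
sex     <- c("F", "F", "M")

# Stage 1 alone: all three records share a surname
p_plain <- links(criteria = list(surname = surname))

# Stage 1 with a nested criteria: records must also have the same sex.
# The name "cr1" pairs the sub_criteria with the first stage.
p_nested <- links(
  criteria = list(surname = surname),
  sub_criteria = list(cr1 = sub_criteria(sex)))

# Number of record-groups without and with the sub_criteria
length(unique(p_plain@.Data))
length(unique(p_nested@.Data))
```

Because the third record differs on `sex`, it should be left out of the record-group formed by the first two records once the sub_criteria is applied.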

Every element in data_links must be named "l" (links) or "g" (groups). Unnamed elements of data_links will be assumed to be "l".

If named "l", groups without records from every listed data_source will be unlinked.
If named "g", groups without records from any listed data_source will be unlinked.
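
The two naming options for data_links can be compared side by side. This is a rough sketch, assuming the diyar package is installed and that data_links behaves as described above; `id` and `src` are hypothetical vectors, not package data.

```r
library(diyar)

# Hypothetical toy data (not from the package)
id  <- c("x", "x", "y", "y", "w", "w")
src <- c("A", "B", "A", "A", "B", "C")

# "l": keep a record-group only if it has records from every listed source
p_l <- links(criteria = list(id = id), data_source = src,
             data_links = list(l = c("A", "B")))

# "g": keep a record-group if it has records from any listed source
p_g <- links(criteria = list(id = id), data_source = src,
             data_links = list(g = c("A", "B")))

# Record positions in each record-group under each option
split(seq_along(id), p_l@.Data)
split(seq_along(id), p_g@.Data)
```

Under "l", only the "x" records cover both sources, so the "y" and "w" records are expected to be unlinked. Under "g", every record-group contains at least one of the listed sources and is expected to stay intact.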

See vignette("links") for more information.

Value

A pid object or a list.

Examples

data(patient_records)
dfr <- patient_records
# An exact match on surname followed by an exact match on forename
stages <- as.list(dfr[c("surname", "forename")])
p1 <- links(criteria = stages)

# An exact match on forename followed by an exact match on surname
p2 <- links(criteria = rev(stages))

# Nested matches
# Same sex OR birth year
m.cri.1 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$sex,
  operator = "or")

# Same middle name AND birth years at most 10 years apart
age_diff <- function(x, y){
  diff <- abs(as.numeric(x) - as.numeric(y))
  wgt <-  diff %in% 0:10 & !is.na(diff)
  wgt
}
m.cri.2 <- sub_criteria(
  format(dfr$dateofbirth, "%Y"), dfr$middlename,
  operator = "and",
  match_funcs = c(age_diff, exact_match))

# Nested match criteria 'm.cri.1' OR 'm.cri.2'
n.cri <- sub_criteria(
  m.cri.1, m.cri.2,
  operator = "or")

# Record linkage with additional match criteria
p3 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = m.cri.1,
                      cr2 = m.cri.2))

# Record linkage with additional nested match criteria
p4 <- links(
  criteria = stages,
  sub_criteria = list(cr1 = n.cri,
                      cr2 = n.cri))

dfr$p1 <- p1; dfr$p2 <- p2
dfr$p3 <- p3; dfr$p4 <- p4

head(dfr)

[Package diyar version 0.5.1 Index]