multifileRL {bstrl}    R Documentation

Perform multifile record linkage via Gibbs sampling "from scratch"

Description

Perform multifile record linkage via Gibbs sampling "from scratch"

Usage

multifileRL(
  files,
  flds = NULL,
  types = NULL,
  breaks = c(0, 0.25, 0.5),
  nIter = 1000,
  burn = round(nIter * 0.1),
  a = 1,
  b = 1,
  aBM = 1,
  bBM = 1,
  proposals = c("component", "LB"),
  blocksize = NULL,
  seed = 0,
  refresh = 0.1,
  maxtime = Inf
)

Arguments

files

A list of files

flds

Vector of names of the fields on which to compare the records in each file

types

Types of comparisons to use for each field

breaks

Breaks to use for Levenshtein distance on string fields

nIter, burn

MCMC run length parameters. The returned number of samples is nIter - burn.

a, b

Prior parameters for m and u, respectively.

aBM, bBM

Prior parameters for beta-linkage prior.

proposals

Which kind of full conditional proposals to use for the link vectors: "component" or "LB" (locally balanced).

blocksize

Block size to use for locally balanced (LB) proposals. By default, LB proposals are not blocked.

seed

Random seed to set at the beginning of the MCMC run.

refresh

How often to output an update including the iteration number and percent complete. If refresh >= 1, taken as a number of iterations between messages (rounded). If 0 < refresh < 1, taken as the proportion of nIter. If refresh == 0, no messages are displayed.

maxtime

Amount of time, in seconds, after which the sampler terminates with however many samples it has produced up to that point. The sample matrix columns for any unproduced samples are filled with NA.
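
The run-length and progress arguments above involve a little arithmetic. The base R sketch below works through it using the defaults from the Usage section. Note that refresh_interval is a hypothetical helper written only for illustration; it is not part of bstrl, and the rounding applied in the proportional case is an assumption.

```r
# Default run-length settings from the Usage section
nIter <- 1000
burn <- round(nIter * 0.1)

# Number of samples returned, per the nIter/burn description
n_returned <- nIter - burn

# Hypothetical helper (not part of bstrl): iterations between progress
# messages implied by the documented rules for `refresh`
refresh_interval <- function(refresh, nIter) {
  if (refresh == 0) return(NA_real_)        # no messages are displayed
  if (refresh >= 1) return(round(refresh))  # a number of iterations, rounded
  round(refresh * nIter)                    # a proportion of nIter (rounding assumed)
}

n_returned
refresh_interval(0.1, nIter)
refresh_interval(250, nIter)
```

Under these assumptions, the default refresh = 0.1 corresponds to one message every tenth of the run, while a value such as 250 is read directly as an iteration count.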

Value

An object of class "bstrlstate"

Examples

data(geco_small)

# Names of the columns on which to perform linkage
fieldnames <- c("given.name", "surname", "age", "occup",
                "extra1", "extra2", "extra3", "extra4", "extra5", "extra6")

# How to compare each of the fields
# First name and last name use normalized edit distance
# All others binary equal/unequal
types <- c("lv", "lv",
           "bi", "bi", "bi", "bi", "bi", "bi", "bi", "bi")
# Break continuous difference measures into 4 levels using these split points
breaks <- c(0, 0.25, 0.5)

# Three file linkage using first three files in example dataset
multifile.result <- multifileRL(geco_small[1:3],
                                flds = fieldnames, types = types, breaks = breaks,
                                nIter = 2, burn = 1, # Very small run for example
                                proposals = "component",
                                seed = 0)
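
The comments in the example above say that string fields use a normalized edit distance, which the three split points in breaks divide into 4 levels. The base R sketch below illustrates one way such a discretization could work. It is not taken from the bstrl source: norm_lev and to_level are hypothetical helpers, dividing by the longer string's length is an assumed normalization, and treating each interval as closed on the right is also an assumption.

```r
# Hypothetical helper: Levenshtein distance scaled to [0, 1].
# Assumed normalization: divide by the length of the longer string.
norm_lev <- function(x, y) {
  d <- utils::adist(x, y)[1, 1]
  d / max(nchar(x), nchar(y))
}

breaks <- c(0, 0.25, 0.5)

# Hypothetical helper: map distances to levels 1..(length(breaks) + 1).
# Intervals are (-Inf, 0], (0, 0.25], (0.25, 0.5], (0.5, Inf) (assumed).
to_level <- function(d, breaks) {
  cut(d, breaks = c(-Inf, breaks, Inf), labels = FALSE)
}

d <- c(norm_lev("john", "john"),   # identical strings
       norm_lev("john", "jon"),    # one deletion out of 4 characters
       norm_lev("anne", "hanna"),  # two edits out of 5 characters
       norm_lev("anna", "bob"))    # no characters in common
to_level(d, breaks)
```

With these assumptions, exact agreement falls in the first level and increasingly dissimilar strings fall in higher levels, so three split points yield four comparison levels.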


[Package bstrl version 1.0.2 Index]