ProbabilisticLinkage {PPRL}    R Documentation

Probabilistic Record Linkage

Description

Probabilistic Record Linkage of two data sets using distance-based or probabilistic methods.

Usage

ProbabilisticLinkage(IDA, dataA, IDB, dataB, blocking = NULL, similarity)

Arguments

IDA

A character vector or integer vector containing the IDs of the first data.frame.

dataA

The first data.frame, containing the data to be linked and all blocking and linking variables specified via SelectBlockingFunction and SelectSimilarityFunction.

IDB

A character vector or integer vector containing the IDs of the second data.frame.

dataB

The second data.frame, containing the data to be linked and all blocking and linking variables specified via SelectBlockingFunction and SelectSimilarityFunction.

blocking

Optional blocking variables, as specified in SelectBlockingFunction. Defaults to NULL.

similarity

Variables used for linking and their respective linkage methods as specified in SelectSimilarityFunction.
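To give a feel for what a distance-based linkage method computes, the following minimal sketch turns the Levenshtein edit distance into a similarity between 0 and 1, using only base R (utils::adist). The helper name lev_sim and the normalisation by the longer string are assumptions made for illustration; they are not taken from PPRL and may differ from what method = "lv" does internally.

```r
# Hypothetical helper, not part of PPRL: normalised Levenshtein similarity.
# 1 means identical strings, 0 means every character differs.
lev_sim <- function(a, b) {
  # element-wise edit distance between a[i] and b[i]
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b,
              USE.NAMES = FALSE)
  # normalise by the length of the longer string (an assumption)
  len <- pmax(nchar(a), nchar(b))
  ifelse(len == 0, 1, 1 - d / len)
}

lev_sim("Meier", "Meyer")   # 0.8: one substitution in five characters
lev_sim("Meier", "Meier")   # 1
```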

Details

Before calling ProbabilisticLinkage, the linking variables and their comparison methods must be set up with SelectSimilarityFunction. Blocking variables, set up with SelectBlockingFunction, are optional. Both functions offer further options. The linkage follows the Fellegi-Sunter model, with the weights estimated by the EM algorithm (Winkler 1988).
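The idea behind EM weight estimation in the Fellegi-Sunter model can be sketched in a few lines of base R. This is an illustrative toy version, not the implementation used by PPRL: it assumes binary agree/disagree comparison patterns and conditional independence of the fields, and the names fs_em, gamma, m, u and p are hypothetical. Here m[j] is the probability that field j agrees among true matches, u[j] the same probability among non-matches, and p the share of matches among all compared pairs.

```r
# Toy EM for Fellegi-Sunter weights (illustrative sketch, not PPRL code).
# gamma: n x k matrix of 0/1 agreement indicators, one row per record pair.
fs_em <- function(gamma, p = 0.1, m = rep(0.9, ncol(gamma)),
                  u = rep(0.1, ncol(gamma)), max_iter = 200, tol = 1e-8) {
  eps <- 1e-6
  clamp <- function(x) pmin(pmax(x, eps), 1 - eps)
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that each pair is a match
    lik_m <- apply(gamma, 1, function(g) prod(m^g * (1 - m)^(1 - g)))
    lik_u <- apply(gamma, 1, function(g) prod(u^g * (1 - u)^(1 - g)))
    w <- p * lik_m / (p * lik_m + (1 - p) * lik_u)

    # M-step: re-estimate p, m and u from the weighted pairs
    p_new <- clamp(mean(w))
    m_new <- clamp(colSums(w * gamma) / sum(w))
    u_new <- clamp(colSums((1 - w) * gamma) / sum(1 - w))

    converged <- max(abs(c(p_new - p, m_new - m, u_new - u))) < tol
    p <- p_new; m <- m_new; u <- u_new
    if (converged) break
  }
  list(p = p, m = m, u = u,
       agree_weight = log2(m / u),                 # added when a field agrees
       disagree_weight = log2((1 - m) / (1 - u)),  # added when it disagrees
       posterior = w)
}

# Simulated comparison patterns for three fields (made-up parameters)
set.seed(1)
n <- 2000
is_match <- runif(n) < 0.2
true_m <- c(0.95, 0.90, 0.85)
true_u <- c(0.10, 0.20, 0.05)
gamma <- sapply(1:3, function(j)
  rbinom(n, 1, ifelse(is_match, true_m[j], true_u[j])))

fit <- fs_em(gamma)
round(fit$agree_weight, 2)
round(fit$disagree_weight, 2)
```

The total weight of a record pair is the sum of the agreement or disagreement weights over its fields; pairs with a high total are classified as matches. Similarity measures that return graded values rather than 0/1 need a more general comparison step than this sketch covers.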

Value

A data.frame containing pairs of IDs, their corresponding similarity value and the match status as determined by the linkage procedure.

Source

Christen, P. (2012): Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.

Schnell, R., Bachteler, T., Reiher, J. (2004): A toolbox for record linkage. Austrian Journal of Statistics 33(1-2): 125-133.

Winkler, W. E. (1988): Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods, American Statistical Association: 667-671.

See Also

CreateBF, CreateCLK, DeterministicLinkage, SelectBlockingFunction, SelectSimilarityFunction, StandardizeString

Examples

# load test data
testFile <- file.path(path.package("PPRL"), "extdata/testdata.csv")
testData <- read.csv(testFile, header = FALSE, sep = "\t",
  colClasses = "character")

# define year of birth (V3) as blocking variable
bl <- SelectBlockingFunction("V3", "V3", method = "exact")

# Select first name and last name as linking variables,
# to be linked using the Jaro-Winkler similarity measure (first name)
# and Levenshtein distance (last name)
l1 <- SelectSimilarityFunction("V7", "V7", method = "jw")
l2 <- SelectSimilarityFunction("V8", "V8", method = "lv")

# Link the data as specified in bl and l1/l2
# (in this small example data is linked to itself)
res <- ProbabilisticLinkage(testData$V1, testData,
  testData$V1, testData, similarity = c(l1, l2), blocking = bl)


[Package PPRL version 0.3.8 Index]