no_dup_data_small {multilink}    R Documentation

Small No Duplicate Dataset

Description

A dataset containing 71 simulated records from 3 files, with no duplicate records within any file. It is a subset of no_dup_data.

Usage

no_dup_data_small

Format

A list with three elements:

records

A data.frame containing the records from all three files, with 7 fields, in the input format expected by create_comparison_data.

file_sizes

The number of records in each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
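
The relationship between the three elements can be sketched on a toy object. The field names and values below are hypothetical and are not taken from the package data. The sketch also assumes that records are stacked file by file, in the order given by file_sizes.

```r
# Toy stand-in with the same three-element layout (hypothetical values,
# not taken from the package data).
toy <- list(
  records = data.frame(
    given_name = c("ann", "bob", "anne", "carl", "bob"),
    age        = c(31, 45, 31, 52, 45),
    stringsAsFactors = FALSE
  ),
  file_sizes = c(2, 2, 1),
  IDs        = c(1L, 2L, 1L, 3L, 2L)
)

# Assumption: records are stacked file by file, so the first file_sizes[1]
# rows belong to file 1, the next file_sizes[2] rows to file 2, and so on.
file_of_record <- rep(seq_along(toy$file_sizes), times = toy$file_sizes)

# There is one row and one label per record.
sizes_agree <- nrow(toy$records) == sum(toy$file_sizes) &&
  length(toy$IDs) == sum(toy$file_sizes)

# "No duplicates within a file" means no label repeats inside any one file.
no_within_file_dups <- !any(duplicated(data.frame(file_of_record, toy$IDs)))
```

Under the same stacking assumption, replacing toy with no_dup_data_small should run the same checks on the packaged data.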

Source

Extracted from the datasets used in the simulation study of Aleshin-Guendel & Sadinle (2022). The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php.

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]

Examples

data(no_dup_data_small)

# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))
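
The labels in IDs can also be tabulated to see how many records refer to each entity. The following is a minimal sketch on a hypothetical label vector, not on the packaged data. The same two calls should apply to no_dup_data_small$IDs.

```r
# Hypothetical label vector standing in for no_dup_data_small$IDs.
ids <- c(4L, 9L, 4L, 7L, 9L, 4L)

# Number of records carrying each label (the size of each entity's cluster).
cluster_sizes <- table(ids)

# Number of entities that have 1, 2, 3, ... records.
size_distribution <- table(cluster_sizes)
```

Because no file contains duplicates and there are 3 files, no cluster in no_dup_data_small can contain more than 3 records.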

[Package multilink version 0.1.1 Index]