dup_data_small {multilink}    R Documentation

Small Duplicate Dataset

Description

A dataset of 96 simulated records from 3 files, with no duplicate records within any file. It is a subset of dup_data.

Usage

dup_data_small

Format

A list with three elements:

records

A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.

file_sizes

The size of each file.

IDs

The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
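
The relationship between the three elements can be sketched as follows. This is a minimal illustration, not part of the package: it assumes the rows of records are stacked file by file in the order given by file_sizes (the layout expected by create_comparison_data), and file_index and records_by_file are hypothetical helper names introduced here.

```r
library(multilink)
data(dup_data_small)

# Top-level structure: records, file_sizes, IDs
str(dup_data_small, max.level = 1)

# records and IDs both have one entry per record across all files
stopifnot(
  nrow(dup_data_small$records) == sum(dup_data_small$file_sizes),
  length(dup_data_small$IDs) == sum(dup_data_small$file_sizes)
)

# Assumption: rows are stacked file by file, so the file of each record
# can be recovered by repeating the file number file_sizes[k] times
file_index <- rep(seq_along(dup_data_small$file_sizes),
                  times = dup_data_small$file_sizes)

# Split the stacked data.frame back into one data.frame per file
records_by_file <- split(dup_data_small$records, file_index)
sapply(records_by_file, nrow)
```

If the stacking assumption holds, the per-file row counts returned by the last line should match file_sizes.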

Source

Extracted from the datasets used in the simulation study of Aleshin-Guendel & Sadinle (2022). The datasets were generated using code from Peter Christen's group (https://dmm.anu.edu.au/geco/index.php).

References

Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. doi: 10.1080/01621459.2021.2013242

Examples

data(dup_data_small)

# Count the distinct entities represented in the records
length(unique(dup_data_small$IDs))
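
The true partition in IDs can also be used to check how records group into entities, both overall and within each file. The following is a hedged sketch rather than an official example: it again assumes records are stacked file by file in the order of file_sizes, and ids, file_index, and within_file_dups are names introduced here for illustration.

```r
library(multilink)
data(dup_data_small)

ids <- dup_data_small$IDs
file_sizes <- dup_data_small$file_sizes

# Assumption: records are stacked file by file, in the order of file_sizes
file_index <- rep(seq_along(file_sizes), times = file_sizes)

# Distribution of cluster sizes: how many entities have 1, 2, 3, ... records
table(table(ids))

# For each file, count records whose entity label already appeared
# earlier in that same file (0 means no within-file duplicates)
within_file_dups <- tapply(ids, file_index, function(x) sum(duplicated(x)))
within_file_dups
```

A value of 0 for every file in within_file_dups would be consistent with files that contain no duplicate records.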

[Package multilink version 0.1.1 Index]