no_dup_data_small {multilink} | R Documentation |
Small No Duplicate Dataset
Description
A dataset containing 71 simulated records from 3 files with no duplicate records in each file, subset from no_dup_data.
Usage
no_dup_data_small
Format
A list with three elements:
- records
A data.frame with the records, containing 7 fields, from all three files, in the format used for input to create_comparison_data.
- file_sizes
The size of each file.
- IDs
The true partition of the records, represented as an integer vector of arbitrary labels of length sum(file_sizes).
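The relationship between the three elements can be sketched on a toy stand-in. The object toy below is hypothetical (invented field names and values, not the packaged data), and the sketch assumes the records are stacked file by file, so that the first file_sizes[1] rows belong to file 1, and so on. Under that assumption, "no duplicates within a file" means that no label in IDs repeats inside a single file.

```r
# Toy stand-in with the same three-element layout as no_dup_data_small
# (hypothetical fields and values, not the packaged data)
toy <- list(
  records = data.frame(
    given_name = c("ana", "bob", "anna", "cy", "bob"),
    age = c(31, 45, 31, 60, 45)
  ),
  file_sizes = c(2, 2, 1),
  IDs = c(1L, 2L, 1L, 3L, 2L)
)

# One row of records and one label in IDs per record
stopifnot(nrow(toy$records) == sum(toy$file_sizes),
          length(toy$IDs) == sum(toy$file_sizes))

# Assumed stacking order: rows of file 1 first, then file 2, and so on
file_index <- rep(seq_along(toy$file_sizes), times = toy$file_sizes)

# TRUE for any file in which some entity label appears more than once
dup_within_file <- tapply(toy$IDs, file_index, anyDuplicated) > 0

# Number of distinct entities across all files
n_entities <- length(unique(toy$IDs))
```

The same checks should apply to no_dup_data_small by replacing toy with the loaded dataset, provided the stacking assumption holds for it.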
Source
Extracted from the datasets used in the simulation study of Aleshin-Guendel & Sadinle (2022). The datasets were generated using code from Peter Christen's group: https://dmm.anu.edu.au/geco/index.php
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]
Examples
data(no_dup_data_small)
# Count the distinct entities represented in the 71 records
length(unique(no_dup_data_small$IDs))