RA {ludic} | R Documentation |
Anonymized binarized diagnosis codes from RA study.
Description
An anonymized version of the binarized diagnosis code data from the RA1 and RA2 datasets, over both 6-year and 11-year time span.
Usage
data(RA)
Format
5 objects
RA1_6y
: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 26,681 patients from the dataset RA1 recorded over a 6-year time span.RA2_6y
: an integer matrix of 0s and 1s containing 4,936 renamed diagnosis codes for 5,707 patients from the dataset RA2 recorded over a 6-year time span.RA1_11y
: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 26,687 patients from the dataset RA1 recorded over a 11-year time span.RA2_11y
: an integer matrix of 0s and 1s containing 5,593 renamed diagnosis codes for 6,394 patients from the dataset RA2 recorded over a 11-year time span.silverstandard_truematches
: a character matrix with two columns containing the identifiers of the 3,831 pairs of silver-standard matches.
Details
The ICD-9 diagnosis codes have also been masked and randomly reordered, replaced by meaningless names. Finally, the silver-standard matching pairs are also provided to allow the benchmarking of methods for probabilistic record linkage using diagnosis codes.
References
Hejblum BP, Weber G, Liao KP, Palmer N, Churchill S, Szolovits P, Murphy S, Kohane I and Cai T, Probabilistic Record Linkage of De-Identified Research Datasets Using Diagnosis Codes, Scientific Data, 6:180298 (2019). doi: 10.1038/sdata.2018.298.
Liao, K. P. et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis Care & Research 62, 1120-1127 (2010). doi: 10.1002/acr.20184
Liao, K. P. et al. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS ONE 10, e0136651 (2015). doi: 10.1371/journal.pone.0136651
Examples
if(interactive()){
rm(list=ls())
library(ludic)
data(RA)
res_match_6y <- recordLink(data1 = RA1_6y, data2 = RA2_6y,
eps_plus = 0.01, eps_minus = 0.01,
aggreg_2ways ="mean",
min_prev = 0,
use_diff = FALSE)
res_match_11y <- recordLink(data1 = RA1_11y, data2 = RA2_11y,
eps_plus = 0.01, eps_minus = 0.01,
aggreg_2ways ="mean",
min_prev = 0,
use_diff = FALSE)
print.res_matching <- function(res, threshold=0.9, ref=silverstandard_truematches){
have_match_row <- rowSums(res>threshold)
have_match_col <- colSums(res>threshold)
bestmatched_pairs_all <- cbind.data.frame(
"D1"=rownames(res)[apply(res[,which(have_match_col>0), drop=FALSE], 2, which.max)],
"D2"=names(have_match_col)[which(have_match_col>0)]
)
nTM_all <- nrow(ref)
nP_all <- nrow(bestmatched_pairs_all)
TPR_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="")
%in% apply(ref, 1, paste0, collapse=""))/nTM_all
PPV_all <- sum(apply(bestmatched_pairs_all, 1, paste0, collapse="")
%in% apply(ref, 1, paste0, collapse=""))/nP_all
cat("threshold: ", threshold,
"\nnb matched: ", nP_all,"; nb true matches: ", nTM_all,
"\nTPR: ", TPR_all, "; PPV: ", PPV_all, "\n\n", sep="")
}
print.res_matching(res_match_6y)
print.res_matching(res_match_11y)
}