R: Document-level matching of bibliographic datasets

biblioverlap {biblioverlap}

R Documentation

Document-level matching of bibliographic datasets

Description

This function identifies documents that occur in more than one bibliographic dataset and records the overlap by giving matching documents the same Universally Unique Identifier (UUID).

Usage

biblioverlap(
  db_list,
  matching_fields = default_matching_fields,
  n_threads = 1,
  ti_penalty = 0.1,
  ti_max = 0.6,
  so_penalty = 0.1,
  so_max = 0.3,
  au_penalty = 0.1,
  au_max = 0.3,
  py_max = 0.3,
  score_cutoff = 1
)

Arguments

`db_list`	list of dataframes containing the sets of bibliographic data
`matching_fields`	Names of the five columns used in the matching. The column names must be the same across all datasets and must be provided as a named list with the following names: DI (unique identifier), TI (document title), PY (publication year), SO (publication source) and AU (authors). Default values come from The Lens scholar field definition.
`n_threads`	number of (logical) cores used in the matching procedures. Default: 1
`ti_penalty`	penalty applied for each increment in Title's Levenshtein distance. Default: 0.1
`ti_max`	max score value for Title. Default: 0.6
`so_penalty`	penalty applied for each increment in Source's Levenshtein distance. Default: 0.1
`so_max`	max score value for Source. Default: 0.3
`au_penalty`	penalty applied for each increment in Author's Levenshtein distance. Default: 0.1
`au_max`	max score value for Author. Default: 0.3
`py_max`	max score value for Publication Year. Default: 0.3
`score_cutoff`	minimum final score for a valid match between two documents. Default: 1
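
The penalty, maximum and cutoff arguments can be illustrated with a small base R sketch. It is not the package's internal code. It assumes that each text field scores its maximum minus the penalty times the Levenshtein distance (floored at zero), that the publication year scores `py_max` only when the two years are equal, and that the field scores are summed before being compared with `score_cutoff`. The helper `field_score()` and the two records are hypothetical, and the handling of the DI field is not covered.

```r
# Hypothetical helper (assumption): score = max - penalty * Levenshtein distance,
# floored at zero. utils::adist() computes the Levenshtein distance.
field_score <- function(a, b, max_score, penalty) {
  dist <- as.vector(utils::adist(a, b))
  max(0, max_score - penalty * dist)
}

# Two made-up records that differ by one character in title and first author
rec1 <- list(TI = "Gene flow in bees", SO = "Journal of Ecology",
             AU = "Silva, A.", PY = 2020)
rec2 <- list(TI = "Gene flow in bee", SO = "Journal of Ecology",
             AU = "Silva, A", PY = 2020)

ti_score <- field_score(rec1$TI, rec2$TI, max_score = 0.6, penalty = 0.1)
so_score <- field_score(rec1$SO, rec2$SO, max_score = 0.3, penalty = 0.1)
au_score <- field_score(rec1$AU, rec2$AU, max_score = 0.3, penalty = 0.1)
py_score <- if (rec1$PY == rec2$PY) 0.3 else 0  # assumed all-or-nothing

total_score <- ti_score + so_score + au_score + py_score
is_match <- total_score >= 1  # compared against score_cutoff = 1
```

Under these assumptions the default maxima add up to 1.5, so with the default `score_cutoff` of 1 a pair can lose up to 0.5 points across all fields and still count as a match.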

Details

In this procedure, duplicates within the same dataset are removed first. Then, a Universally Unique Identifier (UUID) is assigned to each record. If a match is found between two documents in a pairwise comparison, the UUID of the record from the first dataset is copied to the matching record in the second.
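
The three steps above can be sketched in base R on toy data. This is an illustration only, not the package's implementation: the datasets are made up, records are matched on an exact DOI rather than on a score, and plain labels such as `db1-1` stand in for real UUIDs.

```r
# Toy datasets (made-up values); db1 contains one internal duplicate
db1 <- data.frame(DOI = c("10.1/a", "10.1/b", "10.1/b"), stringsAsFactors = FALSE)
db2 <- data.frame(DOI = c("10.1/b", "10.1/c"), stringsAsFactors = FALSE)

# Step 1: remove duplicates within each dataset
db1 <- db1[!duplicated(db1$DOI), , drop = FALSE]
db2 <- db2[!duplicated(db2$DOI), , drop = FALSE]

# Step 2: give every record its own identifier (stand-in for a UUID)
db1$UUID <- paste0("db1-", seq_len(nrow(db1)))
db2$UUID <- paste0("db2-", seq_len(nrow(db2)))

# Step 3: where a record of db2 matches a record of db1,
# copy the identifier from the first dataset to the second
hit <- match(db2$DOI, db1$DOI)
matched <- !is.na(hit)
db2$UUID[matched] <- db1$UUID[hit[matched]]
```

After step 3, the record with DOI `10.1/b` carries the same identifier in both datasets, while `10.1/c` keeps the identifier it was given in `db2`.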

All preprocessing and modifications are performed on a copy of the original data, which is used internally by the program. After all pairwise comparisons are completed, the UUID data is added as a new column to the original data.

Thus, the db_list returned by this function contains the same fields provided by the user plus the UUID column with the overlap information. This allows for further analysis using other fields (e.g. 'number of citations' or 'document type').
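
As a sketch of such a follow-up analysis, the base R snippet below finds the documents shared by two datasets from their UUID columns and summarises another field for them. The list is a made-up stand-in for a returned db_list. The column names `UUID` and `Citations` and all values are assumptions for illustration.

```r
# Made-up stand-in for a returned db_list (column names are assumptions)
db_list <- list(
  A = data.frame(UUID = c("u1", "u2", "u3"), Citations = c(5, 0, 12),
                 stringsAsFactors = FALSE),
  B = data.frame(UUID = c("u2", "u3", "u4"), Citations = c(1, 12, 7),
                 stringsAsFactors = FALSE)
)

# UUIDs present in every dataset identify the overlapping documents
shared <- Reduce(intersect, lapply(db_list, function(db) db$UUID))

# Use the overlap to analyse another field, e.g. citations recorded in A
shared_in_A <- db_list$A[db_list$A$UUID %in% shared, ]
mean_citations <- mean(shared_in_A$Citations)
```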

Value

A list containing:

(i) db_list: a modified version of db_list where matching documents share the same UUID

(ii) summary: a summary of the results of the matching procedure

Note

In its internal copy of the data, the program attempts to split the AU (Author) field and keep only the first author. The Levenshtein distance is then calculated on that first author only.

It assumes that the AU field is ";" (semicolon) separated. To perform the matching procedure correctly when this field uses another separator, the user can either: (i) change the separator to a semicolon; or (ii) create a new column containing only the first author.
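
Both options can be sketched in base R. The author strings and the "|" separator below are made up for illustration. With option (ii), the new column would then be the one named as AU in matching_fields.

```r
# Made-up author strings that use "|" instead of ";" as separator
authors <- c("Silva, A.| Souza, B.", "Lee, C.")

# Option (i): change the separator to a semicolon
au_semicolon <- gsub("|", ";", authors, fixed = TRUE)

# Option (ii): build a vector (new column) with only the first author
first_author <- trimws(
  vapply(strsplit(authors, "|", fixed = TRUE), `[`, character(1), 1)
)
```

`fixed = TRUE` matters here because "|" is a special character in regular expressions.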

Examples

#Example list of input dataframes
lapply(ufrj_bio_0122, head, n=1)

#List of columns for matching (identical to biblioverlap()'s defaults)
matching_cols <- list(DI = 'DOI',
                      TI = 'Title',
                      PY = 'Publication Year',
                      AU = 'Author/s',
                      SO = 'Source Title')

#Running document-level matching procedure (first two dataframes)
biblioverlap_results <- biblioverlap(ufrj_bio_0122[1:2], matching_fields = matching_cols)

#Taking a look at the matched db_list
lapply(biblioverlap_results$db_list, head, n=1)

#Taking a look at the matching results summary
biblioverlap_results$summary

[Package biblioverlap version 1.0.2 Index]