R: Remove duplicates from a bibliographic data set

deduplicate {synthesisr}

R Documentation

Remove duplicates from a bibliographic data set

Description

Removes duplicate entries from a bibliographic data set, using sensible defaults.

Usage

deduplicate(data, match_by, method, type = "merge", ...)

Arguments

`data`	A `data.frame` containing bibliographic information.
`match_by`	Name of the column in `data` where duplicates should be sought.
`method`	The duplicate detection function to use; see `string_` or `fuzz_` for examples. Passed to `find_duplicates`.
`type`	How should entries be selected? The default, `"merge"`, selects for each column the entry with the largest number of characters. Alternatively, `"select"` returns the row with the highest total number of characters.
`...`	Arguments passed to `find_duplicates`.

Details

This is a wrapper around `find_duplicates` and `extract_unique_references` that tries to choose some sensible defaults. Use with care.
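The difference between the two `type` options can be illustrated with a short base R sketch. This is not synthesisr's implementation: the data frame `dup_group` and the helper `n_chars` are hypothetical names introduced here, and the sketch assumes the rows have already been flagged as duplicates of one another.

```r
# Hypothetical group of rows already identified as duplicates of each other.
dup_group <- data.frame(
  title = c("EviAtlas",
            "EviAtlas: a tool for visualising evidence synthesis databases"),
  year = c("2019", NA),
  authors = c("Haddaway et al", NA),
  stringsAsFactors = FALSE
)

# Character count per cell, counting NA as zero characters (an assumption).
n_chars <- function(x) {
  n <- nchar(as.character(x))
  n[is.na(x)] <- 0L
  n
}

# "merge"-style: for each column, keep the value with the most characters.
merge_sketch <- as.data.frame(
  lapply(dup_group, function(col) col[which.max(n_chars(col))]),
  stringsAsFactors = FALSE
)

# "select"-style: keep the one row with the largest total character count.
row_totals <- Reduce(`+`, lapply(dup_group, n_chars))
select_sketch <- dup_group[which.max(row_totals), , drop = FALSE]

print(merge_sketch)
print(select_sketch)
```

In this sketch the merged result combines the longer title from the second row with the year and authors from the first, whereas the selected result is the second row unchanged, missing values included.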

Value

A data.frame containing data identified as unique.

Examples

library(synthesisr)

my_df <- data.frame(
  title = c(
    "EviAtlas: a tool for visualising evidence synthesis databases",
    "revtools: An R package to support article screening for evidence synthesis",
    "An automated approach to identifying search terms for systematic reviews",
    "Reproducible, flexible and high-throughput data extraction from primary literature",
    "eviatlas:tool for visualizing evidence synthesis databases.",
    "REVTOOLS a package to support article-screening for evidence synthsis"
  ),
  year = c("2019", "2019", "2019", "2019", NA, NA),
  authors = c("Haddaway et al", "Westgate",
              "Grames et al", "Pick et al", NA, NA),
  stringsAsFactors = FALSE
)

# run deduplication in two steps: find duplicates, then extract unique references
dups <- find_duplicates(
  my_df$title,
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE
)

extract_unique_references(my_df, matches = dups)

# or, in one line:
deduplicate(my_df, "title",
  method = "string_osa",
  rm_punctuation = TRUE,
  to_lower = TRUE)

[Package synthesisr version 0.3.0 Index]