find_duplicates {cleanepi}

R Documentation

Identify and return duplicated rows in a data frame or linelist.

Description

Identify and return duplicated rows in a data frame or linelist.

Usage

find_duplicates(data, target_columns = NULL)

Arguments

`data`	A data frame or linelist.
`target_columns`	A vector of column names or indices to consider when looking for duplicates. When the input data is a `linelist` object, this parameter can be set to `tags` to look for duplicates in the tagged columns. The default is `NULL`, which considers duplicates across all columns.

Value

A data frame or linelist containing all duplicated rows, with the following two additional columns:

`row_id`: the indices of the duplicated rows in the input data. Users can pick from these indices the rows they consider redundant within each group of duplicates.
`group_id`: a unique identifier assigned to each group of duplicates.
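
To make the meaning of the two extra columns concrete, here is a minimal base R sketch that builds a table of the same shape from a toy data frame. It is an illustration only and is not the implementation used by `find_duplicates()`; the toy data and the helper names `key`, `is_dup` and `redundant` are assumptions made for this example.

```r
# Toy data (hypothetical): rows 1 and 3 are identical, as are rows 2 and 5.
df <- data.frame(
  id  = c(1, 2, 1, 3, 2),
  sex = c("F", "M", "F", "F", "M")
)

# Build one key per row from the columns of interest (here, all columns).
key <- do.call(paste, c(df, sep = "\r"))

# Flag every row whose key occurs more than once, including first occurrences.
is_dup <- key %in% key[duplicated(key)]

# Assemble the duplicated rows with their original indices and a group label.
dups <- cbind(
  row_id   = which(is_dup),
  group_id = match(key[is_dup], unique(key[is_dup])),
  df[is_dup, , drop = FALSE]
)

# One possible choice: treat every row after the first in each group as redundant.
redundant <- dups$row_id[duplicated(dups$group_id)]
deduplicated <- df[-redundant, , drop = FALSE]
```

Here `row_id` points back to positions in the input data, so the choice of which rows to drop can be made per group and then applied to the original data frame.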

Examples

# Look for rows duplicated across four columns of the test dataset
# shipped with the package.
dups <- find_duplicates(
  data           = readRDS(system.file("extdata", "test_linelist.RDS",
                                       package = "cleanepi")),
  target_columns = c("dt_onset", "dt_report", "sex", "outcome")
)

[Package cleanepi version 1.0.2 Index]