| duplicate_detect {scrutiny} | R Documentation |
Detect duplicate values
Description
duplicate_detect() is superseded because it's less informative than
duplicate_tally() and duplicate_count(). Use these functions
instead.
For every value in a vector or data frame, duplicate_detect() tests
whether there is at least one identical value. Test results are presented
next to every value.
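The test described above can be illustrated for a single vector in base R. This is only a minimal sketch, not scrutiny's implementation: `has_duplicate()` is a hypothetical helper, and comparing values as strings is an assumption based on the note that all values are coerced to character.

```r
# Hypothetical helper, not part of scrutiny:
# flag every value that occurs more than once in `x`.
has_duplicate <- function(x) {
  x_chr <- as.character(x)  # compare values as strings
  x_chr %in% x_chr[duplicated(x_chr)]
}

# 4.2 occurs twice, so both of its instances are flagged:
has_duplicate(c(4.2, 7.1, 4.2, 9.9))
```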
This function is a blunt tool designed for initial data checking. Don't put too much weight on its results.
For summary statistics, call audit() on the results.
Usage
duplicate_detect(x, ignore = NULL, colname_end = "dup", numeric_only)
Arguments
| x | Vector or data frame. |
| ignore | Optionally, a vector of values that should not be checked. In the test result columns, they will be marked NA. |
| colname_end | String. Name ending of the logical test result columns. Default is "dup". |
| numeric_only | [Deprecated] No longer used: All values are coerced to character. |
Details
This function is not very informative with many input values that
only have a few characters each. Many of them may have duplicates just by
chance. For example, in R's built-in iris data set, 99% of values have
duplicates.
In general, the fewer values and the more characters per value, the more significant the results.
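A rough base R check of this effect on iris might look as follows. It is a sketch under two assumptions: values are compared as strings, and they are pooled across all columns. The exact share may differ from what duplicate_detect() itself reports.

```r
# Pool all iris values as strings (assumption: compared across columns)
iris_values <- unlist(lapply(iris, as.character), use.names = FALSE)

# Share of values that have at least one identical counterpart
dup_share <- mean(iris_values %in% iris_values[duplicated(iris_values)])
round(dup_share, 2)
```

Short values such as one-decimal measurements collide easily, so a high share here says little about the data.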
Value
A tibble (data frame). It has all the columns from x, and to each
of these columns' right, the corresponding test result column.
The tibble has the scr_dup_detect class, which is recognized by the
audit() generic.
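The column layout described here can be sketched in base R. This is a hypothetical illustration, not scrutiny's code: the toy data frame `df` is invented, and the underscore that joins the column name and the "dup" ending is an assumption.

```r
# Invented toy data
df <- data.frame(a = c(1, 2, 3), b = c(3, 4, 4))

# Pool all values as strings: duplicates count anywhere in the data frame
all_values <- unlist(lapply(df, as.character), use.names = FALSE)
dup_values <- unique(all_values[duplicated(all_values)])

# Put each test result column directly to the right of its source column
out <- list()
for (col in names(df)) {
  out[[col]] <- df[[col]]
  # Assumed naming scheme: column name, underscore, name ending
  out[[paste0(col, "_dup")]] <- as.character(df[[col]]) %in% dup_values
}
out <- as.data.frame(out)
names(out)
```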
Summaries with audit()
There is an S3 method for the
audit() generic, so you can call audit() following
duplicate_detect(). It returns a tibble with these columns:
- term: The original data frame's variables.
- dup_count: Number of "duplicated" values of that term variable: those which have at least one duplicate anywhere in the data frame.
- total: Number of all non-NA values of that term variable.
- dup_rate: Rate of "duplicated" values of that term variable.
The final row, .total, summarizes across all other rows: It adds up the
dup_count and total columns, and calculates the mean of the
dup_rate column.
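How these summary columns relate to each other can be sketched in base R. The `results` list holds invented test results, and the code is an illustration of the description above rather than the audit() method itself.

```r
# Invented test results for two variables (TRUE = value has a duplicate)
results <- list(
  a = c(FALSE, FALSE, TRUE),
  b = c(TRUE, TRUE, TRUE, TRUE, NA)
)

dup_count <- vapply(results, function(x) sum(x, na.rm = TRUE), integer(1), USE.NAMES = FALSE)
total <- vapply(results, function(x) sum(!is.na(x)), integer(1), USE.NAMES = FALSE)
dup_rate <- dup_count / total

# The final row adds up the counts but averages the rates
summary_tbl <- data.frame(
  term = c(names(results), ".total"),
  dup_count = c(dup_count, sum(dup_count)),
  total = c(total, sum(total)),
  dup_rate = c(dup_rate, mean(dup_rate))
)
summary_tbl
```

Because the .total rate is a mean of per-variable rates, it can differ from the pooled rate: here it is 2/3, whereas dividing the summed counts gives 5/7.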
See Also
- duplicate_tally() to count instances of a value instead of just stating whether it is duplicated.
- duplicate_count() for a frequency table.
- duplicate_count_colpair() to check each combination of columns for duplicates.
- janitor::get_dupes() to search for duplicate rows.
Examples
# Find duplicate values in a data frame...
duplicate_detect(x = pigs4)
# ...or in a single vector:
duplicate_detect(x = pigs4$snout)
# Summary statistics with `audit()`:
pigs4 %>%
duplicate_detect() %>%
audit()
# Any values can be ignored:
pigs4 %>%
duplicate_detect(ignore = c(8.131, 7.574))