identify_inconsistency {eHDPrep} | R Documentation |
Identify inconsistencies in a dataset
Description
Tests pairs of variables for consistency between their values according to a table of rules or 'consistency table'.
Usage
identify_inconsistency(data = NULL, consis_tbl = NULL, id_var = NULL)
Arguments
data |
data frame which will be checked for internal consistency |
consis_tbl |
data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section) |
id_var |
An unquoted expression which corresponds to a variable in
|
Details
Multiple types of checks for inconsistency are supported:
1. Comparing by logical operators (<, <=, ==, !=, >=, >)
2. Comparing permitted categories (e.g. cat1 in varA only if cat2 in varB)
3. Comparing permitted numeric ranges (e.g. 20-25 in varC only if 10-20 in varD)
4. Mixtures of 2 and 3 (e.g. cat1 in varA only if 20-25 in varC)
The consistency tests rely on such rules being specified in a separate data frame (consis_tbl; see section "Consistency Table Requirements").
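As a rough illustration of the first check type (comparison by a logical operator), the base R sketch below applies an operator given as a string to two columns. The data frame and variable names are made up for this example, and the code is not taken from the eHDPrep source.

```r
# Hypothetical toy data; row 3 breaks the rule red_beans <= total_beans.
beans_toy <- data.frame(red_beans = c(1, 2, 30), total_beans = c(2, 4, 6))

# Turn the operator string from a rule into a function, then apply it row-wise.
op <- match.fun("<=")
passes <- op(beans_toy$red_beans, beans_toy$total_beans)

# Row indices that fail the rule.
failing_rows <- which(!passes)
failing_rows
```

match.fun() is used here only because it accepts operator names as strings; how the package evaluates the operator internally is not stated in this page.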
Variable A is given higher priority than Variable B when A is a category. If A (as character) is not equal to the value in column 4, the check is not made. This accounts for one-way dependencies (e.g. VarA is fruit, VarB is apple).
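The one-way behaviour described above can be sketched in base R as follows. All function names (parse_boundary, in_boundary, flag_inconsistent) and the toy data are hypothetical and exist only for this illustration; the treatment of NA values and the inclusive range limits are assumptions, not documented behaviour of identify_inconsistency().

```r
# Hypothetical helper: "20:25" is read as a numeric range, anything else as a category.
parse_boundary <- function(x) {
  if (grepl("^-?[0-9.]+:-?[0-9.]+$", x)) {
    list(type = "range",
         limits = as.numeric(strsplit(x, ":", fixed = TRUE)[[1]]))
  } else {
    list(type = "category", value = x)
  }
}

# TRUE where `values` match the boundary (range limits treated as inclusive).
in_boundary <- function(values, boundary) {
  b <- parse_boundary(boundary)
  if (b$type == "range") {
    v <- suppressWarnings(as.numeric(as.character(values)))
    !is.na(v) & v >= b$limits[1] & v <= b$limits[2]
  } else {
    !is.na(values) & as.character(values) == b$value
  }
}

# One-way check: a row is tested only where varA matches its boundary;
# among tested rows, varB must match its own boundary.
flag_inconsistent <- function(data, varA, varB, a_boundary, b_boundary) {
  a_applies <- in_boundary(data[[varA]], a_boundary)
  a_applies & !in_boundary(data[[varB]], b_boundary)
}

toy <- data.frame(
  food_group = c("fruit", "fruit", "vegetable"),
  food       = c("apple", "carrot", "carrot"),
  stringsAsFactors = FALSE
)

# Row 2 is flagged; row 3 is never tested because food_group is not "fruit".
flagged <- flag_inconsistent(toy, "food_group", "food", "fruit", "apple")
toy[flagged, ]
```

The same helper handles the mixed case (a category in one variable, a range in the other) because each boundary is parsed independently.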
Value
tibble detailing any identified internal inconsistencies in data, if any are found. If no inconsistencies are found, data is returned invisibly.
Consistency Table Requirements
Table must have exactly five character columns. The columns should be ordered according to the list below which describes the values of each column:
1. First column name of data values that will be subject to consistency checking. String. Required.
2. Second column name of data values that will be subject to consistency checking. String. Required.
3. Logical test to compare columns one and two. One of: ">", ">=", "<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA values.
4. Either a single character string or a colon-separated range of numbers which should only appear in column A. Optional if column 3 has a non-NA value.
5. Either a single character string or a colon-separated range of numbers which should only appear in column B given the value/range specified in column 4. Optional if column 3 has a non-NA value.
Each row should detail one test to make. Therefore, either column 3 or columns 4 and 5 must contain non-NA values.
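As a minimal sketch of these requirements, the base R code below builds a two-rule table and checks its structure by hand. The column names are an assumption (they mirror those used under Examples), and the checks are written for illustration only; they do not reproduce validate_consistency_tbl(), which is the package's own validator listed under See Also.

```r
# Hypothetical rule table: row 1 uses a logical test, row 2 uses boundaries.
rules <- data.frame(
  varA            = c("red_beans", "red_beans"),
  varB            = c("blue_beans", "red_bean_summary"),
  lgl_test        = c("==", NA),
  varA_boundaries = c(NA, "1:9"),
  varB_boundaries = c(NA, "few_beans"),
  stringsAsFactors = FALSE
)

# Structural checks: exactly five columns, all of type character.
has_five_cols <- ncol(rules) == 5
all_character <- all(vapply(rules, is.character, logical(1)))

# Each row needs a logical test (column 3) or both boundaries (columns 4 and 5).
has_test       <- !is.na(rules[[3]])
has_boundaries <- !is.na(rules[[4]]) & !is.na(rules[[5]])
row_ok         <- has_test | has_boundaries

# Any logical test present must be one of the six permitted operators.
valid_ops <- c(">", ">=", "<", "<=", "==", "!=")
op_ok <- is.na(rules[[3]]) | rules[[3]] %in% valid_ops

c(has_five_cols, all_character, all(row_ok), all(op_ok))
```

Writing NA alongside strings inside c() keeps each column as character, which is what the five-character-column requirement asks for.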
See Also
Other internal consistency functions:
validate_consistency_tbl()
Examples
require(tibble)
# example with a synthetic dataset of bean counts
# the function does a lot, so a simple dataset keeps the example easy to follow
#
# creating `data`:
beans <- tibble::tibble(red_beans = 1:15,
                        blue_beans = 1:15,
                        total_beans = 1:15*2,
                        red_bean_summary = c(rep("few_beans",9), rep("many_beans",6)))
#
# creating `consis_tbl`
bean_rules <- tibble::tribble(~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries,
                              "red_beans", "blue_beans", "==", NA, NA,
                              "red_beans", "total_beans", "<=", NA, NA,
                              "red_beans", "red_bean_summary", NA, "1:9", "few_beans",
                              "red_beans", "red_bean_summary", NA, "10:15", "many_beans")
identify_inconsistency(beans, bean_rules)
# creating some inconsistencies as examples
beans[1, "red_bean_summary"] <- "many_beans"
beans[1, "red_beans"] <- 10
identify_inconsistency(beans, bean_rules)