identify_inconsistency {eHDPrep} | R Documentation |
Tests pairs of variables for consistency between their values according to a table of rules or 'consistency table'.
identify_inconsistency(data = NULL, consis_tbl = NULL, id_var = NULL)
data |
data frame which will be checked for internal consistency |
consis_tbl |
data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section) |
id_var |
An unquoted expression which corresponds to a variable in data which identifies each row |
Multiple types of checks for inconsistency are supported:
1. Comparing by logical operators (<, <=, ==, !=, >=, >)
2. Comparing permitted categories (e.g. cat1 in varA only if cat2 in varB)
3. Comparing permitted numeric ranges (e.g. 20-25 in varC only if 10-20 in varD)
4. Mixtures of 2 and 3 (e.g. cat1 in varA only if 20-25 in varC)
The consistency tests rely on such rules being specified in a separate data frame (consis_tbl; see section "Consistency Table Requirements").
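The first type of check can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the vectors `red` and `total` and the variable `op` are hypothetical, and the operator is stored as a string in the same way it would appear in the third column of consis_tbl.

```r
# Hypothetical values for two variables that should satisfy red <= total
red   <- c(1, 5, 3)
total <- c(2, 4, 6)

# The operator is held as a string, as it would be in a consistency table
op <- "<="

# match.fun() looks up the function named by the string; applying it
# compares the two vectors element by element
passes <- match.fun(op)(red, total)

# Positions where the rule is broken
which(!passes)
```

Here only the second pair (5 and 4) breaks the rule.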
Variable A is given higher priority than Variable B when A is a category. If the value of A (as a character string) is not equal to the value in column 4, the check is not made. This accounts for one-way dependencies (e.g. VarA is fruit, VarB is apple).
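The one-way dependency described above can be sketched in base R. This is an illustration under stated assumptions rather than the function's actual code: the data frame `foods` and the names `a_value`, `b_value`, `applies` and `inconsistent` are hypothetical.

```r
# Hypothetical data: `food_group` plays the role of variable A,
# `food` the role of variable B
foods <- data.frame(
  food_group = c("fruit", "fruit", "vegetable"),
  food       = c("apple", "carrot", "carrot"),
  stringsAsFactors = FALSE
)

# Rule: "fruit" in food_group only if "apple" in food
a_value <- "fruit"   # what column 4 of the consistency table would hold
b_value <- "apple"   # what column 5 of the consistency table would hold

# The check is made only for rows where A equals the column 4 value
applies <- foods$food_group == a_value

# Among those rows, B must equal the column 5 value
inconsistent <- applies & foods$food != b_value

foods[inconsistent, ]
```

The third row is never tested because its value of A is not "fruit", so only the second row (fruit, carrot) is flagged.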
A tibble detailing any internal inconsistencies identified in data. If no inconsistencies are found, data is returned invisibly.
The table must have exactly five character columns, ordered as listed below:
1. Name of the first variable in data (variable A) whose values will be subject to consistency checking. String. Required.
2. Name of the second variable in data (variable B) whose values will be subject to consistency checking. String. Required.
3. Logical test to compare the variables named in columns 1 and 2. One of: ">", ">=", "<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA values.
4. Either a single character string or a colon-separated range of numbers which should only appear in variable A. Optional if column 3 has a non-NA value.
5. Either a single character string or a colon-separated range of numbers which should only appear in variable B given the value/range specified in column 4. Optional if column 3 has a non-NA value.
Each row should detail one test to make. Therefore, either column 3 or columns 4 and 5 must contain non-NA values.
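The row-level requirement and the colon-separated range format can be sketched in base R as follows. The helpers `row_is_specified()` and `parse_range()` are hypothetical names introduced for illustration; they are not part of eHDPrep and do not reproduce validate_consistency_tbl().

```r
# Hypothetical helper: TRUE for each row that specifies a usable test,
# i.e. column 3 is non-NA, or columns 4 and 5 are both non-NA
row_is_specified <- function(consis_tbl) {
  has_lgl_test   <- !is.na(consis_tbl[[3]])
  has_boundaries <- !is.na(consis_tbl[[4]]) & !is.na(consis_tbl[[5]])
  has_lgl_test | has_boundaries
}

# Hypothetical helper: split a range such as "10:15" into its two numbers
parse_range <- function(x) {
  as.numeric(strsplit(x, ":", fixed = TRUE)[[1]])
}

# A small rules table; the third row deliberately specifies no test
rules <- data.frame(
  varA            = c("red_beans", "red_beans", "red_beans"),
  varB            = c("blue_beans", "red_bean_summary", "total_beans"),
  lgl_test        = c("==", NA, NA),
  varA_boundaries = c(NA, "10:15", NA),
  varB_boundaries = c(NA, "many_beans", NA),
  stringsAsFactors = FALSE
)

row_is_specified(rules)
parse_range("10:15")
```

The first two rows satisfy the requirement (one through column 3, the other through columns 4 and 5), while the third does not.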
Other internal consistency functions:
validate_consistency_tbl()
library(tibble)
library(eHDPrep)

# Example with a synthetic dataset of bean counts.
# The function does a lot, so a simple dataset keeps the example clear.

# Creating `data`:
beans <- tibble::tibble(
  red_beans = 1:15,
  blue_beans = 1:15,
  total_beans = 1:15 * 2,
  red_bean_summary = c(rep("few_beans", 9), rep("many_beans", 6))
)

# Creating `consis_tbl` (one rule per row):
bean_rules <- tibble::tribble(
  ~varA, ~varB, ~lgl_test, ~varA_boundaries, ~varB_boundaries,
  "red_beans", "blue_beans", "==", NA, NA,
  "red_beans", "total_beans", "<=", NA, NA,
  "red_beans", "red_bean_summary", NA, "1:9", "few_beans",
  "red_beans", "red_bean_summary", NA, "10:15", "many_beans"
)

# The synthetic data satisfy all four rules
identify_inconsistency(beans, bean_rules)

# Creating some inconsistencies as examples:
# row 1 now has red_beans = 10, but blue_beans = 1 and total_beans = 2
beans[1, "red_bean_summary"] <- "many_beans"
beans[1, "red_beans"] <- 10
identify_inconsistency(beans, bean_rules)