identify_inconsistency {eHDPrep}    R Documentation

Identify inconsistencies in a dataset

Description

Tests pairs of variables for consistency between their values according to a table of rules or 'consistency table'.

Usage

identify_inconsistency(data = NULL, consis_tbl = NULL, id_var = NULL)

Arguments

data

data frame which will be checked for internal consistency

consis_tbl

data frame or tibble containing information on internal consistency rules (see "Consistency Table Requirements" section)

id_var

An unquoted expression which corresponds to a variable in data which identifies each row.

Details

Multiple types of checks for inconsistency are supported:

  1. Comparing by logical operators (<, <=, ==, !=, >=, >)

  2. Comparing permitted categories (e.g. cat1 in varA only if cat2 in varB)

  3. Comparing permitted numeric ranges (e.g. 20-25 in varC only if 10-20 in varD)

  4. Mixtures of 2 and 3 (e.g. cat1 in varA only if 20-25 in varC)

The consistency tests rely on such rules being specified in a separate data frame (consis_tbl; see section "Consistency Table Requirements").
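As a rough illustration of check type 1, a logical operator stored as a string can be looked up and applied to two columns. The sketch below uses base R only and is not taken from the eHDPrep source; the `rule` list and the small `beans` data frame are hypothetical stand-ins for one row of `consis_tbl` and for `data`.

```r
# Illustrative only: a minimal sketch of a string-specified comparison,
# not eHDPrep's internal code.
beans <- data.frame(red_beans = c(1, 2, 10), total_beans = c(2, 4, 8))

# Hypothetical stand-in for one row of a consistency table
rule <- list(varA = "red_beans", varB = "total_beans", lgl_test = "<=")

# Look up the comparison function from its string name, e.g. "<=" gives `<=`
compare <- match.fun(rule$lgl_test)

# TRUE where the rule holds, FALSE where the row is inconsistent
consistent <- compare(beans[[rule$varA]], beans[[rule$varB]])

# Row numbers that break the rule
inconsistent_rows <- which(!consistent)
print(inconsistent_rows)
```

Here only the third row breaks the rule, because 10 is not less than or equal to 8.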

Variable A is given higher priority than variable B when A is a category. If the value of A (as a character string) is not equal to the value in column 4 of the consistency table, the check is not made for that row. This accounts for one-way dependencies (e.g. VarA is fruit, VarB is apple).
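The one-way behaviour described above can be sketched in base R. This is an interpretation of the paragraph, not the package's implementation; the `food` data frame and the `a_value` and `b_value` names are hypothetical and stand for the values in columns 4 and 5 of a consistency table row.

```r
# Illustrative only: a sketch of a one-way category rule,
# not eHDPrep's internal code.
food <- data.frame(
  varA = c("fruit", "fruit", "vegetable"),
  varB = c("apple", "carrot", "carrot"),
  stringsAsFactors = FALSE
)

a_value <- "fruit"  # hypothetical value from column 4
b_value <- "apple"  # hypothetical value from column 5

# Only rows where variable A (as character) equals the column 4 value are checked
checked <- as.character(food$varA) == a_value

# Among the checked rows, variable B must equal the column 5 value
flagged <- checked & food$varB != b_value
print(which(flagged))
```

The third row is never flagged, whatever its value of `varB`, because its value of `varA` does not match the rule and so no check is made.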

Value

tibble detailing any identified internal inconsistencies in data, if any are found. If no inconsistencies are found, data is returned invisibly.

Consistency Table Requirements

Table must have exactly five character columns. The columns should be ordered according to the list below which describes the values of each column:

  1. Name of the first variable in data (variable A) that will be subject to consistency checking. String. Required.

  2. Name of the second variable in data (variable B) that will be subject to consistency checking. String. Required.

  3. Logical test to compare columns one and two. One of: ">", ">=", "<", "<=", "==", "!=". String. Optional if columns 4 and 5 have non-NA values.

  4. Either a single character string or a colon-separated range of numbers which should only appear in variable A (the variable named in column 1). Optional if column 3 has a non-NA value.

  5. Either a single character string or a colon-separated range of numbers which should only appear in variable B (the variable named in column 2) given the value/range specified in column 4. Optional if column 3 has a non-NA value.

Each row should detail one test to make. Therefore, either column 3 or columns 4 and 5 must contain non-NA values.
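The requirements above can be checked by hand with a few lines of base R. The sketch below is illustrative and is not how eHDPrep validates the table (the package provides validate_consistency_tbl() for that purpose). The `rules` table and the `parse_range` helper are hypothetical, and treating both ends of a range as inclusive is an assumption, since the text above does not say.

```r
# Illustrative only: not eHDPrep's implementation.
# Hypothetical consistency table following the five-column layout
rules <- data.frame(
  varA = c("red_beans", "red_beans", "red_beans"),
  varB = c("blue_beans", "red_bean_summary", "total_beans"),
  lgl_test = c("==", NA, NA),
  varA_boundaries = c(NA, "1:9", NA),
  varB_boundaries = c(NA, "few_beans", NA),
  stringsAsFactors = FALSE
)

# A row specifies a test if column 3 is set, or if columns 4 and 5 are both set
has_lgl_test <- !is.na(rules[[3]])
has_boundaries <- !is.na(rules[[4]]) & !is.na(rules[[5]])
row_is_valid <- has_lgl_test | has_boundaries
invalid_rows <- which(!row_is_valid)  # the third row specifies no test

# Hypothetical helper: split "low:high" into two numbers.
# Returns NULL when the string is not a numeric range (i.e. a category).
parse_range <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)[[1]]
  bounds <- suppressWarnings(as.numeric(parts))
  if (length(bounds) != 2 || anyNA(bounds)) {
    return(NULL)
  }
  bounds
}

# Assumption: both ends of the range are inclusive
red_beans <- c(1, 9, 10, 15)
bounds <- parse_range("1:9")
in_range <- red_beans >= bounds[1] & red_beans <= bounds[2]
```

A string such as "few_beans" does not parse as a range, so `parse_range("few_beans")` returns NULL and the value would be treated as a category instead.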

See Also

Other internal consistency functions: validate_consistency_tbl()

Examples

library(tibble)
# example with a synthetic dataset of bean counts
# the function does a lot, so a simple dataset keeps the example easy to follow
#
# creating `data`:
beans <- tibble::tibble(
  red_beans = 1:15,
  blue_beans = 1:15,
  total_beans = 1:15 * 2,
  red_bean_summary = c(rep("few_beans", 9), rep("many_beans", 6))
)
#
# creating `consis_tbl`
bean_rules <- tibble::tribble(
  ~varA,       ~varB,              ~lgl_test, ~varA_boundaries, ~varB_boundaries,
  "red_beans", "blue_beans",       "==",      NA,               NA,
  "red_beans", "total_beans",      "<=",      NA,               NA,
  "red_beans", "red_bean_summary", NA,        "1:9",            "few_beans",
  "red_beans", "red_bean_summary", NA,        "10:15",          "many_beans"
)

identify_inconsistency(beans, bean_rules)

# creating some inconsistencies as examples
beans[1, "red_bean_summary"] <- "many_beans"
beans[1, "red_beans"] <- 10

identify_inconsistency(beans, bean_rules)


[Package eHDPrep version 1.3.3 Index]