compare {validate} | R Documentation |
Compare similar data sets
Description
Compare versions of a data set by comparing their performance against a
set of rules or other quality indicators. This function takes two or
more data sets and compares the performance of data sets 2, 3, ...
against that of the first data set (default) or against the previous one
(by setting how='sequential').
Usage
compare(x, ...)
## S4 method for signature 'validator'
compare(x, ..., .list = list(), how = c("to_first", "sequential"))
## S4 method for signature 'indicator'
compare(x, ..., .list = NULL)
Arguments
x |
An R object |
... |
data frames, comma separated. Names become column names in the output. |
.list |
Optional list of data sets, will be concatenated with ... |
how |
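How to compare: "to_first" (default) compares each data set to the first one; "sequential" compares each data set to the previous one. |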
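The sketch below shows one way to pass data sets through .list instead of as separate arguments. It assumes the validate package is installed and uses its retailers data. The rule set and the names raw and imputed are illustrative choices, not part of this help page.

```r
library(validate)

data(retailers)
# Illustrative rule set (an assumption, chosen for this sketch)
rules <- validator(turnover >= 0, staff >= 0)

# Second version of the data: missing staff counts replaced by the median
imputed <- retailers
imputed$staff[is.na(imputed$staff)] <- median(imputed$staff, na.rm = TRUE)

# A named list of data sets, handed to compare() via .list
versions <- list(raw = retailers, imputed = imputed)
out <- compare(rules, .list = versions)
out
```

Building the list first can be convenient when the versions are produced in a loop or read from files, since the number of data sets does not need to be known when the call is written.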
Value
For validator: An array where each column represents one data set.
The rows count the following attributes:
Number of validations performed
Number of validations that evaluate to NA (unverifiable)
Number of validations that evaluate to a logical (verifiable)
Number of validations that evaluate to TRUE
Number of validations that evaluate to FALSE
Number of extra validations that evaluate to NA (new unverifiable)
Number of validations that still evaluate to NA (still unverifiable)
Number of validations that still evaluate to TRUE
Number of extra validations that evaluate to TRUE
Number of validations that still evaluate to FALSE
Number of extra validations that evaluate to FALSE
For indicator: A list with the following components:
numeric: An array collecting results of scalar indicators (e.g. mean(x)).
nonnumeric: An array collecting results of nonnumeric scalar indicators (e.g. names(which.max(table(x)))).
array: A list of arrays, collecting results of vector indicators (e.g. x/mean(x)).
Comparing datasets by performance against validator objects
Suppose we have a current and a previous version of a data set. Both
can be inspected by confronting them with a rule set.
The status changes in rule violations can be partitioned as shown in the
following figure.
This function computes the partition for two or more
data sets, comparing the current set to the first (default) or to the
previous (by setting how='sequential').
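To make the partition concrete, here is a minimal base-R sketch that counts the status changes between two vectors of validation outcomes. It is an illustration under stated assumptions, not the package's implementation: the vectors prev and curr are made-up outcomes, the helpers is_true and is_false are hypothetical, and "extra" is assumed to mean "has this status now but did not have it before".

```r
# Made-up outcomes of the same five validations on two versions of a data set
prev <- c(TRUE, FALSE, NA, TRUE, FALSE)
curr <- c(TRUE, TRUE,  NA, FALSE, NA)

# Hypothetical helpers that treat NA as neither TRUE nor FALSE
is_true  <- function(v) !is.na(v) & v
is_false <- function(v) !is.na(v) & !v

partition <- c(
  validations        = length(curr),
  verifiable         = sum(!is.na(curr)),
  unverifiable       = sum(is.na(curr)),
  still_unverifiable = sum(is.na(prev) & is.na(curr)),
  new_unverifiable   = sum(!is.na(prev) & is.na(curr)),
  satisfied          = sum(is_true(curr)),
  still_satisfied    = sum(is_true(prev) & is_true(curr)),
  new_satisfied      = sum(!is_true(prev) & is_true(curr)),
  violated           = sum(is_false(curr)),
  still_violated     = sum(is_false(prev) & is_false(curr)),
  new_violated       = sum(!is_false(prev) & is_false(curr))
)
partition
```

Within each status, the "still" and "new" counts add up to the total for the current version, which is what makes this a partition.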
References
The figure is reproduced from MPJ van der Loo and E. De Jonge (2018) Statistical Data Cleaning with applications in R (John Wiley & Sons).
See Also
Other validation-methods:
aggregate,validation-method
all,validation-method
any,validation-method
barplot,validation-method
check_that()
confront()
event()
names<-,rule,character-method
plot,validation-method
sort,validation-method
summary()
validation-class
values()
Other comparing:
as.data.frame,cellComparison-method
as.data.frame,validatorComparison-method
barplot,cellComparison-method
barplot,validatorComparison-method
cells()
match_cells()
plot,cellComparison-method
plot,validatorComparison-method
Examples
library(validate)
data(retailers)
rules <- validator(turnover >= 0, staff >= 0, other.rev >= 0)
# start with raw data
step0 <- retailers
# impute turnovers
step1 <- step0
step1$turnover[is.na(step1$turnover)] <- mean(step1$turnover,na.rm=TRUE)
# flip sign of negative revenues
step2 <- step1
step2$other.rev <- abs(step2$other.rev)
# create an overview of differences, comparing to the previous step
compare(rules, raw = step0, imputed = step1, flipped = step2, how="sequential")
# create an overview of differences compared to raw data
out <- compare(rules, raw = step0, imputed = step1, flipped = step2)
out
# graphical overview
plot(out)
barplot(out)
# transform data to data.frame (easy for use with ggplot)
as.data.frame(out)