locate_errors {errorlocate} | R Documentation |
Find errors in data
Description
Find out which fields in a data.frame are "faulty" using validation rules.
This method returns the errors found, according to the specified method x.
Use replace_errors() to automatically remove these errors.
Usage
locate_errors(
data,
x,
...,
cl = NULL,
Ncpus = getOption("Ncpus", 1),
timeout = 60
)
## S4 method for signature 'data.frame,validator'
locate_errors(
data,
x,
weight = NULL,
ref = NULL,
...,
cl = NULL,
Ncpus = getOption("Ncpus", 1),
timeout = 60
)
## S4 method for signature 'data.frame,ErrorLocalizer'
locate_errors(
data,
x,
weight = NULL,
ref = NULL,
...,
cl = NULL,
Ncpus = getOption("Ncpus", 1),
timeout = 60
)
Arguments
data |
data to be checked |
x |
validation rules or errorlocalizer object to be used for finding possible errors. |
... |
optional parameters that are passed to the solver (see Details) |
cl |
optional parallel / cluster. |
Ncpus |
number of nodes to use. See details |
timeout |
maximum number of seconds that the localizer should use per record. |
weight |
|
ref |
|
Details
Use an Inf weight specification to fixate variables that cannot be changed.
See expand_weights() for more details.
locate_errors uses lpSolveAPI to formulate and solve a mixed integer problem.
For details see the vignettes.
This solver has many options: see lpSolveAPI::lp.control.options. Noteworthy
options are:
- timeout: restricts the time the solver spends on a record (seconds).
- break.at.value: set this to minimum weight + 1 to improve speed.
- presolve: default for errorlocate is "rows". Set to "none" when you have solutions where all variables are deemed wrong.
locate_errors can be run on multiple cores using the R package parallel.
The easiest way to use the parallel option is to set Ncpus to the number of desired cores; see also parallel::detectCores().
Alternatively, one can create a cluster object (parallel::makeCluster()) and use cl to pass the cluster object.
Or set cl to an integer, which results in parallel::mclapply(), which only works on non-Windows platforms.
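The two parallel options described above can be sketched as follows. This is a minimal illustration rather than a recommended setup: the rule set, the two-row data frame, and the names n_workers, le_ncpus, le_cl and cluster are assumptions made for the example, and the block is guarded so that it only calls locate_errors when the errorlocate and validate packages are installed.

```r
# Pick a modest worker count; detectCores() may return NA on some platforms.
n_workers <- min(2L, max(1L, parallel::detectCores(), na.rm = TRUE))

if (requireNamespace("errorlocate", quietly = TRUE) &&
    requireNamespace("validate", quietly = TRUE)) {
  library(validate)
  library(errorlocate)

  # Illustrative rules and data (hypothetical values).
  rules <- validator( profit + cost == turnover
                    , cost >= 0.6 * turnover
                    , turnover >= 0
                    )
  data <- data.frame( profit   = c(755, 50)
                    , cost     = c(125, 150)
                    , turnover = c(200, 200)
                    )

  # Option 1: set Ncpus and let locate_errors handle the parallel setup.
  le_ncpus <- locate_errors(data, rules, Ncpus = n_workers)

  # Option 2: create a cluster yourself and pass it via cl.
  # The cluster is stopped afterwards, even if locate_errors fails.
  cluster <- parallel::makeCluster(n_workers)
  le_cl <- tryCatch(
    locate_errors(data, rules, cl = cluster),
    finally = parallel::stopCluster(cluster)
  )
}
```

Passing an explicit cluster is mainly useful when the same workers are reused across several calls; for a single call, setting Ncpus is the simpler choice.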
Value
errorlocation-class()
object describing the errors found.
See Also
Other error finding: errorlocation-class, errors_removed(), expand_weights(), replace_errors()
Examples
rules <- validator( profit + cost == turnover
, cost >= 0.6 * turnover # cost should be at least 60% of turnover
, turnover >= 0 # can not be negative.
)
data <- data.frame( profit = 755
, cost = 125
, turnover = 200
)
le <- locate_errors(data, rules)
print(le)
summary(le)
v_categorical <- validator( branch %in% c("government", "industry")
, tax %in% c("none", "VAT")
, if (tax == "VAT") branch == "industry"
)
data <- read.csv(text=
" branch, tax
government, VAT
industry , VAT
", strip.white = TRUE)
locate_errors(data, v_categorical)$errors
v_logical <- validator( citizen %in% c(TRUE, FALSE)
, voted %in% c(TRUE, FALSE)
, if (voted == TRUE) citizen == TRUE
)
data <- data.frame(voted = TRUE, citizen = FALSE)
locate_errors(data, v_logical, weight=c(2,1))$errors
# try a conditional rule
v <- validator( married %in% c(TRUE, FALSE)
, if (married==TRUE) age >= 17
)
data <- data.frame( married = TRUE, age = 16)
locate_errors(data, v, weight=c(married=1, age=2))$errors
# different weights per row
data <- read.csv(text=
"married, age
TRUE, 16
TRUE, 14
", strip.white = TRUE)
weight <- read.csv(text=
"married, age
1, 2
2, 1
", strip.white = TRUE)
locate_errors(data, v, weight = weight)$errors
# fixate / exclude a variable from error localization
# using an Inf weight
weight <- c(age = Inf)
locate_errors(data, v, weight = weight)$errors