dataquality {ems} | R Documentation |
Collection of functions to check data quality in a dataset and remove invalid or extreme values.
Description
These functions return the counts and fractions of expected values, unexpected values, missing values and invalid values. They work with factor, numeric and date variables. t_factor, t_num, and t_date do the job for a single variable and have simpler arguments, while factor.table, num.table, and date.table do the job for several variables at once. rm.unwanted checks the factor and numeric variables and removes the invalid or extreme values. This approach is attractive before data imputation. They all return a data.frame.
t_factor and factor.table will try to get factor or character variables and check how much of their content matches the expected levels. They will try to treat the levels or cells containing " " as NAs.
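The tally described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name count_levels and the toy vector are hypothetical, and treating any blank or whitespace-only cell as NA is an assumption.

```r
# Hypothetical helper illustrating the kind of tally t_factor reports.
# Assumption-based sketch; this is not the ems source code.
count_levels <- function(x, legal) {
  x <- as.character(x)
  x[!is.na(x) & trimws(x) == ""] <- NA   # assumption: blank cells count as missing
  counts <- c(
    expected   = sum(x %in% legal),
    unexpected = sum(!is.na(x) & !(x %in% legal)),
    missing    = sum(is.na(x))
  )
  data.frame(count = counts, fraction = round(counts / length(x), 3))
}

toy <- c("Yes", "No", "yes ", "", NA, "Ignored", "Yes")
res <- count_levels(toy, legal = c("Yes", "No"))
res
```

In this sketch "yes " is counted as unexpected because the comparison is exact; the real functions may treat such cases differently.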
t_num will try to get a numeric variable (even if it is currently formatted as character or factor) and check how much of its content is expected (matches a desired range), unexpected, non-numeric or missing. num.table does the same, but for two or more variables at once.
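As a rough illustration of the four categories reported for a numeric variable, the following base-R sketch classifies a toy vector. The helper count_numeric is hypothetical, the bounds are assumed to be inclusive, and blank cells are assumed to count as missing; none of this is taken from the ems source code.

```r
# Hypothetical helper; assumption-based sketch, not the ems source code.
count_numeric <- function(x, num.min, num.max) {
  x <- as.character(x)                      # accepts factor, character or numeric input
  x[!is.na(x) & trimws(x) == ""] <- NA      # assumption: blank cells count as missing
  num <- suppressWarnings(as.numeric(x))    # non-numeric text becomes NA here
  counts <- c(
    expected    = sum(!is.na(num) & num >= num.min & num <= num.max),
    unexpected  = sum(!is.na(num) & (num < num.min | num > num.max)),
    non_numeric = sum(!is.na(x) & is.na(num)),
    missing     = sum(is.na(x))
  )
  data.frame(count = counts, fraction = round(counts / length(x), 3))
}

toy_num <- factor(c("25", "40", "Female", "", NA, "150", "18"))
res_num <- count_numeric(toy_num, num.min = 18, num.max = 110)
res_num
```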
t_date will try to get a date variable (even if it is currently formatted as character or factor) and check how much of its content is expected (matches a desired range), unexpected, non-date or missing. date.table does the same, but for two or more variables at once.
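The same classification can be sketched for dates in base R. The helper count_dates is hypothetical and uses a single fixed format string with inclusive bounds as simplifying assumptions; it does not reproduce how the package resolves format.date.

```r
# Hypothetical helper; assumption-based sketch, not the ems source code.
count_dates <- function(x, date.min, date.max, format.date = "%Y-%m-%d") {
  x <- as.character(x)
  x[!is.na(x) & trimws(x) == ""] <- NA      # assumption: blank cells count as missing
  d <- as.Date(x, format = format.date)     # strings not matching the format become NA
  counts <- c(
    expected   = sum(!is.na(d) & d >= date.min & d <= date.max),
    unexpected = sum(!is.na(d) & (d < date.min | d > date.max)),
    non_date   = sum(!is.na(x) & is.na(d)),
    missing    = sum(is.na(x))
  )
  data.frame(count = counts, fraction = round(counts / length(x), 3))
}

toy_date <- c("2013-03-01", "2013-12-25", "01/04/2013", "", NA)
res_date <- count_dates(toy_date,
                        date.min = as.Date("2013-02-28"),
                        date.max = as.Date("2013-11-30"))
res_date
```

A value written in a different format than the one supplied ends up in the non-date row, which mirrors the behaviour described for a misspecified date format.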
rm.unwanted will check in data the variables specified in the limits object according to the limits specified for each variable. If there are levels considered not valid in a factor variable, these levels are deleted. For example, if Sex is expected to be "M" and "F", and there is also an "I" level in data, every "I" is replaced by NA. Similarly, misspelled levels will be understood as non-valid levels and coerced to NA, with the exception of leading or trailing empty spaces and differences between lower and upper case if try.keep = TRUE. If there is a continuous numeric variable and it is expected to have values ranging from 30 to 700, the values outside this range, i.e. higher than 700 or lower than 30, are replaced by NA. Non-numeric elements, i.e. non-valid elements that should be numeric, will also be coerced to NA. If a variable is specified in num.limits, then it will be returned as a numeric variable, even if it was formatted as factor or character. If a variable is specified in limits, the returned format will depend on the stringAsFactors argument, unless it is formatted as logical, in which case it is skipped. The arguments limits and num.limits may be NULL, meaning that the factor-character variables or the numeric variables, respectively, will not be edited.
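The replacement rules above can be illustrated with two small base-R helpers. Both clean_factor and clean_numeric are hypothetical names, the sketch always returns a factor (it ignores the stringAsFactors choice), and the bounds are assumed to be inclusive; it is not the ems source code.

```r
# Hypothetical sketch of the replacement rules; not the ems source code.
clean_factor <- function(x, legal, try.keep = TRUE) {
  x <- as.character(x)
  if (try.keep) {
    # Map entries that differ only in case or surrounding spaces to the legal spelling
    idx <- match(tolower(trimws(x)), tolower(legal))
    x <- legal[idx]                         # unmatched entries become NA
  } else {
    x[!(x %in% legal)] <- NA                # only exact matches are kept
  }
  factor(x, levels = legal)
}

clean_numeric <- function(x, num.min, num.max) {
  num <- suppressWarnings(as.numeric(as.character(x)))  # non-numeric entries become NA
  num[!is.na(num) & (num < num.min | num > num.max)] <- NA
  num
}

sex  <- c("M", "F", "I", " m", "f ", NA)
chol <- c("30", "700", "701", "29", "abc", NA)
sex_clean  <- clean_factor(sex, legal = c("M", "F"))
chol_clean <- clean_numeric(chol, num.min = 30, num.max = 700)
sex_clean
chol_clean
```

With try.keep = FALSE in this sketch, " m" and "f " would also become NA because only exact matches survive.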
Usage
t_factor(
data,
variable,
legal,
var.labels = attr(data, "var.labels")[match(variable, names(data))],
digits = 3
)
factor.table(
data,
limits,
var.labels = attr(data, "var.labels")[match(unlist(sapply(seq_along(limits),
function(i) limits[[i]][1])), names(data))],
digits = 3
)
t_num(
data,
num.var,
num.max = 100,
num.min = 0,
var.labels = attr(data, "var.labels")[match(num.var, names(data))],
digits = 3
)
num.table(
data,
num.limits,
var.labels = attr(data, "var.labels")[match(num.limits$num.var, names(data))],
digits = 3
)
t_date(
data,
date.var,
date.max = as.Date("2010-11-30"),
date.min = as.Date("2010-01-31"),
format.date = "auto",
digits = 3,
var.labels = attr(data, "var.labels")[match(date.var, names(data))]
)
date.table(
data,
date.limits,
format.date = "auto",
digits = 3,
var.labels = attr(data, "var.labels")[match(date.limits$date.var, names(data))]
)
rm.unwanted(
data,
limits = NULL,
num.limits = NULL,
try.keep = TRUE,
stringAsFactors = TRUE
)
Arguments
data |
A data.frame where variables will be tested. |
variable |
A character vector of length one, indicating the name of the variable in the dataset to be tested. |
legal |
A character vector representing the expected levels of the tested variable. |
var.labels |
Variable labels for nicer output. Must be given in the same order as the variable argument. By default, it captures the labels stored in attr(data, "var.labels"), if any. If not given, the function returns the variable names. |
digits |
Number of decimal places for rounding. |
limits |
a list of two or more lists, each containing the arguments variable name and legal levels (in this order), to check on the factor variables. In the case of |
num.var |
A character vector indicating the name of a variable that should be numeric (although it may still be formatted as character or factor). |
num.max , num.min |
The maximal and minimal limits of acceptable range of a numeric variable. |
num.limits |
A data.frame with the following variables: num.var, num.max and num.min, representing the numeric variables names, maximal and minimal expected valid values. In the case of |
date.var |
A character vector indicating the name of a variable in data that should be a date (although it may still be formatted as character or factor). |
date.max , date.min |
The maximal and minimal limits of acceptable range of a date variable. |
format.date |
Default is "auto". If so, |
date.limits |
A |
try.keep |
Default is |
stringAsFactors |
In |
Author(s)
Lunna Borges & Pedro Brasil
Examples
# Simulating a dataset with 5 variables and assigning labels
y <- data.frame(Var1 = sample(c("Yes","No", "Ignored", "", "yes ", NA), 200, replace = TRUE),
Var2 = sample(c("Death","Discharge", "", NA), 200, replace = TRUE),
Var3 = sample(c(16:35, NA), 200, replace = TRUE),
Var4 = sample(c(12:300, "Female", "", NA), 200, replace = TRUE),
Var5 = sample(c(60:800), 200, replace = TRUE))
attr(y, "var.labels") <- c("Intervention use","Unit destination","BMI","Age","Cholesterol")
summary(y)
# Checking the quality of only the first variable
t_factor(y, "Var1", c("Yes","No","Ignored"))
# Checking two or more variables at once
factor.limits = list(list("Var1",c("Yes","No")),
list("Var2",c("Death","Discharge")))
factor.table(y, limits = factor.limits)
# Checking only one variable that should be numeric
t_num(y,"Var3", num.min = 17, num.max = 32)
# Making the limits data.frame
num.limits <- data.frame(num.var = c("Var3","Var4","Var5"),
num.min = c(17,18,70), num.max = c(32,110,300))
num.limits
# Checking two or more numeric variables (or the ones that
# should be numeric) at once
num.table(y, num.limits)
# Removing the unwanted values (extremes or not valid).
y <- rm.unwanted(data = y, limits = factor.limits,
num.limits = num.limits)
summary(y)
rm(y, num.limits, factor.limits)
# Loading a dataset and assigning labels
data(icu)
attr(icu, "var.labels")[match(c("UnitAdmissionDateTime","UnitDischargeDateTime",
"HospitalAdmissionDate", "HospitalDischargeDate"), names(icu))] <-
c("Unit admission","Unit discharge","Hospital admission","Hospital discharge")
# Checking only one variable that should be a date.
t_date(icu, "HospitalDischargeDate", date.max = as.Date("2013-10-30"),
date.min = as.Date("2013-02-20"))
# Checking a date variable misspecifying the date format
# will cause the variable dates to be identified as non-date values.
t_date(data = icu, date.var = "HospitalDischargeDate",
date.max = as.Date("2013-10-30"),
date.min = as.Date("2013-02-20"),
format.date = "%d/%m/%Y")
# Making a limit data.frame assuming an 'auto' format.date
d.lim <- data.frame(date.var = c("UnitAdmissionDateTime","UnitDischargeDateTime",
"HospitalAdmissionDate","HospitalDischargeDate"),
date.min = rep(as.Date("2013-02-28"), 4),
date.max = rep(as.Date("2013-11-30"), 4))
d.lim
# Checking two or more date variables (or the ones that should be dates) at once
date.table(data = icu, date.limits = d.lim)
# Making a limit data.frame specifying the format.date argument
# Here the last 'format.date' is misspecified on purpose,
# so the last date variable will be identified as non-date values.
d.lim <- data.frame(date.var = c("UnitAdmissionDateTime","UnitDischargeDateTime",
"HospitalAdmissionDate","HospitalDischargeDate"),
date.min = rep(as.Date("2013-02-28"), 4),
date.max = rep(as.Date("2013-11-30"), 4),
format.date = c(rep("%Y/%m/%d",3), "%Y-%m-%d"))
d.lim
# Checking the quality of the date variables with the new limits.
# The 'format.date = ""' is required to force the function to look for the format
# in the date.limits data.frame
date.table(data = icu, date.limits = d.lim, format.date = "")
rm(icu, d.lim)