mlim.na {mlim} | R Documentation |
add stratified/unstratified artificial missing observations
Description
to examine the performance of imputation algorithms, artificial missing data are added to a dataset and then imputed, so that the imputed values can be compared with the original observations. this function adds either stratified or unstratified artificial missing data. stratified missing data are particularly useful when categorical or ordinal variables are imbalanced, i.e., when one category appears at a much higher rate than the others.
Usage
mlim.na(x, p = 0.1, stratify = FALSE, classes = NULL, seed = NULL)
Arguments
x |
data.frame. x must be strictly a data.frame; other tabular classes, such as data.table, will be rejected |
p |
proportion of missingness to be added to the data, given as a number between 0 and 1 (e.g. 0.1 adds 10% missing values) |
stratify |
logical. if TRUE (the default is FALSE), stratified sampling is carried out when adding NA values to 'factor' variables (either ordered or unordered). this makes the evaluation of missing data imputation algorithms fairer, especially when the factor levels are imbalanced. |
classes |
character vector, specifying the variable classes that should be selected for adding NA values. the default value is NULL, meaning all variables will receive NA values with probability 'p'. however, if you wish to add NA values only to specific classes, e.g. 'numeric' variables or 'ordered' factors, specify them in this argument. for example, write "classes = c('numeric', 'ordered')" to add NAs only to numeric variables and ordered factors. |
seed |
integer. a random seed number for reproducing the result (recommended) |
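The difference between stratified and unstratified sampling, as described for the stratify argument above, can be illustrated with a minimal base R sketch. The helper add_na_factor below is hypothetical and is not part of mlim; it only shows the idea under the assumption that stratification means removing the same proportion within each factor level, and it may differ from what mlim.na does internally.

```r
# Hypothetical helper (NOT part of mlim): set a proportion p of a factor to NA.
add_na_factor <- function(f, p = 0.1, stratify = FALSE, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  if (stratify) {
    # Sample round(p * n_level) positions separately within each level,
    # so every level loses (about) the same share of its observations.
    groups <- split(seq_along(f), droplevels(f))
    idx <- unlist(lapply(groups, function(i) {
      i[sample.int(length(i), round(p * length(i)))]
    }), use.names = FALSE)
  } else {
    # Sample positions from the whole vector, ignoring the levels.
    idx <- sample.int(length(f), round(p * length(f)))
  }
  f[idx] <- NA
  f
}

# Imbalanced factor: 100 "M" versus 900 "F"
x <- factor(c(rep("M", 100), rep("F", 900)))
strat   <- add_na_factor(x, p = 0.5, stratify = TRUE,  seed = 1)
unstrat <- add_na_factor(x, p = 0.5, stratify = FALSE, seed = 1)

table(strat, useNA = "ifany")    # each level loses exactly half of its cases
table(unstrat, useNA = "ifany")  # the split of NAs across levels varies by seed
```

With stratification, the rare level "M" keeps exactly half of its 100 observations; without it, the number of "M" values removed depends on the random draw, which matters most when a level is rare.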
Value
data.frame
Author(s)
E. F. Haghish
Examples
## Not run:
# adding stratified NA to an atomic vector
x <- as.factor(c(rep("M", 100), rep("F", 900)))
table(mlim.na(x, p=.5, stratify = TRUE))
# adding unstratified NAs to all variables of a data.frame
data(iris)
mlim.na(iris, p=0.5, stratify = FALSE, seed = 1)
# or add stratified NAs only to factor variables, ignoring other variables
mlim.na(iris, p=0.5, stratify = TRUE, classes = "factor", seed = 1)
# or add unstratified NAs only to numeric variables
mlim.na(iris, p=0.5, classes = "numeric", seed = 1)
## End(Not run)
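The evaluation workflow outlined in the Description (add artificial NAs, impute, compare with the original values) can be sketched with base R alone. In this sketch the missing cells are created by hand rather than with mlim.na, and column-mean imputation is only a stand-in for whatever imputation algorithm is being evaluated; the names mask, holed and imputed are illustrative.

```r
data(iris)
set.seed(1)

# Keep the numeric columns as the "ground truth"
num <- iris[, sapply(iris, is.numeric)]

# Mark roughly 10% of the cells as artificially missing
mask <- matrix(runif(nrow(num) * ncol(num)) < 0.1, nrow = nrow(num))
holed <- num
holed[mask] <- NA

# Stand-in imputer: replace each NA with the observed column mean
imputed <- holed
for (j in seq_along(imputed)) {
  imputed[[j]][is.na(imputed[[j]])] <- mean(holed[[j]], na.rm = TRUE)
}

# Compare imputed and original values on the masked cells only
rmse <- sqrt(mean((as.matrix(num)[mask] - as.matrix(imputed)[mask])^2))
rmse
```

Restricting the error to the masked cells is what makes the comparison meaningful: the untouched cells are identical in both data sets and would otherwise deflate the error.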