mlim {mlim} | R Documentation |
missing data imputation with automated machine learning
Description
imputes a data.frame with mixed variable types using automated machine learning (AutoML)
Usage
mlim(
data = NULL,
m = 1,
algos = c("ELNET"),
postimpute = FALSE,
stochastic = m > 1,
ignore = NULL,
tuning_time = 900,
max_models = NULL,
maxiter = 10L,
cv = 10L,
matching = "AUTO",
autobalance = TRUE,
balance = NULL,
seed = NULL,
verbosity = NULL,
report = NULL,
tolerance = 0.001,
doublecheck = TRUE,
preimpute = "RF",
cpu = -1,
ram = NULL,
flush = FALSE,
preimputed.data = NULL,
save = NULL,
load = NULL,
shutdown = TRUE,
java = NULL,
...
)
Arguments
data |
a |
m |
integer, specifying number of multiple imputations. the default value is 1, carrying out a single imputation. |
algos |
character vector, specifying algorithms to be used for missing data imputation. supported algorithms are "ELNET", "RF", "GBM", "DL", "XGB", and "Ensemble". if more than one algorithm is specified, mlim changes behavior to save on runtime. the default is "ELNET", which fine-tunes an Elastic Net model. in general, "ELNET" is expected to be the best algorithm because it fine-tunes very fast and is very robust to over-fitting, and hence generalizes very well. however, if your data has many factor variables, each with several levels, it is recommended to use c("ELNET", "RF") as your imputation algorithms (and possibly add "Ensemble" as well, to make the most out of tuning the models). note that "XGB" is only available on Mac OS and Linux. moreover, "GBM", "DL", and "XGB" take the full given "tuning_time" (see below) to tune the best model for imputing the given variable, whereas "ELNET" produces only one fine-tuned model, often in less time than other algorithms need for developing a single model, which is why "ELNET" is the workhorse of the mlim imputation package. |
postimpute |
(EXPERIMENTAL FEATURE) logical. if TRUE, mlim uses algorithms other than 'ELNET' for carrying out postimputation optimization. however, if FALSE, all specified algorithms will be used together in the process of 'reimputation'. the 'Ensemble' algorithm is encouraged when other algorithms are used. however, for general users unspecialized in machine learning, postimpute is NOT recommended because this feature is currently experimental, prone to over-fitting, and highly computationally intensive. |
stochastic |
logical. by default it is set to TRUE for multiple imputation and FALSE for single imputation. the stochastic argument is currently under testing and is intended to avoid inflating the correlation between imputed variables. |
ignore |
character vector of column names or index of columns that should be ignored in the process of imputation. |
tuning_time |
integer. maximum runtime (in seconds) for fine-tuning the
imputation model for each variable in each iteration. the default
time is 900 seconds but for a large dataset, you
might need to provide a larger model development
time. this argument also influences |
max_models |
integer. maximum number of models that can be generated in
the process of fine-tuning the parameters. this value
defaults to 100, meaning that for imputing each variable in
each iteration, up to 100 models can be fine-tuned. increasing
this value should be consistent with increasing
|
maxiter |
integer. maximum number of iterations. the default value is |
cv |
integer. specifies the number of folds for k-fold Cross-Validation (CV). values of 10 or higher are recommended. default is 10. |
matching |
logical. if |
autobalance |
logical. if TRUE (default), binary and multinomial factor variables will be balanced before the imputation to obtain fairer and less-biased imputations, which would otherwise typically favor the majority class. if FALSE, imputation fairness will be sacrificed for overall accuracy, which is not recommended, although it is commonly practiced in other missing data imputation software. MLIM is highly concerned with imputation fairness for factor variables and autobalancing is generally recommended. in fact, higher overall accuracy does not mean a better imputation if minority classes are neglected, because this increases the bias in favor of the majority class. if you do not wish to autobalance all the factor variables, you can manually specify the variables that should be balanced using the 'balance' argument (see below). |
balance |
character vector, specifying variable names that should be balanced before imputation. balancing the prevalence might decrease the overall accuracy of the imputation, because it attempts to ensure the representation of the rare outcome. this argument is optional and intended for advanced users who impute a severely imbalanced categorical (nominal) variable. |
seed |
integer. specify the random generator seed |
verbosity |
character. controls how much information is printed to console. the value can be "warn" (default), "info", "debug", or NULL. |
report |
filename. if a filename is specified (e.g. report = "mlim.md"), the |
tolerance |
numeric. the minimum rate of improvement in estimated error metric
of a variable to qualify the imputation for another round of iteration,
if the |
doublecheck |
logical. default is TRUE (which is conservative). if FALSE, a variable whose estimated imputation error does not improve will not be reimputed in the following iterations. in general, deactivating this argument will slightly reduce the imputation accuracy; however, it significantly reduces the computation time. if your dataset is large, you are advised to set this argument to FALSE. (EXPERIMENTAL: consider that by avoiding several iterations that only marginally improve the imputation accuracy, you might gain higher accuracy by investing your computational resources in fine-tuning better algorithms such as "GBM") |
preimpute |
character. specifies the 'primary' procedure of handling the missing
data. before 'mlim' begins imputing the missing observations, they should
be prepared for the imputation algorithms and thus, they should be replaced
with some values.
the default procedure is a quick "RF", which models the missing
data with a parallel Random Forest model. this is a very fast procedure,
whose values will later be replaced within the "reimputation" procedure (see below).
possible other alternative is |
cpu |
integer. number of CPUs to be dedicated for the imputation. the default takes all of the available CPUs. |
ram |
integer. specifies the maximum size, in Gigabytes, of the memory allocation. by default, all the available memory is used for the imputation. large memory size is particularly advised, especially for multicore processes. the more you give the more you get! |
flush |
logical (experimental). if TRUE, the server is cleaned after each model to free up RAM. this feature is in testing mode and is currently set to FALSE by default, but it is recommended if you have a limited amount of RAM or large datasets. |
preimputed.data |
data.frame. if you have used another software for missing data imputation, you can still optimize the imputation by handing the data.frame to this argument, which will bypass the "preimpute" procedure. |
save |
filename (with .mlim extension). if a filename is specified, an |
load |
filename (with .mlim extension). an object of class "mlim", which includes the data, arguments, and settings for re-running the imputation from where it was previously stopped. the "mlim" object saves the current state of the imputation and is particularly recommended for large datasets or when the user specifies computationally intensive settings (e.g. specifying several algorithms, increasing tuning time, etc.). |
shutdown |
logical. if TRUE, h2o server is closed after the imputation. the default is TRUE |
java |
character, specifying the path to the executable 64-bit Java JDK on Microsoft Windows machines, if the JDK is installed but the path environment variable is not set. |
... |
arguments that are used internally between 'mlim' and 'mlim.postimpute'. these arguments are not documented in the help file and are not intended to be used by end user. |
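The reasoning behind 'autobalance' and 'balance' can be illustrated without running mlim at all. The following base-R sketch uses made-up labels (the vectors 'truth' and 'imputed' are hypothetical and are not produced by mlim) to show how an imputation that always fills in the majority class reaches a high overall accuracy while never recovering the minority class.

```r
# hypothetical true values of 100 missing cells in a binary factor:
# 90 belong to the majority class "no", 10 to the minority class "yes"
truth <- factor(c(rep("no", 90), rep("yes", 10)), levels = c("no", "yes"))

# a naive imputation that always fills in the majority class
imputed <- factor(rep("no", 100), levels = c("no", "yes"))

# overall accuracy: share of cells imputed correctly
accuracy <- mean(imputed == truth)

# per-class recall: share of each true class that was imputed correctly
recall <- tapply(imputed == truth, truth, mean)

print(accuracy)
print(recall)
```

In this toy case the overall accuracy is high although the recall for "yes" is zero, which is the kind of bias in favor of the majority class that balancing is meant to reduce.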
Value
a data.frame, showing the estimated imputation error from the cross validation within the data.frame's attributes
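Because the error estimates are carried by the returned data.frame itself, base R's attributes() and attr() should be enough to inspect them. The sketch below uses a stand-in data.frame and a made-up attribute name, "demo_metrics"; the attribute names actually set by mlim are not stated here, so list them with attributes() on a real result first.

```r
# stand-in for an imputed data.frame (NOT actual mlim output)
imputed_df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "a"))

# attach a hypothetical attribute holding a per-variable error estimate
attr(imputed_df, "demo_metrics") <- data.frame(variable = "y", error = 0.12)

# list every attribute name stored on the object
attr_names <- names(attributes(imputed_df))
print(attr_names)

# extract a single attribute by its name
metrics <- attr(imputed_df, "demo_metrics")
print(metrics)
```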
Author(s)
E. F. Haghish
Examples
## Not run:
data(iris)
# add stratified missing observations to the data. to make the example run
# faster, I add NAs only to a single variable.
dfNA <- iris
dfNA$Species <- mlim.na(dfNA$Species, p = 0.1, stratify = TRUE, seed = 2022)
# run the ELNET single imputation (fastest imputation via 'mlim')
MLIM <- mlim(dfNA, shutdown = FALSE)
# in single imputation, you can estimate the imputation accuracy via cross validation RMSE
mlim.summarize(MLIM)
### alternatively, carry out ELNET multiple imputation with a minimum of 5 datasets.
### to analyze the multiply imputed data, convert the result with the
### 'mlim.mids' function (the analysis below requires the 'mice' package)
MLIM2 <- mlim(dfNA, m = 5)
mids <- mlim.mids(MLIM2, dfNA)
fit <- with(data = mids, expr = glm(Species ~ Sepal.Length, family = "binomial"))
res <- mice::pool(fit)
summary(res)
# you can check the accuracy of the imputation, if you have the original dataset
mlim.error(MLIM2, dfNA, iris)
## End(Not run)
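The advice given under 'algos' for data with many multi-level factor variables can be sketched as follows. This is an illustration only: it assumes that the mlim package, h2o, and Java are installed, and the values chosen for 'tuning_time' and 'seed' are arbitrary. The call is guarded so that it only runs in an interactive session, because it starts an h2o server and may take a long time.

```r
# arguments following the 'algos' recommendation; the tuning_time and seed
# values are arbitrary choices for this sketch
args <- list(
  algos = c("ELNET", "RF", "Ensemble"),
  tuning_time = 1800,
  seed = 2022
)

# run only interactively and only if mlim is installed
if (interactive() && requireNamespace("mlim", quietly = TRUE)) {
  dfNA <- iris
  dfNA$Species <- mlim::mlim.na(dfNA$Species, p = 0.1, stratify = TRUE, seed = 2022)

  # pass the data together with the argument list to mlim
  MLIM3 <- do.call(mlim::mlim, c(list(data = dfNA), args))
  print(mlim::mlim.summarize(MLIM3))
}
```

Keeping the settings in a list makes it easy to rerun the same configuration on another dataset; a direct call such as mlim(dfNA, algos = c("ELNET", "RF", "Ensemble")) is equivalent for the 'algos' part.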