missForest {missForestPredict} | R Documentation |
Imputes a dataframe and returns imputation models to be used on new observations
Description
Imputes a dataframe and (if save_models = TRUE) returns imputation models to be used on new observations.
Usage
missForest(
xmis,
maxiter = 10,
fixed_maxiter = FALSE,
var_weights = NULL,
decreasing = FALSE,
initialization = "mean/mode",
x_init = NULL,
class.weights = NULL,
return_integer_as_integer = FALSE,
save_models = TRUE,
predictor_matrix = NULL,
proportion_usable_cases = c(1, 0),
verbose = TRUE,
convergence_error = "OOB",
...
)
Arguments
xmis |
dataframe containing missing values. It must be of class data.frame ("tibble" class tbl_df is also supported). Matrix format is not supported. See Details for column format. |
maxiter |
maximum number of iterations. By default the algorithm will stop when convergence is reached or after maxiter iterations, whichever occurs first. |
fixed_maxiter |
if set to TRUE, the algorithm will run for the exact number of iterations specified in maxiter, regardless of the convergence criteria. Default is FALSE. |
var_weights |
named vector of weights for each variable in the convergence criteria. The names should correspond to variable names. By default the weights are set to the proportion of missing values on each variable. |
decreasing |
(boolean) if TRUE, the variables are imputed in order of decreasing amount of missing values (the variable with the highest amount of missing values is imputed first). If FALSE, the variable with the lowest amount of missing values is imputed first. Default is FALSE. |
initialization |
initialization method before running RF models; supported: mean/mode, median/mode and custom. Default is mean/mode. |
x_init |
if |
class.weights |
a named list containing |
return_integer_as_integer |
Internally, integer columns are treated as double (double precision floating point numbers). If TRUE, the imputations will be rounded to the closest integer and returned as integer (this might be desirable for count variables). If FALSE, integer columns will be returned as double (this might be desirable, for example, for patient age imputation). Default is FALSE. The same behaviour will be applied to new observations when using missForestPredict. |
save_models |
if TRUE, imputation models are saved and a new observation (or a test set) can be imputed using the models learned. Saving models on a dataset with a high number of variables will occupy memory (RAM) on the machine. Default is TRUE. |
predictor_matrix |
predictor matrix indicating which variables to use in the imputation of each variable.
See documentation for function |
proportion_usable_cases |
a vector with two components: the first one is a minimum threshold for |
verbose |
(boolean) if TRUE then missForest returns OOB error estimates (MSE and NMSE) and runtime. |
convergence_error |
Which error should be used for the convergence criterion. Supported values: OOB and apparent. If a different value is provided, it defaults to OOB. See vignette for full details on convergence. |
... |
other arguments passed to ranger function (some arguments that are specific to each variable type are not supported).
See vignette for |
Details
An adaptation of the original missForest algorithm (Stekhoven and Bühlmann 2012) is used. Variables are initialized with a mean/mode, median/mode or custom imputation. Then, they are imputed iteratively "on the fly" for a maximum number of iterations or until the convergence criteria are met. The imputation sequence follows either increasing or decreasing amount of missing values. At each iteration, a random forest model is built for each variable, using as outcome the observed (non-missing) values of the variable and as predictors the values of the other variables. For the first variable in the sequence the predictors come from the previous iteration, while later variables also use the values already updated in the current iteration ("on the fly"). The ranger package (Wright and Ziegler 2017) is used for building the random forest models.
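The iteration scheme described above can be sketched in base R. This is a toy illustration under stated assumptions, not the package's implementation: it handles numeric columns only, uses `lm()` as a stand-in for the ranger random forest, runs a fixed three iterations instead of checking convergence, and skips variables without missing values. The data frame `x` and all variable names are hypothetical.

```r
# Toy numeric data with missing values (hypothetical example data)
set.seed(1)
x <- data.frame(a = rnorm(50), b = rnorm(50))
x$c <- x$a + x$b + rnorm(50, sd = 0.1)
x$a[1:5] <- NA
x$c[6:15] <- NA

miss <- is.na(x)  # remember which cells were originally missing

# Imputation sequence by increasing amount of missing values (decreasing = FALSE);
# variables with nothing to impute are skipped in this simplified sketch
n_miss <- sort(colSums(miss))
impute_sequence <- names(n_miss)[n_miss > 0]

# Mean initialization of the missing cells
ximp <- x
for (v in names(ximp)) ximp[[v]][miss[, v]] <- mean(x[[v]], na.rm = TRUE)

for (iter in 1:3) {
  for (v in impute_sequence) {
    obs <- !miss[, v]
    # lm() stands in for the random forest; fit only on rows where v was observed
    fit <- lm(reformulate(setdiff(names(ximp), v), response = v),
              data = ximp[obs, ])
    # Overwrite immediately, so later variables see the updated values ("on the fly")
    ximp[[v]][!obs] <- predict(fit, newdata = ximp[!obs, ])
  }
}
```

Only the originally missing cells are ever overwritten; observed values stay fixed across iterations.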
The convergence criterion is based on the out-of-bag (OOB) error or the apparent error and uses NMSE (normalized mean squared error) for both continuous and categorical variables.
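As a rough illustration of the quantities involved, the sketch below computes an NMSE for one continuous variable and a weighted average across variables. It assumes that NMSE is the MSE divided by the variance of the observed values and that per-variable errors are combined as a weighted mean using var_weights; the exact definitions used by the package are given in its vignette. The helper names `nmse` and `weighted_error` are hypothetical.

```r
# NMSE for one continuous variable (assumed definition: MSE / variance of observed values)
nmse <- function(observed, predicted) {
  mean((observed - predicted)^2) / mean((observed - mean(observed))^2)
}

# Weighted average of per-variable errors (weights play the role of var_weights)
weighted_error <- function(errors, weights) {
  sum(errors * weights[names(errors)]) / sum(weights[names(errors)])
}

observed <- c(1, 2, 3, 4)
nmse_mean    <- nmse(observed, rep(mean(observed), 4))  # predicting the mean
nmse_perfect <- nmse(observed, observed)                # perfect prediction

# Hypothetical per-variable errors, weighted by the proportion of missing values
errors  <- c(a = 0.5, b = 1.0)
weights <- c(a = 0.1, b = 0.3)
overall <- weighted_error(errors, weights)
```

Under this definition a model that only predicts the mean has an NMSE of 1 and a perfect model has an NMSE of 0, which makes errors comparable across variables measured on different scales.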
Imputation models for all variables and all iterations are saved (if save_models is TRUE) and can later be applied to new observations.
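A minimal sketch of this train/test workflow follows, assuming the package's `missForestPredict()` function accepts the fitted object and a `newdata` argument. The split, seed and object names are illustrative choices, and the block is skipped if the package is not installed.

```r
# Run only if the package is available
if (requireNamespace("missForestPredict", quietly = TRUE)) {
  library(missForestPredict)
  set.seed(2022)

  # Split iris into train and test sets and create missing values in both
  id_test <- sample(seq_len(nrow(iris)), floor(nrow(iris) / 3))
  iris_train_mis <- produce_NA(iris[-id_test, ], proportion = 0.1)
  iris_test_mis  <- produce_NA(iris[id_test, ], proportion = 0.1)

  # Impute the training set and keep the imputation models
  train_obj <- missForest(iris_train_mis, save_models = TRUE,
                          verbose = FALSE, num.threads = 2)

  # Impute the test set with the models learned on the training set
  iris_test_imp <- missForestPredict(train_obj, newdata = iris_test_mis)
}
```

Because the test set is imputed with models learned on the training set only, no information from the test observations leaks into the imputation models.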
Both dataframe and tibble (tbl_df class) are supported as input. The imputed dataframe will be returned with the same class. Numeric and integer columns are supported and treated internally as continuous variables. Factor and character columns are supported and treated internally as categorical variables. Other types (like boolean or dates) are not supported. NA values are considered missing values.
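Since boolean and date columns are not supported, one option is to convert them to supported types before imputation. The base R sketch below shows one possible conversion (logical to factor, Date to numeric days since 1970-01-01); the data frame `df` and its columns are hypothetical, and whether this encoding suits a given analysis is a modelling decision.

```r
# Hypothetical data with an unsupported logical column and an unsupported Date column
df <- data.frame(
  age    = c(34L, NA, 51L),
  smoker = c(TRUE, NA, FALSE),
  visit  = as.Date(c("2020-01-15", NA, "2020-03-01"))
)

prepared <- df
# Logical -> factor, so the column is treated as categorical; NA stays NA
prepared$smoker <- factor(prepared$smoker, levels = c(FALSE, TRUE))
# Date -> numeric (days since 1970-01-01), so the column is treated as continuous
prepared$visit <- as.numeric(prepared$visit)

column_classes <- sapply(prepared, class)
```

After imputation, the numeric date column can be converted back with `as.Date(x, origin = "1970-01-01")`.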
Value
Object of class missForest with elements
ximp |
dataframe with imputed values |
init |
x_init if custom initialization is used; otherwise list of mean/mode or median/mode for each variable |
initialization |
value of initialization parameter |
impute_sequence |
vector of variable names in the order in which imputation has been run |
maxiter |
maxiter parameter as passed to the function |
models |
list of random forest models for each iteration |
return_integer_as_integer |
Parameter return_integer_as_integer as passed to the function |
integer_columns |
list of columns of integer type in the data |
OOB_err |
dataframe with out-of-bag errors for each iteration and each variable |
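The elements listed above can be inspected directly on the returned object. The sketch below is illustrative only: the seed and object names are arbitrary, and the block is skipped if the package is not installed.

```r
# Run only if the package is available
if (requireNamespace("missForestPredict", quietly = TRUE)) {
  library(missForestPredict)
  set.seed(2022)
  iris_mis <- produce_NA(iris, proportion = 0.1)
  obj <- missForest(iris_mis, verbose = FALSE, num.threads = 2)

  print(names(obj))            # elements of the returned object
  print(obj$impute_sequence)   # order in which variables were imputed
  print(head(obj$OOB_err))     # OOB errors per iteration and variable
  print(anyNA(obj$ximp))       # whether any missing values remain
}
```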
References
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: 10.1093/bioinformatics/btr597
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1-17. doi: 10.18637/jss.v077.i01
Examples
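library(missForestPredict)
data(iris)
# introduce 10% missing values in every column
iris_mis <- produce_NA(iris, proportion = 0.1)
# impute; num.threads is passed on to ranger
imputation_object <- missForest(iris_mis, num.threads = 2)
# extract the imputed dataframe
iris_imp <- imputation_object$ximp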