missRanger {missRanger} | R Documentation |
Fast Imputation of Missing Values by Chained Random Forests
Description
Uses the "ranger" package (Wright & Ziegler) to do fast missing value imputation by
chained random forests, see Stekhoven & Buehlmann and Van Buuren & Groothuis-Oudshoorn.
Between the iterative model fitting steps, it offers the option of predictive mean matching.
First, this avoids imputation with values not present in the original data
(like a value 0.3334 in a 0-1 coded variable).
Second, predictive mean matching tries to raise the variance in the resulting
conditional distributions to a realistic level. This makes multiple imputation possible
by repeating the call to missRanger().
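The idea behind predictive mean matching can be sketched in base R. This is a minimal illustration, not the code used inside missRanger(): the helper pmm_sketch() and all numbers are hypothetical. For each raw prediction of a missing value, it looks up the k observed cases with the closest predictions and draws one of their observed values as the imputation.

```r
# Hypothetical helper (not part of missRanger): predictive mean matching
# for a single variable.
pmm_sketch <- function(pred_missing, pred_observed, y_observed, k = 3L) {
  stopifnot(length(pred_observed) == length(y_observed), k >= 1L)
  k <- min(k, length(y_observed))
  vapply(
    pred_missing,
    function(p) {
      # Indices of the k observed cases whose predictions are closest to p
      nearest <- order(abs(pred_observed - p))[seq_len(k)]
      # Draw one donor at random and return its observed value
      y_observed[nearest[sample.int(k, 1L)]]
    },
    FUN.VALUE = y_observed[1L]
  )
}

set.seed(1)
y_observed    <- c(0, 0, 1, 1, 1, 0)                    # 0-1 coded variable
pred_observed <- c(0.10, 0.25, 0.80, 0.90, 0.70, 0.30)  # predictions for observed rows
pred_missing  <- c(0.3334, 0.85)                        # raw predictions for missing rows

imputed <- pmm_sketch(pred_missing, pred_observed, y_observed, k = 3L)
imputed  # only values from y_observed can appear, never 0.3334
```

Because a donor is drawn at random among the k candidates, repeated calls can give different imputations whenever the candidates disagree, which is where the extra variance comes from.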
Usage
missRanger(
data,
formula = . ~ .,
pmm.k = 0L,
maxiter = 10L,
seed = NULL,
verbose = 1,
returnOOB = FALSE,
case.weights = NULL,
data_only = TRUE,
keep_forests = FALSE,
...
)
Arguments
data |
A |
formula |
A two-sided formula specifying variables to be imputed
(left hand side) and variables used to impute (right hand side).
Defaults to . ~ . |
pmm.k |
Number of candidate non-missing values to sample from in the predictive mean matching steps. 0 to avoid this step. |
maxiter |
Maximum number of chaining iterations. |
seed |
Integer seed to initialize the random generator. |
verbose |
Controls how much info is printed to screen.
0 to print nothing. 1 (default) to print a progress bar per iteration,
2 to print the OOB prediction error per iteration and variable
(1 minus R-squared for regression).
Furthermore, if |
returnOOB |
Logical flag. If TRUE, the final average out-of-bag prediction
errors per variable are added to the resulting data as attribute "oob".
Only relevant when |
case.weights |
Vector with non-negative case weights. |
data_only |
If |
keep_forests |
Should the random forests of the final imputations
be returned? The default is FALSE. |
... |
Arguments passed to |
Details
The iterative chaining stops as soon as maxiter is reached or when the average
out-of-bag (OOB) prediction error stops decreasing.
In the latter case, except for the first iteration, the second-to-last (= best)
imputed data is returned.
OOB prediction errors are quantified as 1 - R^2 for numeric variables, and as classification error otherwise. If a variable has been imputed only univariately, the value is 1.
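The stopping rule described above can be sketched in base R. The function stop_rule_sketch() and the error values are hypothetical and only illustrate the logic. Treating a tie as "not decreasing" is an assumption, and the special handling of the first iteration is simplified by always accepting it.

```r
# Hypothetical sketch of the stopping rule (not the package's internal code).
# mean_errors: average OOB prediction error per iteration, in order.
stop_rule_sketch <- function(mean_errors, maxiter = 10L) {
  best_iter  <- 0L
  best_error <- Inf  # simplification: the first iteration is always accepted
  for (iter in seq_len(min(maxiter, length(mean_errors)))) {
    # Stop once the average error no longer decreases; the previous
    # iteration then remains the best one.
    if (mean_errors[iter] >= best_error) {
      break
    }
    best_iter  <- iter
    best_error <- mean_errors[iter]
  }
  best_iter
}

# Made-up average errors: improvement stops at iteration 4
mean_errors <- c(0.42, 0.31, 0.28, 0.29, 0.27)
best_iter <- stop_rule_sketch(mean_errors)
best_iter
```

With these numbers, iteration 4 is worse than iteration 3, so the sketch stops there and reports iteration 3 as best. Iteration 5 is never reached.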
A note on mtry: Be careful when passing a non-default mtry to ranger::ranger()
because the number of available covariates might grow during the first iteration,
depending on the missingness pattern. Values NULL (default) and 1 are safe choices.
Additionally, recent versions of ranger::ranger() allow mtry to be a
single-argument function of the number of available covariates,
e.g., mtry = function(m) max(1, m %/% 3).
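To see why a function-valued mtry is safe, the example function from the note can be evaluated in base R for a growing number of available covariates. The names mtry_fun and mtry_values are illustrative only.

```r
# The example function from the note: about one third of the available
# covariates, but never fewer than one.
mtry_fun <- function(m) max(1, m %/% 3)

# Evaluate it for 1 to 10 available covariates
m <- 1:10
mtry_values <- vapply(m, mtry_fun, numeric(1))
data.frame(covariates = m, mtry = mtry_values)

# Assumed usage (requires missRanger and a sufficiently recent ranger):
# missRanger(irisWithNA, mtry = mtry_fun)
```

The result never exceeds the number of available covariates, so it stays valid even while that number is still small in the first iteration.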
Value
If data_only = TRUE, an imputed data.frame. Otherwise, a "missRanger" object with
the following elements that can be extracted via $:
- data: The imputed data.
- forests: When keep_forests = TRUE, a list of "ranger" models used to generate the imputed data. NULL otherwise.
- visit_seq: Variables to be imputed (in this order).
- impute_by: Variables used for imputation.
- best_iter: Best iteration.
- pred_errors: Per-iteration OOB prediction errors (1 - R^2 for regression, classification error otherwise).
- mean_pred_errors: Per-iteration averages of OOB prediction errors.
References
Wright, M. N. & Ziegler, A. (2016). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, in press. <arxiv.org/abs/1508.04409>.
Stekhoven, D. J. & Buehlmann, P. (2012). MissForest - nonparametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. https://doi.org/10.1093/bioinformatics/btr597.
Van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
Examples
irisWithNA <- generateNA(iris, seed = 34)
irisImputed <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100)
head(irisImputed)
head(irisWithNA)
## Not run:
# Extended output
imp <- missRanger(irisWithNA, pmm.k = 3, num.trees = 100, data_only = FALSE)
head(imp$data)
imp$pred_errors
# To also keep the random forests of the best iteration
imp <- missRanger(
  irisWithNA, pmm.k = 3, num.trees = 100, data_only = FALSE, keep_forests = TRUE
)
imp$forests$Sepal.Width
imp$pred_errors[imp$best_iter, "Sepal.Width"] # 1 - R-squared
## End(Not run)