missCforest {missCforest} | R Documentation
Ensemble Conditional Trees for Missing Data Imputation
Description
Single imputation based on the ensemble conditional trees (Cforest) algorithm.
Usage
missCforest(
dat,
formula = . ~ .,
ntree = 100L,
minsplit = 20L,
minbucket = 7L,
alpha = 0.05,
cores = 1
)
Arguments
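dat |
data.frame containing the missing values to be imputed. |
formula |
formula specifying the imputation model, of the form [imputed_variables ~ predictors]; see 'Imputation model specification'. |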
ntree |
number of trees to grow for the forest. |
minsplit |
minimum sum of weights in a node in order to be considered for splitting in a single tree. |
minbucket |
minimum sum of weights in a terminal node of a single tree. |
alpha |
statistical significance level (alpha) used for the tests that drive the recursive partitioning of each tree. |
cores |
number of cores to use, which in most cases is the number of child processes run simultaneously. The default is 1, as shown in Usage. |
Value
complete (i.e. imputed) data.frame.
Imputation model specification
The formula defining the imputation model has the form
[imputed_variables ~ predictors]
The variables to be imputed are specified on the left-hand side and
the predictors to be used for imputation are specified on the right-hand side of the formula.
The user can specify a customized imputation model using the formula argument.
By default, the latter is set to [. ~ .],
which corresponds to the situation where all variables that contain missing values are imputed using the remaining variables.
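To illustrate how the two sides of the formula are read, the following base R sketch resolves a formula into the variables to impute and the predictors used for each of them. The helper resolve_imputation_formula is hypothetical and is not part of missCforest; it only mirrors the description above, treating a dot on the left as "all variables containing missing values" and a dot on the right as "all other variables".

```r
# Hypothetical helper (not part of missCforest): mirrors the formula
# semantics described above using base R only.
resolve_imputation_formula <- function(formula, dat) {
  lhs <- all.vars(formula[[2]])  # variables named on the left-hand side
  rhs <- all.vars(formula[[3]])  # variables named on the right-hand side
  # "." on the left: every variable that contains at least one NA
  if (identical(lhs, ".")) lhs <- names(dat)[colSums(is.na(dat)) > 0]
  # "." on the right: every variable in the data
  if (identical(rhs, ".")) rhs <- names(dat)
  # a variable is never used as its own predictor
  predictors <- lapply(lhs, function(v) setdiff(rhs, v))
  names(predictors) <- lhs
  predictors
}

# small example: two incomplete variables in iris
dat <- iris
dat$Sepal.Length[c(1, 5)] <- NA
dat$Species[10] <- NA

# default model: each incomplete variable is predicted by the other four
default_model <- resolve_imputation_formula(. ~ ., dat)
names(default_model)

# customized model: impute Sepal.Length from two chosen predictors
custom_model <- resolve_imputation_formula(
  Sepal.Length ~ Petal.Length + Species, dat
)
custom_model
```

The list returned by the helper has one entry per variable to impute, which makes it easy to check what a given formula would request before running the imputation itself.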
Details
missCforest can be used to impute numerical, categorical, or mixed-type data. Missing values are imputed through ensemble prediction using Conditional Inference Trees (Ctree) as base learners (Hothorn, Hornik, and Zeileis 2006). Ctree is a non-parametric class of regression and classification trees that embeds recursive partitioning into the theory of conditional inference (Strasser and Weber 1999). The missCforest algorithm recasts the imputation problem as a prediction problem using a single imputation approach. Missing values are predicted iteratively from the set of complete cases, which is updated at each iteration. No stopping criterion is predefined: the imputation process ends when all missing values have been imputed. The algorithm is robust to outliers and pays particular attention to the association structure between the covariates (i.e. the variables used for imputation) and the outcome (i.e. the variable to be imputed), since the recursive partitioning of conditional trees is based on multiple test procedures.
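The iterative scheme described above can be sketched with cforest() from the partykit package. This is a simplified illustration under stated assumptions, not the package's actual implementation: the helper name impute_sketch, the choice to visit variables in column order, the reduced number of trees, and the use of cforest()'s default control settings are all assumptions made for the example. It also assumes that partykit is installed.

```r
library(partykit)

# Hypothetical sketch (not the missCforest source): impute one variable at a
# time, always training on the rows that are currently complete.
impute_sketch <- function(dat, ntree = 50L) {
  targets <- names(dat)[colSums(is.na(dat)) > 0]
  for (target in targets) {  # assumption: variables are visited in column order
    miss <- is.na(dat[[target]])
    # complete cases at this iteration, including rows completed earlier
    train <- dat[complete.cases(dat), , drop = FALSE]
    # conditional inference forest with the target as outcome and all
    # other variables as covariates
    fit <- cforest(reformulate(".", response = target),
                   data = train, ntree = ntree)
    # predict the missing entries of the target; rows that become complete
    # join the training set for the variables handled later
    dat[[target]][miss] <- predict(fit,
                                   newdata = dat[miss, , drop = FALSE],
                                   type = "response")
  }
  dat
}

# usage: remove 10% of the values in each column of iris, then impute
set.seed(1)
irisNA <- iris
for (v in names(irisNA)) irisNA[sample(nrow(irisNA), 15), v] <- NA
irisImp <- impute_sketch(irisNA)
anyNA(irisImp)
```

The loop needs no explicit convergence check, which matches the description above: it ends once every variable with missing values has been visited and filled in.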
References
Hothorn T, Hornik K, Zeileis A (2006). "Unbiased Recursive Partitioning: A Conditional Inference Framework." Journal of Computational and Graphical Statistics, 15(3), 651–674.
Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." BMC Bioinformatics, 8(1), 1–21.
Strasser H, Weber C (1999). "On the Asymptotic Theory of Permutation Statistics." Mathematical Methods of Statistics, 8, 220–250.
Examples
library(missCforest)
# import the iris dataset
data(iris)
# randomly introduce 30% of missing values (NA) into the variables
irisNA <- generateNA(iris, 0.3)
summary(irisNA)
# impute all the missing values with the default imputation model:
# every variable with missing values is imputed using the remaining variables
irisImp <- missCforest(irisNA, . ~ .)
summary(irisImp)