mice.impute.SuperLearner {superMICE} | R Documentation |
SuperLearner method for mice
package.
Description
Method for the mice
package that uses SuperLearner as the predctive
algorithm. Model fitting is done using the SuperLearner
package.
Usage
mice.impute.SuperLearner(
y,
ry,
x,
wy = NULL,
SL.library,
kernel = c("gaussian", "uniform", "triangular"),
bw = c(0.1, 0.2, 0.25, 0.3, 0.5, 1, 2.5, 5, 10, 20),
bw.update = TRUE,
...
)
Arguments
y |
Vector to be imputed |
ry |
Logical vector of length |
x |
Numeric design matrix with |
wy |
Logical vector of length |
SL.library |
For SuperLearner: Either a character vector of prediction algorithms or list containing character vectors as specified by the SuperLearner package. See details below. |
kernel |
One of "gaussian", "uniform", "triangular". Kernel function used to compute weights. |
bw |
NULL or numeric value for bandwidth of kernel function (as standard deviations of the kernel). |
bw.update |
logical indicating whether bandwidths should be computed
every iteration or only on the first iteration. Default is |
... |
Further arguments passed to |
Details
mice.impute.SuperLearner()
is a method for use with the mice() function that
implements the ensemble predictive model, SuperLearner (van der Laan, 2011),
into the mice (van Buuren, 2011) multiple imputation procedure. This function
is never called directly, instead a user that wishes to use SuperLearner
in MICE simply needs to set the argument method = "SuperLearner"
in the
call to mice()
. Arguments for the SuperLearner()
function are passed from mice as extra arguments in the mice()
call.
All MICE methods randomly generate imputed values for a number of data sets.
The approach of SuperMICE is to estimate parameters for a normal distribution
centered at the point estimate for an imputed value predicted by a SuperLearner model.
The point estimates are obtained by fitting a selection of different
predictive models on complete cases and determining an optimal weighted average
of candidate models to predict the missing cases. SuperMICE uses the implementation
of SuperLearner found in the SuperLearner package.
The models to be used with SuperLearner()
are supplied by the user as a
character vector. For a full list of available methods see
listWrappers()
.
SuperLearner models do not produce standard errors for estimates, so instead
we use a kernel based estimate of local variance around each point estimate
as the variance parameter in the normal distribution used to randomly sample values.
The kernel can be set by the user with the kernel
argument as either
a gaussian kernel, uniform kernel, or triangular kernel. The user must also
supply a list of candidate bandwidths in the bw
argument as a numeric
vector. For more information on the variance and bandwidth selection
see Laqueur, et. al (2021). In every iteration the mice procedure, the optimal
bandwidth is reselected. This may be changed to select the bandwidth only
on the first iteration to speed up the total run time of the imputation by
changing bw.update
to FALSE
; however this may bias your results.
Note that this only applies to continuous response variables. In the binary
case the variance is a function of the SuperLearner estimate.
Value
Vector with imputed data, same type as y
, and of length sum(wy)
References
Laqueur, H. S., Shev, A. B., Kagawa, R. M. C. (2021). SuperMICE: An Ensemble Machine Learning Approach to Multiple Imputation by Chained Equations. American Journal of Epidemiology, kwab271, doi:10.1093/aje/kwab271.
Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice
: Multivariate
Imputation by Chained Equations in R
. Journal of Statistical Software,
45(3), 1-67. doi:10.18637/jss.v045.i03.
van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2008) Super Learner, Statistical Applications of Genetics and Molecular Biology, 6, article 25.
See Also
Examples
#Multiple imputation with missingness on a continuous variable.
#Randomly generated data with missingness in x2. The probability of x2
# being missing increases with with value of x1.
n <- 20
pmissing <- 0.10
x1 <- runif(n, min = -3, max = 3)
x2 <- x1^2 + rnorm(n, mean = 0, sd = 1)
error <- rnorm(n, mean = 0, sd = 1)
y <- x1 + x2 + error
f <- ecdf(x1)
x2 <- ifelse(runif(x2) < (f(x1) * 2 * pmissing), NA, x2)
dat <- data.frame(y, x1, x2)
#Create vector of SuperLearner method names
# Note: see SuperLearner::listWrappers() for a full list of methods
# available.
SL.lib <- c("SL.mean", "SL.glm")
#Run mice().
# Note 1: m >= 30 and maxit >= 10 are recommended outside of this
# toy example
# Note 2: a denser bandwidth grid is recommended, see default for bw
# argument for example.
imp.SL <- mice::mice(dat, m = 2, maxit = 2,
method = "SuperLearner",
print = TRUE, SL.library = SL.lib,
kernel = "gaussian",
bw = c(0.25, 1, 5))