mi_pre {pre} | R Documentation |
Fit a prediction rule ensemble to multiply-imputed data (experimental)
Description
Function mi_pre derives a sparse ensemble of rules and/or linear rules based on imputed data. The function is still experimental, so use it at your own risk.
Usage
mi_pre(
formula,
data,
weights = NULL,
obs_ids = NULL,
compl_frac = NULL,
nfolds = 10L,
sampfrac = 0.5,
...
)
Arguments
formula |
a symbolic description of the model to be fit of the form
|
data |
A list of imputed datasets. The datasets must have identically named columns, but need not have the same number of rows (this can happen, for example, if a bootstrap sampling approach was employed for multiple imputation). |
weights |
A list of observation weights for each observation in each
imputed dataset. The list must have the same length as |
obs_ids |
A list of observation ids, corresponding to the id in the
original data, of each observation in each imputed dataset. Defaults to
|
compl_frac |
An optional list specifying the fraction of observed values
for each observation. This will be used to compute observation weights as
a function of the fraction of complete data per observation, as per
Wan et al. (2015). Note that this is only recommended for users who
know the risks (i.e., an analysis more like complete-case analysis).
If specified, the list must have
the same length as |
nfolds |
positive integer. Number of cross-validation folds to be used for
selecting the optimal value of the penalty parameter |
sampfrac |
numeric value |
... |
Further arguments to be passed to
|
Details
Experimental function to fit a prediction rule ensemble to multiply imputed data. Essentially, it is a wrapper around function pre(); the main differences relate to sampling for the tree induction and fold assignment for estimation of the coefficients for the final ensemble.
Function mi_pre implements a so-called stacking approach to the analysis of imputed data (see also Wood et al., 2008), where imputed datasets are combined into one large dataset.
In addition to adjustments of the sampling procedures, adjustments to the observation weights are made to counter the artificial inflation of sample size.
Observations that occur repeatedly across the imputed datasets are included in or excluded from each sample or fold as a whole, to avoid overfitting. Thus, complete observations rather than individual imputed observations are sampled, both for tree and rule induction and for the cross-validation that selects the penalty parameter values for the final ensemble.
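The idea of stacking with whole-observation fold assignment can be sketched in base R. This is an illustration only, not the internal code of mi_pre: the toy data, the id column, and the uniform weight of 1 / M per copy are assumptions made for the example.

```r
# Toy stand-in for M = 3 imputed datasets of the same 6 observations
# (hypothetical data; only column x differs between imputations).
set.seed(1)
M <- 3
n <- 6
imp <- lapply(seq_len(M), function(m) {
  data.frame(id = seq_len(n), x = rnorm(n), y = seq_len(n))
})

# Stack the imputed datasets into one long dataset.
stacked <- do.call(rbind, imp)

# Assumed weighting: each copy of an observation gets weight 1 / M,
# so every original observation has a total weight of 1.
stacked$w <- 1 / M

# Assign folds per original observation, then map them to the stacked rows,
# so that all copies of an observation land in the same fold.
nfolds <- 3
ids <- unique(stacked$id)
fold_of_id <- sample(rep(seq_len(nfolds), length.out = length(ids)))
stacked$fold <- fold_of_id[match(stacked$id, ids)]

# Number of distinct folds each original observation appears in.
folds_per_id <- tapply(stacked$fold, stacked$id, function(f) length(unique(f)))
```

Because folds are drawn per id rather than per row, no copy of an observation can end up in the training part of a fold while another copy sits in the held-out part.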
It is assumed that the data have already been imputed (e.g., using R package mice or missForest); function mi_pre therefore takes a list of imputed datasets as input data.
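Before fitting, it can help to verify that the list meets the requirement on column names. The helper below is hypothetical (check_imputed_list is not part of package pre) and assumes that every list element is a data frame; it compares column names irrespective of order and allows differing numbers of rows.

```r
# Hypothetical helper (not part of package pre): check that every element of
# a list of imputed datasets is a data frame with the same set of column names.
check_imputed_list <- function(data) {
  stopifnot(is.list(data), length(data) > 0)
  is_df <- vapply(data, is.data.frame, logical(1))
  if (!all(is_df)) stop("All elements of 'data' must be data frames.")
  ref <- sort(names(data[[1]]))
  same_names <- vapply(data, function(d) identical(sort(names(d)), ref), logical(1))
  if (!all(same_names)) stop("All datasets must have identically named columns.")
  invisible(TRUE)
}

# Row counts may differ between datasets, column names may not.
ok  <- list(data.frame(a = 1:3, b = 4:6), data.frame(a = 1:2, b = 3:4))
bad <- list(data.frame(a = 1:3, b = 4:6), data.frame(a = 1:3, c = 4:6))

check_imputed_list(ok)
res_bad <- tryCatch(check_imputed_list(bad), error = function(e) conditionMessage(e))
```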
Although the option to use the fraction of complete data for computing
observation weights is provided through argument compl_frac, users
are not advised to use it. See, e.g., Du et al. (2022): "An alternative weight
specification, proposed in Wan et al. (2015), is o_i = f_i/D, where f_i is
the number of observed predictors out of the total number of predictors for
subject i [...] upweighting subjects with less missingness and downweighting
subjects with more missingness can, in some sense, be viewed as making the
optimization more like complete-case analysis, which might be problematic
for Missing at Random (MAR) and Missing not at Random (MNAR) scenarios."
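The weight o_i = f_i/D from the quote can be illustrated with base R. This is a sketch under assumptions: D is read here as the number of imputed datasets (the quote does not define it), f_i is computed over the predictors only, and how the fractions should be arranged into the list expected by compl_frac is not shown.

```r
# airquality ships with base R and contains missing values.
dat <- airquality
predictors <- setdiff(names(dat), "Ozone")

# f_i: fraction of observed (non-missing) predictor values per observation.
f <- rowMeans(!is.na(dat[, predictors]))

# Assumption: D is the number of imputed datasets (here 5).
D <- 5
o <- f / D

# Fully observed rows get the uniform stacking weight 1 / D;
# rows with missing predictor values get a smaller weight.
table(round(o, 3))
```

The downweighting of incomplete rows is what moves the analysis toward a complete-case analysis, which is the concern raised in the quote.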
Value
An object of class pre.
References
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. doi:10.1080/10618600.2022.2035739.
Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17), 3227-3246. doi:10.1002/sim.3177.
See Also
Examples
library("mice")
set.seed(42)
## Shoot extra holes in airquality data (25 additional NAs per column)
airq <- sapply(airquality, function(x) {
  x[sample(1:nrow(airquality), size = 25)] <- NA
  return(x)
})
## impute the data
imp <- mice(airq, m = 5)
imp <- as.list(complete(imp, action = "all"))
## fit a rule ensemble to the imputed data
set.seed(42)
airq.ens.mi <- mi_pre(Ozone ~ . , data = imp)