aregImpute {Hmisc}  R Documentation 
Multiple Imputation using Additive Regression, Bootstrapping, and Predictive Mean Matching
Description
The transcan
function creates flexible additive imputation models
but provides only an approximation to true multiple imputation as the
imputation models are fixed before all multiple imputations are
drawn. This ignores variability caused by having to fit the
imputation models. aregImpute
takes all aspects of uncertainty in
the imputations into account by using the bootstrap to approximate the
process of drawing predicted values from a full Bayesian predictive
distribution. Different bootstrap resamples are used for each of the
multiple imputations, i.e., for the i
th imputation of a sometimes
missing variable, i=1,2,... n.impute
, a flexible additive
model is fitted on a sample with replacement from the original data and
this model is used to predict all of the original missing and
nonmissing values for the target variable.
areg
is used to fit the imputation models. By default, linearity
is assumed for target variables (variables being imputed) and
nk=3
knots are assumed for continuous predictors transformed
using restricted cubic splines. If nk
is three or greater and
tlinear
is set to FALSE
, areg
simultaneously finds transformations of the target variable and of all of
the predictors, to get a good fit assuming additivity, maximizing
R^2
, using the same canonical correlation method as
transcan
. Flexible transformations may be overridden for
specific variables by specifying the identity transformation for them.
When a categorical variable is being predicted, the flexible
transformation is Fisher's optimum scoring method. Nonlinear transformations for continuous variables may be nonmonotonic. If
nk
is a vector, areg
's bootstrap and crossval=10
options will be used to help find the optimum validating value of
nk
over values of that vector, at the last imputation iteration.
For the imputations, the minimum value of nk
is used.
Instead of defaulting to taking random draws from fitted imputation
models using random residuals as is done by transcan
,
aregImpute
by default uses predictive mean matching with optional
weighted probability sampling of donors rather than using only the
closest match. Predictive mean matching works for binary, categorical,
and continuous variables without the need for iterative maximum
likelihood fitting for binary and categorical variables, and without the
need for computing residuals or for curtailing imputed values to be in
the range of actual data. Predictive mean matching is especially
attractive when the variable being imputed is also being transformed
automatically. Constraints may be placed on variables being imputed
with predictive mean matching, e.g., a missing hospital discharge date
may be required to be imputed from a donor observation whose discharge
date is before the recipient subject's first postdischarge visit date.
See Details below for more information about the
algorithm. A "regression"
method is also available that is
similar to that used in transcan
. This option should be used
when mechanistic missingness requires the use of extrapolation during
imputation.
A print
method summarizes the results, and a plot
method plots
distributions of imputed values. Typically, fit.mult.impute
will
be called after aregImpute
.
If a target variable is transformed nonlinearly (i.e., if nk
is
greater than zero and tlinear
is set to FALSE
) and the
estimated target variable transformation is nonmonotonic, imputed
values are not unique. When type='regression'
, a random choice
of possible inverse values is made.
The reformM
function provides two ways of recreating a formula to
give to aregImpute
by reordering the variables in the formula.
This is a modified version of a function written by Yong Hao Pua. One
can specify nperm
to obtain a list of nperm
randomly
permuted variables. The list is converted to a single ordinary formula
if nperm=1
. If nperm
is omitted, variables are sorted in
descending order of the number of NA
s. reformM
also
prints a recommended number of multiple imputations to use, which is a
minimum of 5 and the percent of incomplete observations.
Usage
aregImpute(formula, data, subset, n.impute=5, group=NULL,
nk=3, tlinear=TRUE, type=c('pmm','regression','normpmm'),
pmmtype=1, match=c('weighted','closest','kclosest'),
kclosest=3, fweighted=0.2,
curtail=TRUE, constraint=NULL,
boot.method=c('simple', 'approximate bayesian'),
burnin=3, x=FALSE, pr=TRUE, plotTrans=FALSE, tolerance=NULL, B=75)
## S3 method for class 'aregImpute'
print(x, digits=3, ...)
## S3 method for class 'aregImpute'
plot(x, nclass=NULL, type=c('ecdf','hist'),
datadensity=c("hist", "none", "rug", "density"),
diagnostics=FALSE, maxn=10, ...)
reformM(formula, data, nperm)
Arguments
formula 
an S model formula. You can specify restrictions for transformations
of variables. The function automatically determines which variables
are categorical (i.e., 
x 
an object created by 
data 
input raw data 
subset 
These may be also be specified. You may not specify 
n.impute 
number of multiple imputations. 
group 
a character or factor variable the same length as the
number of observations in 
nk 
number of knots to use for continuous variables. When both
the target variable and the predictors are having optimum
transformations estimated, there is more instability than with normal
regression so the complexity of the model should decrease more sharply
as the sample size decreases. Hence set 
tlinear 
set to 
type 
The default is 
pmmtype 
type of matching to be used for predictive mean
matching when 
match 
Defaults to 
kclosest 
see 
fweighted 
Smoothing parameter (multiple of mean absolute difference) used when

curtail 
applies if 
constraint 
for predictive mean matching 
boot.method 
By default, simple boostrapping is used in which the
target variable is predicted using a sample with replacement from the
observations with nonmissing target variable. Specify

burnin 

pr 
set to 
plotTrans 
set to 
tolerance 
singularity criterion; list the source code in the

B 
number of bootstrap resamples to use if 
digits 
number of digits for printing 
nclass 
number of bins to use in drawing histogram 
datadensity 
see 
diagnostics 
Specify 
maxn 
Maximum number of observations shown for diagnostics. Default is

nperm 
number of random formula permutations for 
... 
other arguments that are ignored 
Details
The sequence of steps used by the aregImpute
algorithm is the
following.
(1) For each variable containing m NA
s where m > 0, initialize the
NA
s to values from a random sample (without replacement if
a sufficient number of nonmissing values exist) of size m from the
nonmissing values.
(2) For burnin+n.impute
iterations do the following steps. The
first burnin
iterations provide a burnin, and imputations are
saved only from the last n.impute
iterations.
(3) For each variable containing any NA
s, draw a sample with
replacement from the observations in the entire dataset in which the
current variable being imputed is nonmissing. Fit a flexible
additive model to predict this target variable while finding the
optimum transformation of it (unless the identity
transformation is forced). Use this fitted flexible model to
predict the target variable in all of the original observations.
Impute each missing value of the target variable with the observed
value whose predicted transformed value is closest to the predicted
transformed value of the missing value (if match="closest"
and
type="pmm"
),
or use a draw from a multinomial distribution with probabilities derived
from distance weights, if match="weighted"
(the default).
(4) After these imputations are computed, use these random draw
imputations the next time the curent target variable is used as a
predictor of other sometimesmissing variables.
When match="closest"
, predictive mean matching does not work well
when fewer than 3 variables are used to predict the target variable,
because many of the multiple imputations for an observation will be
identical. In the extreme case of one righthandside variable and
assuming that only monotonic transformations of left and rightside
variables are allowed, every bootstrap resample will give predicted
values of the target variable that are monotonically related to
predicted values from every other bootstrap resample. The same is true
for Bayesian predicted values. This causes predictive mean matching to
always match on the same donor observation.
When the missingness mechanism for a variable is so systematic that the
distribution of observed values is truncated, predictive mean matching
does not work. It will only yield imputed values that are near observed
values, so intervals in which no values are observed will not be
populated by imputed values. For this case, the only hope is to make
regression assumptions and use extrapolation. With
type="regression"
, aregImpute
will use linear
extrapolation to obtain a (hopefully) reasonable distribution of imputed
values. The "regression"
option causes aregImpute
to
impute missing values by adding a random sample of residuals (with
replacement if there are more NA
s than measured values) on the
transformed scale of the target variable. After random residuals are
added, predicted random draws are obtained on the original untransformed
scale using reverse linear interpolation on the table of original and
transformed target values (linear extrapolation when a random residual
is large enough to put the random draw prediction outside the range of
observed values). The bootstrap is used as with type="pmm"
to
factor in the uncertainty of the imputation model.
As model uncertainty is high when the transformation of a target
variable is unknown, tlinear
defaults to TRUE
to limit the
variance in predicted values when nk
is positive.
Value
a list of class "aregImpute"
containing the following elements:
call 
the function call expression 
formula 
the formula specified to 
match 
the 
fweighted 
the 
n 
total number of observations in input dataset 
p 
number of variables 
na 
list of subscripts of observations for which values were originally missing 
nna 
named vector containing the numbers of missing values in the data 
type 
vector of types of transformations used for each variable
( 
tlinear 
value of 
nk 
number of knots used for smooth transformations 
cat.levels 
list containing character vectors specifying the 
df 
degrees of freedom (number of parameters estimated) for each variable 
n.impute 
number of multiple imputations per missing value 
imputed 
a list containing matrices of imputed values in the same format as
those created by 
x 
if 
rsq 
for the last round of imputations, a vector containing the Rsquares
with which each sometimesmissing variable could be predicted from the
others by 
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
van Buuren, Stef. Flexible Imputation of Missing Data. Chapman & Hall/CRC, Boca Raton FL, 2012.
Little R, An H. Robust likelihoodbased analysis of multivariate data with missing values. Statistica Sinica 14:949968, 2004.
van Buuren S, Brand JPL, GroothuisOudshoorn CGM, Rubin DB. Fully conditional specifications in multivariate imputation. J Stat Comp Sim 72:10491064, 2006.
de Groot JAH, Janssen KJM, Zwinderman AH, Moons KGM, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 27:58805889, 2008.
Siddique J. Multiple imputation using an iterative hotdeck with distancebased donor selection. Stat Med 27:83102, 2008.
White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Stat Med 30:377399, 2011.
Curnow E, Carpenter JR, Heron JE, et al: Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are misspecified. J Clin Epi June 9, 2023. DOI:10.1016/j.jclinepi.2023.06.011.
See Also
fit.mult.impute
, transcan
, areg
, naclus
, naplot
, mice
,
dotchart3
, Ecdf
, completer
Examples
# Check that aregImpute can almost exactly estimate missing values when
# there is a perfect nonlinear relationship between two variables
# Fit restricted cubic splines with 4 knots for x1 and x2, linear for x3
set.seed(3)
x1 < rnorm(200)
x2 < x1^2
x3 < runif(200)
m < 30
x2[1:m] < NA
a < aregImpute(~x1+x2+I(x3), n.impute=5, nk=4, match='closest')
a
matplot(x1[1:m]^2, a$imputed$x2)
abline(a=0, b=1, lty=2)
x1[1:m]^2
a$imputed$x2
# Multiple imputation and estimation of variances and covariances of
# regression coefficient estimates accounting for imputation
# Example 1: large sample size, much missing data, no overlap in
# NAs across variables
x1 < factor(sample(c('a','b','c'),1000,TRUE))
x2 < (x1=='b') + 3*(x1=='c') + rnorm(1000,0,2)
x3 < rnorm(1000)
y < x2 + 1*(x1=='c') + .2*x3 + rnorm(1000,0,2)
orig.x1 < x1[1:250]
orig.x2 < x2[251:350]
x1[1:250] < NA
x2[251:350] < NA
d < data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
# Find value of nk that yields best validating imputation models
# tlinear=FALSE means to not force the target variable to be linear
f < aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), tlinear=FALSE,
data=d, B=10) # normally B=75
f
# Try forcing target variable (x1, then x2) to be linear while allowing
# predictors to be nonlinear (could also say tlinear=TRUE)
f < aregImpute(~y + x1 + x2 + x3, nk=c(0,3:5), data=d, B=10)
f
## Not run:
# Use 100 imputations to better check against individual true values
f < aregImpute(~y + x1 + x2 + x3, n.impute=100, data=d)
f
par(mfrow=c(2,1))
plot(f)
modecat < function(u) {
tab < table(u)
as.numeric(names(tab)[tab==max(tab)][1])
}
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi < fit.mult.impute(y ~ x1 + x2 + x3, lm, f,
data=d)
sqrt(diag(vcov(fmi)))
fcc < lm(y ~ x1 + x2 + x3)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Example 2: Very discriminating imputation models,
# x1 and x2 have some NAs on the same rows, smaller n
set.seed(5)
x1 < factor(sample(c('a','b','c'),100,TRUE))
x2 < (x1=='b') + 3*(x1=='c') + rnorm(100,0,.4)
x3 < rnorm(100)
y < x2 + 1*(x1=='c') + .2*x3 + rnorm(100,0,.4)
orig.x1 < x1[1:20]
orig.x2 < x2[18:23]
x1[1:20] < NA
x2[18:23] < NA
#x2[21:25] < NA
d < data.frame(x1,x2,x3,y, stringsAsFactors=TRUE)
n < naclus(d)
plot(n); naplot(n) # Show patterns of NAs
# 100 imputations to study them; normally use 5 or 10
f < aregImpute(~y + x1 + x2 + x3, n.impute=100, nk=0, data=d)
par(mfrow=c(2,3))
plot(f, diagnostics=TRUE, maxn=2)
# Note: diagnostics=TRUE makes graphs similar to those made by:
# r < range(f$imputed$x2, orig.x2)
# for(i in 1:6) { # use 1:2 to mimic maxn=2
# plot(1:100, f$imputed$x2[i,], ylim=r,
# ylab=paste("Imputations for Obs.",i))
# abline(h=orig.x2[i],lty=2)
# }
table(orig.x1,apply(f$imputed$x1, 1, modecat))
par(mfrow=c(1,1))
plot(orig.x2, apply(f$imputed$x2, 1, mean))
fmi < fit.mult.impute(y ~ x1 + x2, lm, f,
data=d)
sqrt(diag(vcov(fmi)))
fcc < lm(y ~ x1 + x2)
summary(fcc) # SEs are larger than from mult. imputation
## End(Not run)
## Not run:
# Study relationship between smoothing parameter for weighting function
# (multiplier of mean absolute distance of transformed predicted
# values, used in tricube weighting function) and standard deviation
# of multiple imputations. SDs are computed from average variances
# across subjects. match="closest" same as match="weighted" with
# small value of fweighted.
# This example also shows problems with predicted mean
# matching almost always giving the same imputed values when there is
# only one predictor (regression coefficients change over multiple
# imputations but predicted values are virtually 11 functions of each
# other)
set.seed(23)
x < runif(200)
y < x + runif(200, .05, .05)
r < resid(lsfit(x,y))
rmse < sqrt(sum(r^2)/(2002)) # sqrt of residual MSE
y[1:20] < NA
d < data.frame(x,y)
f < aregImpute(~ x + y, n.impute=10, match='closest', data=d)
# As an aside here is how to create a completed dataset for imputation
# number 3 as fit.mult.impute would do automatically. In this degenerate
# case changing 3 to 12,410 will not alter the results.
imputed < impute.transcan(f, imputation=3, data=d, list.out=TRUE,
pr=FALSE, check=FALSE)
sd < sqrt(mean(apply(f$imputed$y, 1, var)))
ss < c(0, .01, .02, seq(.05, 1, length=20))
sds < ss; sds[1] < sd
for(i in 2:length(ss)) {
f < aregImpute(~ x + y, n.impute=10, fweighted=ss[i])
sds[i] < sqrt(mean(apply(f$imputed$y, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
## Not run:
# Do a similar experiment for the Titanic dataset
getHdata(titanic3)
h < lm(age ~ sex + pclass + survived, data=titanic3)
rmse < summary(h)$sigma
set.seed(21)
f < aregImpute(~ age + sex + pclass + survived, n.impute=10,
data=titanic3, match='closest')
sd < sqrt(mean(apply(f$imputed$age, 1, var)))
ss < c(0, .01, .02, seq(.05, 1, length=20))
sds < ss; sds[1] < sd
for(i in 2:length(ss)) {
f < aregImpute(~ age + sex + pclass + survived, data=titanic3,
n.impute=10, fweighted=ss[i])
sds[i] < sqrt(mean(apply(f$imputed$age, 1, var)))
}
plot(ss, sds, xlab='Smoothing Parameter', ylab='SD of Imputed Values',
type='b')
abline(v=.2, lty=2) # default value of fweighted
abline(h=rmse, lty=2) # root MSE of residuals from linear regression
## End(Not run)
set.seed(2)
d < data.frame(x1=runif(50), x2=c(rep(NA, 10), runif(40)),
x3=c(runif(4), rep(NA, 11), runif(35)))
reformM(~ x1 + x2 + x3, data=d)
reformM(~ x1 + x2 + x3, data=d, nperm=2)
# Give result or one of the results as the first argument to aregImpute
# Constrain imputed values for two variables
# Require imputed values for x2 to be above 0.2
# Assume x1 is never missing and require imputed values for
# x3 to be less than the recipient's value of x1
a < aregImpute(~ x1 + x2 + x3, data=d,
constraint=list(x2 = expression(d$x2 > 0.2),
x3 = expression(d$x3 < r$x1)))
a