catImp {mlmi} | R Documentation |
Imputation for categorical variables using log linear models
Description
This function performs multiple imputation under a log-linear model
as described by Schafer (1997), using his cat package, either with
or without posterior draws.
Usage
catImp(
  obsData,
  M = 10,
  pd = FALSE,
  type = 1,
  margins = NULL,
  steps = 100,
  rseed
)
Arguments
obsData |
The data frame to be imputed. Variables must be coded such that they take consecutive positive integer values, i.e. 1,2,3,... |
M |
Number of imputations to generate. |
pd |
Specify whether to use posterior draws (TRUE) or not (FALSE). |
type |
An integer specifying the type of log-linear model to impute with. The default, type=1, allows for all two-way associations; type=3 fits a saturated model. |
margins |
An optional argument that can be used instead of type to specify the log-linear model. |
steps |
If pd=TRUE, the number of iterations for which the MCMC chain is run for each imputation. |
rseed |
The value to set the |
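The coding requirement on obsData (consecutive positive integers 1,2,3,...) can be met by recoding each variable before imputation. The following is a minimal base R sketch; the helper name recodeConsecutive and the example data frame are hypothetical and not part of mlmi.

```r
# Hypothetical helper (not part of mlmi): recode every column of a data
# frame to the consecutive integer codes 1, 2, 3, ..., keeping NA as NA.
recodeConsecutive <- function(df) {
  as.data.frame(lapply(df, function(x) as.integer(factor(x))))
}

# Illustrative data with character labels and non-consecutive numeric codes
raw <- data.frame(
  smoker = c("no", "yes", NA, "no"),
  grade  = c(10, 30, 20, NA)
)

coded <- recodeConsecutive(raw)
```

Codes are assigned in the sorted order of the observed values, so a record of the original levels is needed if the imputed values are to be mapped back.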
Details
By default, catImp will impute using a log-linear model allowing for all
two-way associations, but not higher-order associations. This can be modified
through use of the type and margins arguments.
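As an illustration of how a model might be specified through margins, the sketch below builds a margins vector for all two-way associations. It assumes that margins follows the convention of the margins argument of ecm.cat in the cat package, where each margin lists the variables it involves and margins are separated by zeros; this page does not confirm that. The helper twoWayMargins is hypothetical and not part of mlmi.

```r
# Hypothetical helper (not part of mlmi): build a margins vector containing
# every pair of variables, with margins separated by zeros.
# ASSUMPTION: this is the format of the margins argument of cat::ecm.cat.
twoWayMargins <- function(p) {
  pairs <- combn(p, 2)                   # one column per pair of variables
  m <- as.vector(rbind(pairs, 0))        # append a 0 after each pair
  head(m, -1)                            # drop the trailing separator
}

twoWayMargins(3)
```

Under that assumption, passing this vector as margins for a three-variable data frame would be expected to request the same model as the default type=1.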
With pd=FALSE, all imputed datasets are generated conditional on the MLE
of the model parameter, an approach referred to as maximum likelihood
multiple imputation by von Hippel and Bartlett (2021).
With pd=TRUE, regular 'proper' multiple imputation is used, where each
imputation is drawn from a distinct value of the model parameter.
Specifically, for each imputation, a single MCMC chain is run for steps
iterations.
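The two modes can be compared side by side. The sketch below reuses the simulated data from the Examples section and calls catImp once with pd=FALSE and once with pd=TRUE; the values M=5 and steps=100 and the object names impsML and impsPD are arbitrary illustrative choices. The calls are guarded so that the code still runs when mlmi is not installed.

```r
# Simulate a partially observed categorical dataset, as in the Examples
set.seed(1234)
n <- 100
temp <- data.frame(x1 = ceiling(3 * runif(n)),
                   x2 = ceiling(2 * runif(n)),
                   x3 = ceiling(2 * runif(n)))
for (i in 1:3) {
  temp[runif(n) < 0.25, i] <- NA
}

if (requireNamespace("mlmi", quietly = TRUE)) {
  # Maximum likelihood multiple imputation: all draws conditional on the MLE
  impsML <- mlmi::catImp(temp, M = 5, pd = FALSE, rseed = 4423)
  # Posterior draw imputation: one MCMC chain of `steps` iterations per dataset
  impsPD <- mlmi::catImp(temp, M = 5, pd = TRUE, steps = 100, rseed = 4423)
}
```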
Imputed datasets can be analysed using withinBetween, scoreBased, or, for
example, the bootImpute package.
Value
A list of imputed datasets, or if M=1, just the imputed data frame.
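Because the return type differs between M=1 and M>1, downstream code that loops over imputations may want to normalise it first. A minimal base R sketch follows; the helper asImpList is hypothetical and not part of mlmi, and the small data frames stand in for real catImp output.

```r
# Hypothetical helper (not part of mlmi): always return a list of data frames.
# A data frame is itself a list, so it must be tested for explicitly.
asImpList <- function(imps) {
  if (is.data.frame(imps)) list(imps) else imps
}

# Stand-ins for catImp output with M=1 and with M=2
single   <- data.frame(x1 = c(1, 2), x2 = c(2, 1))
multiple <- list(single, single)

length(asImpList(single))
length(asImpList(multiple))
```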
References
Schafer J.L. (1997). Analysis of incomplete multivariate data. Chapman & Hall, Boca Raton, Florida, USA.
von Hippel P.T. and Bartlett J.W. (2021). Maximum likelihood multiple imputation: faster, more efficient imputation without posterior draws. Statistical Science, 36(3), 400-420. doi:10.1214/20-STS793.
Examples
#simulate a partially observed categorical dataset
set.seed(1234)
n <- 100
#for simplicity we simulate completely independent variables
temp <- data.frame(x1=ceiling(3*runif(n)), x2=ceiling(2*runif(n)), x3=ceiling(2*runif(n)))
#make some data missing
for (i in 1:3) {
  temp[(runif(n)<0.25),i] <- NA
}
#impute using catImp, assuming two-way associations in the log-linear model
imps <- catImp(temp, M=10, pd=FALSE, rseed=4423)
#impute assuming a saturated log-linear model
imps <- catImp(temp, M=10, pd=FALSE, type=3, rseed=4423)