mice.par {micemd}

R Documentation

Parallel calculations for Multivariate Imputation by Chained Equations

Description

Parallel calculations for Multivariate Imputation by Chained Equations using the R package parallel.

Usage

mice.par(don.na, m = 5, method = NULL, predictorMatrix, where = NULL,
visitSequence = NULL, blots = NULL, post = NULL, blocks, formulas,
defaultMethod = c("pmm", "logreg", "polyreg", "polr"), maxit = 5,
seed = NA, data.init = NULL, nnodes = 5, path.outfile = NULL, ...)

Arguments

`don.na`	A data frame or a matrix containing the incomplete data. Missing values are coded as `NA`.
`m`	Number of multiple imputations. The default is `m=5`.
`method`	Either a single string or a vector of strings with length `ncol(data)`, specifying the elementary imputation method to be used for each column in data. If specified as a single string, the same method is used for all columns. The default imputation method (when no argument is specified) depends on the measurement level of the target column and is specified by the `defaultMethod` argument. Columns that need not be imputed have the empty method `''`. See details for more information.
`predictorMatrix`	A square matrix of size `ncol(data)` containing 0/1 data specifying the set of predictors to be used for each target column. Rows correspond to target variables (i.e. variables to be imputed), in the sequence in which they appear in data. A value of '1' means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of `predictorMatrix` must be zero. By default, all other columns are used as predictors (sometimes called massive imputation). Note: for two-level imputation, the codes '2' and '-2' are also allowed.
`where`	A data frame or matrix with logicals of the same dimensions as `data` indicating where in the data the imputations should be created. The default, `where = is.na(data)`, specifies that the missing data should be imputed. The `where` argument may be used to overimpute observed data, or to skip imputations for selected missing values.
`visitSequence`	A vector of integers of arbitrary length, specifying the column indices of the visiting sequence. The visiting sequence is the column order that is used to impute the data during one pass through the data. A column may be visited more than once. All incomplete columns that are used as predictors should be visited, or else the function will stop with an error. The default sequence `1:ncol(data)` implies that columns are imputed from left to right. It is possible to specify one of the keywords `'roman'` (left to right), `'arabic'` (right to left), `'monotone'` (sorted in increasing amount of missingness) and `'revmonotone'` (reverse of monotone). The keyword should be supplied as a string and may be abbreviated.
`blots`	A named `list` of `alist`s that can be used to pass arguments down to lower-level imputation functions. The entries of element `blots[[blockname]]` are passed down to the function called for block `blockname`.
`post`	A vector of strings with length `ncol(data)`, specifying expressions. Each string is parsed and executed within the `sampler()` function to postprocess imputed values. The default is to do nothing, indicated by a vector of empty strings `''`.
`blocks`	List of vectors with variable names per block. List elements may be named to identify blocks. Variables within a block are imputed by a multivariate imputation method (see `method` argument). By default each variable is placed into its own block, which is effectively fully conditional specification (FCS) by univariate models (variable-by-variable imputation). Only variables whose names appear in `blocks` are imputed. Columns of the `where` matrix that belong to variables that are not block members are set to `FALSE`. A variable may appear in multiple blocks. In that case, it is effectively re-imputed each time it is visited.
`formulas`	A named list of formulas, or expressions that can be converted into formulas by `as.formula`. List elements correspond to blocks. The block to which the list element applies is identified by its name, so list names must correspond to block names. The `formulas` argument is an alternative to the `predictorMatrix` argument that allows for more flexibility in specifying imputation models, e.g., for specifying interaction terms.
`defaultMethod`	A vector of four strings containing the default imputation methods for numeric columns, factor columns with 2 levels, unordered factor columns with more than 2 levels, and ordered factor columns with more than 2 levels, respectively. If nothing is specified, the following defaults are used: `pmm`, predictive mean matching (numeric data); `logreg`, logistic regression imputation (binary data, factor with 2 levels); `polyreg`, polytomous regression imputation for unordered categorical data (factor with > 2 levels); `polr`, proportional odds model for ordered categorical data (factor with > 2 levels).
`maxit`	A scalar giving the number of iterations. The default is 5.
`seed`	An integer that is passed to `set.seed()` to offset the random number generator. The default is to leave the random number generator alone.
`data.init`	A data frame of the same size and type as `data`, without missing data, used to initialize imputations before the start of the iterative process. The default `NULL` implies that starting imputations are created by a simple random draw from the data. Note that specifying `data.init` will start the `m` Gibbs sampling streams from the same imputations.
`nnodes`	A scalar indicating the number of nodes for parallel calculation. Default value is 5.
`path.outfile`	A vector of strings indicating the path to which print messages are redirected. The default value is `NULL`, meaning that imputation is performed silently. Otherwise, print messages are saved in the files `path.outfile/output.txt`. One file per node is generated.
`...`	Named arguments that are passed down to the elementary imputation functions.
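
As an illustration of the `predictorMatrix` and `method` arguments described above, the following minimal sketch builds both objects for the one-level `nhanes` data using helper functions from the mice package (`make.predictorMatrix` and `make.method`, assuming mice version 3.0 or later). The object names `pred` and `meth` are arbitrary, and the final `mice.par` call is left commented out because it requires micemd and starts a cluster.

```r
library(mice)

data(nhanes, package = "mice")

# Default predictor matrix: square, 0/1 entries, zero diagonal
pred <- make.predictorMatrix(nhanes)

# Do not use 'age' as a predictor when imputing 'bmi'
# (rows are target variables, columns are predictors)
pred["bmi", "age"] <- 0

# Default method per column; complete columns get the empty method ""
meth <- make.method(nhanes)

print(pred)
print(meth)

# These objects could then be passed on, e.g.:
# imp <- mice.par(nhanes, predictorMatrix = pred, method = meth, nnodes = 2)
```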

Details

Generates `m` imputed tables in parallel: `m` seeds are generated first, and multiple imputation by chained equations is then run in parallel from each seed. The output has the same structure as the output of the `mice` function from the mice package.
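
The idea described above can be sketched with the parallel and mice packages directly. This is only an illustration under stated assumptions, not the actual implementation of `mice.par`: the way the seeds are drawn, the use of `parLapply`, and the combination step with `mice::ibind` are all choices made for this sketch.

```r
library(mice)
library(parallel)

data(nhanes, package = "mice")

m <- 2                           # number of imputed tables
set.seed(1)
seeds <- sample.int(1e6, m)      # one seed per imputed table (assumed scheme)

cl <- makeCluster(2)             # two worker processes
clusterEvalQ(cl, library(mice))  # make mice available on each worker

# Each worker runs one chain (m = 1) from its own seed
imps <- parLapply(cl, seeds, function(s, dat) {
  mice(dat, m = 1, maxit = 2, seed = s, printFlag = FALSE)
}, dat = nhanes)

stopCluster(cl)

# Combine the single-imputation mids objects into one mids object
imp <- Reduce(ibind, imps)
```

The combined object `imp` is of class `mids` and holds `m` imputations, so it can be analysed in the same way as the output of `mice`.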

Value

Returns an S3 object of class `mids` (multiply imputed data set).
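
A `mids` object can be handled with the usual tools from the mice package. The sketch below uses `mice` itself as a stand-in for `mice.par` (both return the same class), with small values of `m` and `maxit` chosen only to keep the run short.

```r
library(mice)

data(nhanes, package = "mice")

# Stand-in for mice.par(): same class of result, no cluster needed
imp <- mice(nhanes, m = 2, maxit = 2, seed = 123, printFlag = FALSE)

# Extract the first completed data set
first <- complete(imp, 1)

# Fit the same model on every completed data set, then pool the results
fit <- with(imp, lm(bmi ~ hyp + chl))
pooled <- pool(fit)
summary(pooled)
```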

Author(s)

Vincent Audigier vincent.audigier@cnam.fr

References

Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/article/view/v045i03 <doi:10.18637/jss.v045.i03>

Van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC Press.

Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049–1064. <doi:10.1080/10629360600810434>

Van Buuren, S. (2007). Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), 219–242. <doi:10.1177/0962280206074463>

Van Buuren, S., Boshuizen, H.C., Knook, D.L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681–694. <doi:10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R>

Brand, J.P.L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Dissertation. Rotterdam: Erasmus University.

Examples


  ##############
  # nhanes (one-level data)
  ##############
  data(nhanes, package = "mice")
  # imp <- mice.par(nhanes)
  # fit <- with(data = imp, expr = lm(bmi ~ hyp + chl))
  # summary(pool(fit))
  
  ##############
  # CHEM97Na (two-level data with 1681 observations and 5 variables)
  ##############

  data(CHEM97Na)

  ind.clust <- 1  # index of the cluster variable

  # initialisation of the argument predictorMatrix
  predictor.matrix <- mice(CHEM97Na, m = 1, maxit = 0)$predictorMatrix
  predictor.matrix[ind.clust, ind.clust] <- 0
  predictor.matrix[-ind.clust, ind.clust] <- -2      # -2 identifies the cluster variable
  predictor.matrix[predictor.matrix == 1] <- 2       # 2 codes the remaining predictors for two-level imputation

  # initialisation of the argument method
  method <- find.defaultMethod(CHEM97Na, ind.clust)

  # multiple imputation by chained equations (parallel calculation) [1 minute]
  # (the imputation process can be followed by opening the output.txt files
  # in the working directory)
  # res.mice <- mice.par(CHEM97Na,
  #                      predictorMatrix = predictor.matrix,
  #                      method = method,
  #                      path.outfile = getwd())


  # multiple imputation by chained equations (without parallel calculation) [4.8 minutes]
  # res.mice <- mice(CHEM97Na,
  #                  predictorMatrix = predictor.matrix,
  #                  method = method)

  
  
  ############
  # IPDNa (two-level data with 11685 observations and 10 variables)
  ############

  data(IPDNa)

  ind.clust <- 1  # index of the cluster variable

  # initialisation of the argument predictorMatrix
  predictor.matrix <- mice(IPDNa, m = 1, maxit = 0)$predictorMatrix
  predictor.matrix[ind.clust, ind.clust] <- 0
  predictor.matrix[-ind.clust, ind.clust] <- -2      # -2 identifies the cluster variable
  predictor.matrix[predictor.matrix == 1] <- 2       # 2 codes the remaining predictors for two-level imputation

  # initialisation of the argument method
  method <- find.defaultMethod(IPDNa, ind.clust)

  # multiple imputation by chained equations (parallel calculation)

  # res.mice <- mice.par(IPDNa,
  #                      predictorMatrix = predictor.matrix,
  #                      method = method,
  #                      path.outfile = getwd())

[Package micemd version 1.10.0]