smcfcs {smcfcs}    R Documentation

Substantive model compatible fully conditional specification imputation of covariates.

Description

Multiply imputes missing covariate values using substantive model compatible fully conditional specification.

Usage

smcfcs(
  originaldata,
  smtype,
  smformula,
  method,
  predictorMatrix = NULL,
  m = 5,
  numit = 10,
  rjlimit = 1000,
  noisy = FALSE,
  errorProneMatrix = NULL
)

Arguments

originaldata

The original data frame with missing values.

smtype

A string specifying the type of substantive model. Possible values are "lm", "logistic", "brlogistic", "poisson", "weibull", "coxph", "compet".

smformula

The formula of the substantive model. For "weibull" and "coxph" substantive models the left hand side should be of the form "Surv(t,d)". For "compet" substantive models, a list should be passed consisting of the Cox models for each cause of failure (see example).

method

A required vector of strings specifying, for each variable, either that it does not need to be imputed ("") or the type of regression model to be used to impute it. Possible values are "norm" (normal linear regression), "logreg" (logistic regression), "brlogreg" (bias reduced logistic regression), "poisson" (Poisson regression), "podds" (proportional odds regression for ordered categorical variables), "mlogit" (multinomial logistic regression for unordered categorical variables), or a custom expression which defines a passively imputed variable, e.g. "x^2" or "x1*x2". The value "latnorm" indicates that the variable is a latent normal variable which is measured with error. If this is specified for a variable, the "errorProneMatrix" argument should also be used.
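Because method is matched to the columns of the data frame by position, it can be less error-prone to build it by name first. The following is a minimal sketch in base R; the toy data frame and its column names (y, z, x, xsq, v) are assumptions made for illustration, and the names are dropped at the end so that a plain character vector is produced.

```r
# Toy data frame; the column names (y, z, x, xsq, v) are assumed for illustration
set.seed(1)
dat <- data.frame(y = rnorm(6), z = rnorm(6), x = c(NA, rnorm(5)), v = rnorm(6))
dat$xsq <- dat$x^2
dat <- dat[, c("y", "z", "x", "xsq", "v")]

# Start with "" (do not impute) for every column, then fill in by name
method <- setNames(rep("", ncol(dat)), names(dat))
method["x"] <- "norm"    # impute x by normal linear regression
method["xsq"] <- "x^2"   # passively impute xsq as a function of x

# Drop the names so that a plain character vector is left
method <- unname(method)
print(method)
```

The resulting vector has one entry per column of dat, in column order, which is the layout the positional examples on this page use.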

predictorMatrix

An optional predictor matrix. If specified, the matrix defines which covariates will be used as predictors in the imputation models (the outcome must not be included). The i'th row of the matrix should consist of 0s and 1s, with a 1 in the j'th column indicating that the j'th variable should be used as a covariate when imputing the i'th variable. If not specified, when imputing a given variable, the imputation model covariates are the other covariates of the substantive model which are partially observed (but which are not passively imputed) and any fully observed covariates (if present) in the substantive model. Note that the outcome variable is implicitly conditioned on by the rejection sampling scheme used by smcfcs, and should not be specified as a predictor in the predictor matrix.
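The row and column convention can be easier to follow with labelled dimensions. This is a sketch in base R only; the variable names and their order (y, z, x, xsq, v) are assumed, with v standing for a hypothetical auxiliary variable. Whether smcfcs accepts a matrix with dimnames is not stated here, so the labels are removed before the matrix would be passed on.

```r
# Assumed column order of the data frame (illustrative names)
vars <- c("y", "z", "x", "xsq", "v")

# Square matrix of zeros, labelled so rows and columns can be set by name
predMatrix <- matrix(0, nrow = length(vars), ncol = length(vars),
                     dimnames = list(vars, vars))

# Row = variable being imputed, columns = predictors used for it:
# impute x using z and the auxiliary variable v
predMatrix["x", c("z", "v")] <- 1

# The outcome column must stay 0 in every row
stopifnot(all(predMatrix[, "y"] == 0))

# Drop the labels to leave a plain numeric matrix
predMatrix <- unname(predMatrix)
print(predMatrix)
```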

m

The number of imputed datasets to generate. The default is 5.

numit

The number of iterations to run when generating each imputation. In a (limited) range of simulations good performance was obtained with the default of 10 iterations. However, particularly when the proportion of missingness is large, more iterations may be required for convergence to stationarity.

rjlimit

Specifies the maximum number of attempts which should be made when using rejection sampling to draw from imputation models. If the limit is reached during a run, a warning is issued. In this case it is probably advisable to increase rjlimit until the warning no longer appears.
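To illustrate the idea of an attempt limit, here is a generic rejection sampler with a cap, written in base R. It is not the smcfcs implementation: the function name draw_with_limit, its arguments, and the choice to leave unfinished draws as NA are all assumptions made for this sketch.

```r
# Generic rejection sampling with a cap on the number of attempts.
# Illustrative only; this is NOT how smcfcs is implemented internally.
draw_with_limit <- function(n, propose, accept_prob, limit = 1000) {
  out <- rep(NA_real_, n)
  pending <- seq_len(n)          # indices still waiting for an accepted draw
  attempts <- 0
  while (length(pending) > 0 && attempts < limit) {
    cand <- propose(length(pending))                 # candidate values
    ok <- runif(length(pending)) < accept_prob(cand) # accept or reject each
    out[pending[ok]] <- cand[ok]
    pending <- pending[!ok]
    attempts <- attempts + 1
  }
  if (length(pending) > 0) {
    warning("Rejection sampling limit reached for ", length(pending), " draw(s)")
  }
  out
}

set.seed(1)
# Example: standard normal proposal, accepting only positive candidates
draws <- draw_with_limit(10, propose = rnorm,
                         accept_prob = function(x) as.numeric(x > 0))
print(draws)
```

With a low acceptance probability and a small limit, some draws stay pending and the warning fires, which is the situation where raising the limit helps.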

noisy

Logical value (default FALSE) indicating whether verbose output should be printed, which can be useful for debugging or for checking that the models being used are as desired.

errorProneMatrix

An optional matrix which, if specified, indicates that some variables are measured with classical measurement error. If the i'th variable is measured with error by variables j and k, then the (i,j) and (i,k) entries of this matrix should be 1, with the remaining entries 0. The i'th element of the method argument should then be specified as "latnorm". See the measurement error vignette for more details.
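A small sketch of how such a matrix and the matching method vector could be laid out, using base R only. The variable names are hypothetical: x is the latent variable, and w1 and w2 are two error-prone measurements of it. Labels are used while filling in the entries and removed afterwards.

```r
# Hypothetical column order: w1 and w2 are error-prone measurements of x
vars <- c("y", "z", "w1", "w2", "x")

# Row = variable measured with error, columns = its error-prone measurements
errMat <- matrix(0, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
errMat["x", c("w1", "w2")] <- 1

# The latent variable gets "latnorm"; everything else is left unimputed here
method <- setNames(rep("", length(vars)), vars)
method["x"] <- "latnorm"

# Drop the labels to leave a plain matrix and character vector
errMat <- unname(errMat)
method <- unname(method)
print(errMat)
print(method)
```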

Details

smcfcs imputes missing values of covariates using the Substantive Model Compatible Fully Conditional Specification multiple imputation approach proposed by Bartlett et al. (2015) (see references).

Imputation is supported for linear regression ("lm"), logistic regression ("logistic"), bias reduced logistic regression ("brlogistic"), Poisson regression ("poisson"), Weibull ("weibull") and Cox regression for time to event data ("coxph"), and Cox models for competing risks data ("compet"). For "coxph", the event indicator should be integer coded with 0 for censoring and 1 for event. For "compet", a Cox model is assumed for each cause specific hazard function, and the event indicator should be integer coded with 0 corresponding to censoring, 1 corresponding to failure from the first cause etc.

The function returns a list. The first element, impDatasets, is a list of the imputed datasets. Models (e.g. the substantive model) can be fitted to each dataset and the results combined using Rubin's rules via the mitools package, as illustrated in the examples.
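For readers unfamiliar with Rubin's rules, the pooling step for a single coefficient can be sketched by hand in base R. The estimates and variances below are made-up numbers for illustration, not output from smcfcs; in practice mitools performs this calculation (along with degrees of freedom) for you.

```r
# Made-up per-imputation results for one coefficient (illustrative numbers only)
est <- c(1.0, 1.2, 0.9, 1.1, 0.8)   # point estimates from m = 5 imputed datasets
v   <- rep(0.04, 5)                 # their squared standard errors

m <- length(est)
pooled <- mean(est)                 # pooled point estimate
W <- mean(v)                        # within-imputation variance
B <- var(est)                       # between-imputation variance
total_var <- W + (1 + 1 / m) * B    # Rubin's total variance

print(c(estimate = pooled, se = sqrt(total_var)))
```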

The second element smCoefIter is a three dimensional array containing the values of the substantive model parameters obtained at the end of each iteration of the algorithm. The array is indexed by: imputation number, parameter number, iteration.
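The indexing order can be checked on a stand-in array of the same shape. The array below is filled with simulated noise purely to show the indexing; a real one would come from the smCoefIter element of the returned list. The dimensions chosen here (2 imputations, 4 parameters, 50 iterations) are arbitrary.

```r
set.seed(1)
m <- 2; p <- 4; numit <- 50

# Stand-in for smCoefIter: [imputation, parameter, iteration], simulated values
smCoefIter <- array(rnorm(m * p * numit), dim = c(m, p, numit))

# Trace of the third parameter across iterations, first imputation
trace3 <- smCoefIter[1, 3, ]
print(length(trace3))

# One line per imputation for the third parameter:
# smCoefIter[, 3, ] is an m x numit matrix, so transpose before plotting
traces <- t(smCoefIter[, 3, ])
matplot(traces, type = "l", xlab = "Iteration", ylab = "Coefficient 3")
```

A trace that is still drifting at the final iterations would suggest increasing numit.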

If the substantive model is linear, logistic or Poisson regression, smcfcs will automatically impute missing outcomes, if present, using the specified substantive model. However, even in this case, the user should specify "" in the element of method corresponding to the outcome variable.

The bias reduced methods make use of the brglm2 package to fit the corresponding glms using Firth's bias reduced approach. These may be particularly useful when perfect prediction occurs, since the resulting model estimates are guaranteed to be finite even in that case.

The development of this package was supported by the UK Medical Research Council (Fellowship MR/K02180X/1 and grant MR/T023953/1). Part of its development took place while Bartlett was kindly hosted by the University of Michigan's Department of Biostatistics & Institute for Social Research.

The structure of many of the arguments to smcfcs is based on those of the excellent mice package.

Value

A list containing:

impDatasets a list containing the imputed datasets

smCoefIter a three dimensional array containing the substantive model parameter values. The array is indexed by [imputation, parameter number, iteration]

Author(s)

Jonathan Bartlett jonathan.bartlett1@lshtm.ac.uk

References

Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Statistical Methods in Medical Research 2015; 24(4): 462-487. doi:10.1177/0962280214521348

Examples

#set random number seed to make results reproducible
set.seed(123)

#linear substantive model with quadratic covariate effect
imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",
               method=c("","","norm","x^2",""))

#if mitools is installed, fit substantive model to imputed datasets
#and combine results using Rubin's rules
if (requireNamespace("mitools", quietly = TRUE)) {
  library(mitools)
  impobj <- imputationList(imps$impDatasets)
  models <- with(impobj, lm(y~z+x+xsq))
  summary(MIcombine(models))
}

#the following examples are not run when the package is compiled on CRAN
#(to keep computation time down), but they can be run by package users
## Not run: 
  #examining convergence, using 100 iterations, setting m=1
  imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",
                 method=c("","","norm","x^2",""),m=1,numit=100)
  #convergence plot from first imputation for third coefficient of substantive model
  plot(imps$smCoefIter[1,3,])

  #include auxiliary variable assuming it is conditionally independent of Y (which it is here)
  predMatrix <- array(0, dim=c(ncol(ex_linquad),ncol(ex_linquad)))
  predMatrix[3,] <- c(0,1,0,0,1)
  imps <- smcfcs(ex_linquad, smtype="lm", smformula="y~z+x+xsq",
                 method=c("","","norm","x^2",""),predictorMatrix=predMatrix)

  #impute missing x1 and x2, where they interact in substantive model
  imps <- smcfcs(ex_lininter, smtype="lm", smformula="y~x1+x2+x1*x2",
                 method=c("","norm","logreg"))

  #logistic regression substantive model, with quadratic covariate effects
  imps <- smcfcs(ex_logisticquad, smtype="logistic", smformula="y~z+x+xsq",
                 method=c("","","norm","x^2",""))

  #Poisson regression substantive model
  imps <- smcfcs(ex_poisson, smtype="poisson", smformula="y~x+z",
                 method=c("","norm",""))
  if (requireNamespace("mitools", quietly = TRUE)) {
    library(mitools)
    impobj <- imputationList(imps$impDatasets)
    models <- with(impobj, glm(y~x+z,family=poisson))
    summary(MIcombine(models))
  }

  #Cox regression substantive model, with quadratic covariate effect
  if (requireNamespace("survival", quietly = TRUE)) {
    imps <- smcfcs(ex_coxquad, smtype="coxph", smformula="Surv(t,d)~z+x+xsq",
                   method=c("","","","norm","x^2",""))

    #competing risks substantive model, with only main covariate effects
    imps <- smcfcs(ex_compet, smtype="compet",
                   smformula=c("Surv(t,d==1)~x1+x2", "Surv(t,d==2)~x1+x2"),
                   method=c("","","logreg","norm"))
  }

  #if mitools and survival are installed, fit model for first competing risk
  if (requireNamespace("mitools", quietly = TRUE) &&
      requireNamespace("survival", quietly = TRUE)) {
    library(mitools)
    library(survival)  #provides coxph and Surv
    impobj <- imputationList(imps$impDatasets)
    models <- with(impobj, coxph(Surv(t,d==1)~x1+x2))
    summary(MIcombine(models))
  }

  #discrete time survival analysis example
  #(uses survSplit from survival and MIcombine from mitools)
  library(survival)
  library(mitools)
  M <- 5
  imps <- smcfcs(ex_dtsam, "dtsam", "Surv(failtime,d)~x1+x2",
                 method=c("logreg","", "", ""),m=M)
  #fit dtsam model to each dataset manually, since we need
  #to expand to person-period data form first
  ests <- vector(mode = "list", length = M)
  vars <- vector(mode = "list", length = M)
  for (i in 1:M) {
    longData <- survSplit(Surv(failtime,d)~x1+x2, data=imps$impDatasets[[i]],
                          cut=unique(ex_dtsam$failtime[ex_dtsam$d==1]))
    mod <- glm(d~-1+factor(tstart)+x1+x2, family="binomial", data=longData)
    ests[[i]] <- coef(mod)
    vars[[i]] <- diag(vcov(mod))
  }
  summary(MIcombine(ests,vars))


## End(Not run)

[Package smcfcs version 1.8.0 Index]