impute.multivariate.bayesian {miWQS}R Documentation

Multivariate Bayesian Imputation

Description

Given lognormal interval-censored chemical concentrations between zero and different detection limits DL, the chemical concentrations are modelled using Bayesian multivariate regression. Drawing from the posterior predictive density of the BDL chemical concentrations given the observed ones yields multiple (or K) imputed datasets. These datasets are then used in WQS regression.

Usage

impute.multivariate.bayesian(
  X,
  DL,
  Z = NULL,
  K = 5L,
  prior.coeff.mean = NULL,
  prior.cov.mean = NULL,
  T = 250L,
  n.burn = 50L,
  initial = list(NA, NA),
  verbose = FALSE
)

Arguments

X

A numeric vector, matrix, or data-frame of chemical concentration levels with n subjects and C chemicals to be imputed. Missing values are indicated by NA's. Ideally, a numeric matrix.

DL

The detection limit for each chemical as a numeric vector with length equal to C chemicals. Vector must be complete (no NA's); any chemical that has a missing detection limit is not imputed. If DL is a data-frame or matrix with 1 row or 1 column, it is forced as a numeric vector.

Z

Any covariates used in imputing the chemical concentrations. Ideally, a numeric matrix; however, Z can be a factor, vector, or data-frame. Assumed to be complete; observations with missing covariate variables are ignored in the imputation, with a warning printed. If none, enter NULL.

K

A natural number of imputed datasets to generate. Default: 5L.

prior.coeff.mean

The prior mean of number of covariates (p) x C coefficient matrix. The default, entered as NULL, will be a matrix of 1's, given by special.matrix.

prior.cov.mean

The prior mean of covariance matrix. The default, entered as NULL, is an identity matrix with size equal to the number of chemicals, given by special.matrix.

T

Number of total iterations for the Gibbs Sampler. Default: 1000L.

n.burn

The burn-in, which is the number of initial iterations to be discarded. Generally, the burn-in can be quite large as the imputed chemical matrices, X.imputed, are formed from the end of the chain – the lowest state used is T - 10*K. Default: 1L (no burn-in).

initial

An optional two-item list that consists of initial values for the log imputed BDL values vectorized by subject in the Gibbs Sampler. The list contains two elements, one for each chain in the Gibbs Sampler. Each element is a vector of length n0C containing the log imputed BDL values vectorized by subject, (n0 is total # of missing values). If unknown for each chain, enter NA, and the initial values are automatically generated.

verbose

Logical; if TRUE, prints more information. Useful to check for any errors in the code. Default: FALSE.

Value

A list that consists of the following elements:

call

A list of arguments used in this function.

Section - Imputed Dataset (from accessory draw.multi.imputed.samples())

X.imputed

An array of n subjects x C chemicals x K imputed sets on the normal scale. The main result and purpose of the function.

Section - Convergence

convgd.table

A data-frame summarizing convergence with C rows and columns of the Gelman-Rubin statistic and whether the point estimate is less than 1.2. Also printed to the screen.

auto.corr

Summary of autocorrelations of missing data, which are used to justify states taken as imputed datasets. Also printed to screen.

last.states

A list of the last (Tth) states of the imputed values saved to be used for initial values with the first element being from chain1 and second element from chain2.

Section - convgd.surrogates. Surrogates used to check for convergence saved as mcmc.list objects. Returning in case trace plots, autocorrelation plots, etc. wants to be calculated.

eigen.Gamma

An mcmc.list object of the eigenvalues from the p x p Gamma *Gamma^T matrix from the two BURNED chains. The eigenvalues were used as surrogates for the convergence of the coefficient matrix, Gamma.

eigen.Sigma

An mcmc.list object of the eigenvalues for covariance matrix Sigma from two BURNED chains. The eigenvalues were used as surrogates for the convergence of Sigma.

vec.log.X.imputed

An mcmc.list object of the vectorized log imputed chemical values from the two BURNED chains.

Section - Checking Imputation Procedure

indicator.miss

A check; a sum of indicator variables where the number of imputed missing values > detection limit. Should be 0. Printed to screen.

Introduction

We wish to assess the association of the mixture *X* and an outcome *y* while accounting for other covariates *Z*. However, the components in *X* are interval-censored between zero and different detection limits *DL*. The multivariate Bayesian imputation method in the MI-WQS framework (MBMI) jointly imputes the chemical mixture *K* times by taking full advantage of the chemical mixture data.

The logarithmic chemical concentrations *X* are assumed to follow a matrix normal distribution, which is an extension of the multivariate normal:

\log(X)|Z \sim MatNorm( \mu_{i} = z'_{i} \Gamma , \Sigma) , i = 1, ... n

(Iranmanesh et al., 2010). Like other imputation methods in miWQS, we wish to find the posterior predictive density of log(X_miss)|log(X_obs). In impute.multivariate.bayesian(), the missing chemical concentrations are imputed using estimates from a Bayesian Multivariate regression.

Step 1 - Generate a posterior sample

The accessory sample.mregress.impute() function generate a posterior samples using the data augmentation technique. The conjugate priors for a multivariate regression are multivariate extensions of those in the univariate linear regression case. Given complete data, the conjugate priors for the coefficient matrix is another matrix normal with mean prior.coeff.mean, individual variance matrix Z'Z, and chemical variance matrix \Sigma. The prior distribution for the covariance matrix is the inverse-Wishart distribution. In this function, we used a matrix of 1's as the default prior coefficient mean of the matrix normal. The prior parameters chosen for the covariance matrix are vague with the degree of freedoms equal to the number of components, and the mean matrix, by default, is an identity of ones. Instead of attempting to impute a n x C matrix X, we vectorized the logarithmic concentrations by individual, such as: > vec(t(X))
... ...
dieldrin.18 NA
pcb_180.18 -0.2514225
pcb_180.19 -0.2929334
dieldrin.20 -4.4849838
pcb_180.20 -1.0441849
...

The initial missing values were a sample taken from log(uniform(0,DL_j )). If the initial values are set by the user, the initial log imputed values, which is vectorized by subject, has to be n0C x T.

For each step in the data augmentation,

  1. Calculate the MLE, the sample covariance matrix, and the posterior matrix of inverse-Wishart.

  2. Simulate the covariance matrix using inverse Wishart (MCMCpack::riwish()). See InvWishart.

  3. Simulate the coefficient matrix using the matrix normal. See matrixNormal_Distribution.

  4. Impute the vectorized missing log concentrations BDL for each individual from a multivariate normal using current parameter estimates truncated between zero and the detection limits using

Note: The exact MCMC chains are not returned to save computer space.

Step 2 - Assess convergence

To save space, the eigenvalues of the c x c matrix Gamma^T*Gamma and c x c covariance matrix Sigma are saved as surrogates to check for convergence in a Markov Chain.

eigen.Gamma.post

A coda object of eigenvalues taken from a p X C posterior coefficient matrix converted into a square matrix C x C matrix (Gamma^T*Gamma). The eigenvalues as surrogates to check for convergence.

eigen.Sigma.post

A coda object of eigenvalues for covariance matrix (C X C) used as surrogates to check for convergence. The covariance matrix is already square so no conversion is needed.

vec.log.X.imputed

A coda object of the n0 missing values.
Example: The following chemicals shown are those that are missing.
[,1] [,2]
dieldrin.18 -0.7573897 -0.60540942 ...
pcb_180.18 -0.2514225 -1.18717066
pcb_180.19 -0.2929334 -0.01894021
dieldrin.20 -4.4849838 -0.78641994
pcb_180.20 -1.0441849 -0.1349498
...

The accessory converge.multi.chain() function assesses convergence on matrices using the Brook-Gelman’s multivariate potential scale reduction factor (MPSRF). The gelman.diag function calculates the MPSRF on eigenvalues and the vectorized imputed values chain. If the MPSRF is less than 1.24, the stationary distribution of the Markov chains was assumed to occur. The results in convgd.table element are printed to the screen. If at least one chain fails to converge, a warning is printed to occur; in this case, it is suggested to increase T.

Step 3 - Processing MCMC chains

The MCMC chains have already been burned using argument n.burn when generated.

Step 4 - Making imputed value array

The accessory draw.multi.imputed.samples() function forms X.imputed using the posterior predictive distribution of log.miss|log.obs. Using the first MCMC chain of the vectorized log imputed chemical values (vec.log.X.imputed) with length (T), the following states are selected:

t_k=T-(k-1)*10 for k=1,2,…K imputations

The "10" may be justified using autocorrelation summaries, which are printed & returned.

Note

No seed is set in this function. Because bootstraps and data augmentation are random, a seed should be set before every use.

See Also

Other imputation: impute.Lubin(), impute.boot(), impute.sub()

Examples

## Not run: 
#Example takes too long.
system.time({
  set.seed(2345)
  l  <- impute.multivariate.bayesian(
    X =  simdata87$X.bdl[, c(1, 14)], DL = simdata87$DL[c(1, 14)],
    Z =  NULL, T = 200, n.burn = 10, K = 2
  )
})

## End(Not run)

[Package miWQS version 0.4.4 Index]