mvMISE_b {mvMISE}

R Documentation

A multivariate mixed-effects selection model with correlated outcome-specific random intercepts

Description

This function fits a multivariate mixed-effects selection model with correlated outcome-specific random intercepts, allowing for potentially ignorable or non-ignorable missing values in the outcomes. Here an outcome refers to a response variable, for example a genomic feature. The model and function jointly analyze multiple outcomes/features.

Usage

mvMISE_b(Y, X, id, maxIter = 100, tol = 0.001, verbose = FALSE, cov_miss = NULL, 
    miss_y = TRUE, sigma_diff = FALSE)

Arguments

`Y`	an outcome matrix. Each row is a sample, and each column is an outcome variable, with potential missing values (NAs).
`X`	a covariate matrix. Each row is a sample, and each column is a covariate. The covariates can be common among all of the outcomes (e.g., age, gender) or outcome-specific. If a covariate is specific for the k-th outcome, one may set all the values corresponding to the other outcomes to be zero. If X is common across outcomes, the row number of X equals the row number of Y. Otherwise, if X is outcome-specific, the row number of X equals the number of elements in Y, i.e., outcome-specific X is stacked across outcomes within each cluster. See the Examples for demonstration.
`id`	a vector of cluster/batch indices, with one entry per row of Y (and per row of X when X is not outcome-specific).
`maxIter`	the maximum number of iterations for the EM algorithm.
`tol`	the tolerance level for the relative change in the observed-data log-likelihood function.
`verbose`	logical. If TRUE, the iteration history of each step of the EM algorithm will be printed. The default is FALSE.
`cov_miss`	the covariate that can be used in the missing-data model. If it is NULL, the missingness is assumed to be independent of the covariates. Check the Details for the missing-data model. If it is specified and the covariate is not outcome specific, its length equals the length of id. If it is outcome specific, the outcome-specific covariate is stacked across outcomes within each cluster.
`miss_y`	logical. If TRUE, the missingness depends on the outcome Y (see the Details). The default is TRUE. This outcome-dependent missing-data pattern was motivated by, and observed in, mass-spectrometry-based quantitative proteomics data.
`sigma_diff`	logical. If TRUE, the sample error variance of the first sample in each cluster/batch is different from that for the rest of samples within the same cluster/batch. This option is designed and used when analyzing batch-processed proteomics data with the first sample in each cluster/batch being the common reference sample. The default is FALSE.
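
The row-count conventions for Y, X, and id described above can be checked before fitting. The following is a minimal sketch in base R on toy data; the helper check_mvMISE_dims is hypothetical (it is not part of mvMISE) and only restates the rules given for these arguments.

```r
# Hypothetical helper (not part of mvMISE): classify X by its row count,
# following the conventions described for the Y, X, and id arguments.
check_mvMISE_dims <- function(Y, X, id) {
  stopifnot(is.matrix(Y), is.matrix(X), length(id) == nrow(Y))
  if (nrow(X) == nrow(Y)) {
    "common"            # one row of X per sample
  } else if (nrow(X) == length(Y)) {
    "outcome-specific"  # one row of X per element of Y (stacked)
  } else {
    stop("nrow(X) must equal nrow(Y) or length(Y)")
  }
}

# Toy data: 6 samples in 3 clusters, 2 outcomes
set.seed(1)
Y  <- matrix(rnorm(12), nrow = 6, ncol = 2)
id <- rep(1:3, each = 2)

X_common  <- cbind(1, rnorm(6))                                   # 6 x 2
X_stacked <- X_common[rep(seq_len(nrow(X_common)), ncol(Y)), ]    # 12 x 2

check_mvMISE_dims(Y, X_common, id)
check_mvMISE_dims(Y, X_stacked, id)
```

With a single outcome column the two cases coincide, and the sketch reports the covariate matrix as common.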

Details

The multivariate mixed-effects selection model consists of two components, the outcome model and the missing-data model. Here the outcome model is a multivariate mixed-effects model, with correlations among multivariate outcomes modeled via correlated outcome-specific random intercepts with a factor-analytic structure

\mathbf{y}_{i} = \mathbf{X}_{i}\boldsymbol{\beta} + \left(\mathbf{I}_{K}\otimes\mathbf{1}_{n_{i}}\right) \boldsymbol{\tau}b_{i}+\mathbf{e}_{i},

where i denotes a cluster/batch, n_{i} is the number of samples/observations within the i-th cluster, K is the number of outcomes, and \boldsymbol{\tau} is a K\times 1 vector of outcome-specific variance components corresponding to the random effect b_i (a standard normal random variable). By default, a matrix whose columns are indicators for each outcome is generated and used as the random-effect design matrix (\mathbf{I}_{K}\otimes\mathbf{1}_{n_{i}}), and the model estimates the outcome-specific random intercepts. The factor-analytic structure assumes that the outcome-specific random intercepts are identically correlated: they are derived from a single latent variable b_i with loading vector \boldsymbol{\tau}. This model is often used to capture the highly structured experimental or biological correlations among naturally related outcomes, for example the correlation among multiple phosphopeptides (i.e., phosphorylated segments) of the same protein. With this specification, only K parameters, instead of K(K+1)/2, are needed to estimate the covariance matrix of the random effects, which greatly facilitates the computation.
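
To make the factor-analytic structure concrete, the sketch below builds the random-effect design matrix and the implied covariance matrices in base R. All numeric values are toy assumptions, and the error term is assumed here to have independent components with a common variance sigma2 (the sigma_diff = FALSE case); this illustrates the model formula and is not the package's internal code.

```r
K      <- 3                  # number of outcomes (toy value)
n_i    <- 2                  # samples in cluster i (toy value)
tau    <- c(0.5, 1.0, 1.5)   # toy loading vector, one entry per outcome
sigma2 <- 0.25               # toy error variance (assumed common to all samples)

# Random-effect design matrix I_K (x) 1_{n_i}: one indicator column per outcome
Z <- kronecker(diag(K), rep(1, n_i))      # (K * n_i) x K

# Covariance of the random intercepts tau * b_i with b_i ~ N(0, 1)
G <- tcrossprod(tau)                      # K x K, rank one

# Implied marginal covariance of the stacked outcome vector y_i
V <- Z %*% G %*% t(Z) + sigma2 * diag(K * n_i)

# Parameter counts: factor-analytic versus unstructured K x K covariance
c(factor_analytic = K, unstructured = K * (K + 1) / 2)
```

Because G is the outer product of tau with itself, it has rank one, which is why K loadings are enough to describe it.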

The missing-data model can be written as

\textrm{Pr}\left(r_{ik}=1|\mathbf{y}_{ik}\right)= \mathrm{exp}\left(\phi_{0} + \phi_{1}/n_{i}\cdot \mathbf{1}^{'}\mathbf{y}_{ik} + \phi_{2}/n_{i}\cdot \mathbf{1}^{'}\mathbf{c}_{i} \right),

where r_{ik} is the missing indicator for the k-th outcome in the i-th cluster, and \mathbf{c}_{i} is the covariate supplied via cov_miss. If r_{ik}=1, the values of the k-th outcome in the i-th cluster, \mathbf{y}_{ik}, are missing altogether. The estimation is implemented via an EM algorithm. Parameters in the missing-data model can be specified via the arguments miss_y and cov_miss. If miss_y = TRUE, the missingness depends on the outcome values. If cov_miss is specified, the missingness can (additionally) depend on the specified covariate (cov_miss).
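
The missing-data model can be evaluated directly for given parameter values. The sketch below is a hypothetical helper (miss_prob is not part of mvMISE) with toy values for phi, y_ik, and c_i; it simply transcribes the formula above, using the within-cluster means of the outcome and of the covariate.

```r
# Hypothetical helper (not part of mvMISE):
# Pr(r_ik = 1 | y_ik) = exp(phi0 + phi1 * mean(y_ik) + phi2 * mean(c_i))
miss_prob <- function(phi, y_ik, c_i = NULL) {
  eta <- phi[1] + phi[2] * mean(y_ik)                 # phi1 / n_i * 1'y_ik
  if (!is.null(c_i)) eta <- eta + phi[3] * mean(c_i)  # phi2 / n_i * 1'c_i
  exp(eta)
}

phi  <- c(-2, -0.5, 0.3)   # toy values for (phi0, phi1, phi2)
y_ik <- c(1, 2, 3)         # toy outcome values in one cluster
c_i  <- c(0, 1, 2)         # toy covariate values in the same cluster

miss_prob(phi, y_ik)       # exp(-2 - 0.5 * 2) = exp(-3)
miss_prob(phi, y_ik, c_i)  # exp(-3 + 0.3 * 1) = exp(-2.7)
```

The link is exponential rather than logistic, so the toy parameters are chosen to keep the result below 1.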

The model also works for fully observed data if miss_y = FALSE and cov_miss = NULL. It also works for a univariate outcome with potential missing values, if the outcome Y is a matrix with one column.

Value

A list containing

`beta`	the estimated fixed effects.
`var`	the variance-covariance matrix of the estimated fixed effects. With the fixed effects and their covariance matrix estimates, one can obtain the Wald statistics for testing the fixed effects as beta/sqrt(diag(var)).
`pval`	the parametric p-values for testing non-zero fixed effects. They are obtained as the two-sided p-values based on the Wald statistics beta/sqrt(diag(var)).
`sigma2`	the estimated sample error variance(s). If sigma_diff is TRUE, it returns a vector of two elements, the variances for the first sample and for the rest of samples within each cluster.
`tau`	the estimated variance components for the outcome-specific factor-analytic random-effects.
`phi`	the estimated parameters for the missing-data mechanism. Check the Details for the missing-data model. A zero estimate implies that the parameter is ignored via the specification of miss_y and/or cov_miss.
`loglikelihood`	the observed-data log-likelihood values.
`iter`	the number of EM iterations performed when convergence was reached.
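
The Wald statistics and two-sided p-values described for beta, var, and pval can be reproduced by hand. The sketch below uses toy values in place of a fitted object and assumes a standard normal reference distribution; the reference distribution is an assumption here, since it is not stated above.

```r
# Toy stand-ins for fit$beta and fit$var (not output from a real fit)
beta <- c(0.8, -0.1)
V    <- matrix(c(0.04, 0.00,
                 0.00, 0.01), nrow = 2, byrow = TRUE)

# Wald statistics: estimate divided by its standard error
z <- beta / sqrt(diag(V))

# Two-sided p-values, assuming a standard normal reference distribution
pval <- 2 * pnorm(-abs(z))

cbind(estimate = beta, se = sqrt(diag(V)), z = z, p = pval)
```

With a fitted object, the same lines would be applied to the beta and var elements of the returned list.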

References

Jiebiao Wang, Pei Wang, Donald Hedeker, and Lin S. Chen. Using multivariate mixed-effects selection models for analyzing batch-processed proteomics data with non-ignorable missingness. Biostatistics. doi:10.1093/biostatistics/kxy022

Examples

data(sim_dat)

# Covariates X common across outcomes with common coefficients

fit0 = mvMISE_b(Y = sim_dat$Y, X = sim_dat$X, id = sim_dat$id)



# The example below shows how to estimate outcome-specific
# coefficients for a common covariate. The second column of the
# sim_dat$X matrix is a common covariate, but it has different
# effects/coefficients on different outcomes.

nY = ncol(sim_dat$Y)
# stack X across outcomes
X_mat = sim_dat$X[rep(1:nrow(sim_dat$X), nY), ]
# Y_ind is the indicator matrix corresponding to different outcomes
Y_ind = kronecker(diag(nY), rep(1, nrow(sim_dat$Y)))
# generate outcome-specific covariates
cidx = 2  # the index for the covariate with outcome-specific coefficient
X_mat = cbind(1, X_mat[, cidx] * Y_ind)

# X_mat is a 460 (92*5) by 6 matrix: the first column is the
# intercept and the next 5 columns hold the covariate for each outcome

fit1 = mvMISE_b(Y = sim_dat$Y, X = X_mat, id = sim_dat$id)


# A covariate only specific to the first outcome

X_mat1 = X_mat[, 1:2]

fit2 = mvMISE_b(Y = sim_dat$Y, X = X_mat1, id = sim_dat$id)


## An example that allows missingness depending on both a covariate
## and the outcome

fit3 = mvMISE_b(Y = sim_dat$Y, X = sim_dat$X, id = sim_dat$id, 
    cov_miss = sim_dat$X[, 2])

[Package mvMISE version 1.0 Index]