mvMISE_e {mvMISE}    R Documentation

A multivariate mixed-effects selection model with correlated outcome-specific error terms

Description

This function fits a multivariate mixed-effects selection model with correlated outcome-specific error terms and potential missing values in the outcome. Here an outcome refers to a response variable, for example, a genomic feature. The model and function jointly analyze multiple outcomes/features. For high-dimensional outcomes, the model can regularize the estimation by shrinking the error precision matrix with a graphical lasso penalty. Because the penalty is introduced and the choice of tuning parameter is often data-dependent, we recommend using permutation to calculate p-values when testing with the mvMISE_e model. See mvMISE_e_perm for calculating the permutation-based p-values.

Usage

mvMISE_e(Y, X, id, Zidx = 1, maxIter = 100, tol = 0.001, lambda = NULL, ADMM = TRUE, 
    verbose = FALSE, cov_miss = NULL, miss_y = NULL, sigma_diff = FALSE)

Arguments

Y

an outcome matrix. Each row is a sample, and each column is an outcome variable, with potential missing values (NAs).

X

a covariate matrix. Each row is a sample, and each column is a covariate. The covariates can be common among all of the outcomes (e.g., age, gender) or outcome-specific. If a covariate is specific for the k-th outcome, one may set all the values corresponding to the other outcomes to be zero. If X is common across outcomes, the row number of X equals the row number of Y. Otherwise if X is outcome-specific, the row number of X equals the number of elements in Y, i.e., outcome-specific X is stacked across outcomes. See the Examples for demonstration.

id

a vector of cluster/batch indices, matching the rows of Y (and the rows of X if X is not outcome-specific).

Zidx

the column indices of matrix X used as the design matrix of random effects. The default is 1, i.e., a random intercept is included if the first column of X is a vector of 1s. If Zidx=c(1,2), then the model would estimate the random intercept and the random effects of the 2nd column in the covariate matrix X. The random-effects in this model are assumed to be independent.

maxIter

the maximum number of iterations for the EM algorithm.

tol

the tolerance level for the relative change in the observed-data log-likelihood function.

lambda

the tuning parameter for the graphical lasso penalty of the error precision matrix. It can be selected by AIC (an output). The default is sqrt(log(ncol(Y))/nrow(Y)).

ADMM

logical. If TRUE (the default), we impose an L1 graphical lasso penalty on the error precision (inverse of covariance) matrix, and the alternating direction method of multipliers (ADMM) is used to estimate the error precision and the error covariance matrix. If FALSE, no penalty is used to estimate the unstructured error covariance matrix, which is only applicable to low-dimensional multivariate outcomes. For a univariate outcome, it should be set to FALSE.

verbose

logical. If TRUE, the iteration history of each step of the EM algorithm will be printed. The default is FALSE.

cov_miss

the covariate that can be used in the missing-data model. If it is NULL, the missingness is assumed to be independent of the covariates. Check the Details for the missing-data model. If it is specified and the covariate is not outcome specific, its length equals the length of id. If it is outcome specific, the outcome-specific covariate is stacked across outcomes within each cluster.

miss_y

logical. If TRUE, the missingness depends on the outcome Y (see the Details). If left as NULL (the default), it is set to TRUE if the average missing rate is greater than 5%, and to FALSE otherwise. This outcome-dependent missing-data pattern was motivated by, and observed in, mass-spectrometry-based quantitative proteomics data.

sigma_diff

logical. If TRUE, the error variance of the first sample differs from that of the remaining samples within each cluster. This option is designed for analyzing batch-processed proteomics data in which the first sample in each cluster/batch is the common reference sample. The default is FALSE.
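
The defaults described in the arguments above can be reproduced by hand in base R. The sketch below is only illustrative: the toy matrices are made up, and reading "average missing rate" as the overall proportion of NA cells in Y is an assumption rather than something the package documentation states.

```r
# Hypothetical toy data: 6 samples, 3 outcomes, three missing cells.
set.seed(1)
Y <- matrix(rnorm(18), nrow = 6, ncol = 3)
Y[c(2, 9, 16)] <- NA
X <- cbind(1, rnorm(6))  # first column of 1s acts as the intercept

# Documented default for lambda: sqrt(log(ncol(Y)) / nrow(Y))
lambda_default <- sqrt(log(ncol(Y)) / nrow(Y))

# Assumed reading of "average missing rate": proportion of NA cells in Y
miss_rate <- mean(is.na(Y))
miss_y_default <- miss_rate > 0.05

# Random-effects design implied by Zidx = 1: the selected column(s) of X
Zidx <- 1
Z <- X[, Zidx, drop = FALSE]
```

With Zidx = c(1, 2), the same subsetting would select both columns of X, corresponding to a random intercept plus a random effect for the second covariate.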

Details

The multivariate mixed-effects selection model consists of two components, the outcome model and the missing-data model. Here the outcome model is a multivariate mixed-effects model. The correlations among multivariate outcomes are modeled via outcome-specific error terms with an unstructured covariance matrix. For the i-th cluster, the outcome matrix \mathbf{Y}_{i} is a matrix of n_i samples (rows) and K outcomes (columns). Let \mathbf{y}_{i} = \mathrm{vec}\left( \mathbf{Y}_{i} \right). The outcome vector \mathbf{y}_{i} can be modelled as

\mathbf{y}_{i} = \mathbf{X}_{i}\boldsymbol{\beta}+\mathbf{Z}_{i}\mathbf{b}_{i}+\mathbf{e}_{i},

where the random effects (\mathbf{b}_{i}) follow a normal distribution \mathbf{b}_{i}\sim N(\mathbf{0},\mathbf{D}); and the error term \mathbf{e}_{i}=\mathrm{vec}\left(\mathbf{E}_{i}\right) \sim N(\mathbf{0},\boldsymbol{\Sigma}\otimes\mathbf{S}_{i}). The matrix \mathbf{S}_{i} is an n_i\times n_i diagonal matrix with diagonal elements corresponding to the error variances of the n_i samples within the i-th cluster. The variances for the first and other samples can be different if sigma_diff = TRUE. The matrix \boldsymbol{\Sigma} captures the error (or unexplained) covariances among the K outcomes. For high-dimensional outcomes, if ADMM = TRUE (the default), the off-diagonal elements of the inverse of \boldsymbol{\Sigma} will be shrunk by a graphical lasso penalty and the alternating direction method of multipliers (ADMM) is used to estimate \boldsymbol{\Sigma}. If ADMM = FALSE, no penalty is used to estimate the unstructured error covariance matrix, which is only applicable to low-dimensional multivariate outcomes.
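
To make the Kronecker structure of the error covariance concrete, the following base R sketch builds \boldsymbol{\Sigma}\otimes\mathbf{S}_{i} for one small cluster. All numeric values, the random-intercept design Z_i, and the assumption that \mathbf{b}_{i} and \mathbf{e}_{i} are independent are illustrative choices, not taken from the package.

```r
# Hypothetical dimensions for one cluster: n_i = 3 samples, K = 2 outcomes.
n_i <- 3
K <- 2

# Sigma: K x K error covariance among outcomes (illustrative values).
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), nrow = K, byrow = TRUE)

# S_i: n_i x n_i diagonal matrix of sample error variances.
# Here the first sample has its own variance, as under sigma_diff = TRUE.
S_i <- diag(c(0.8, 1, 1))

# Covariance of e_i = vec(E_i), where vec stacks the columns (outcomes).
cov_e <- kronecker(Sigma, S_i)

# Assumed design: one random intercept shared by all entries of the cluster.
Z_i <- matrix(1, nrow = n_i * K, ncol = 1)
D <- matrix(0.3, nrow = 1, ncol = 1)

# Marginal covariance of y_i, assuming b_i and e_i are independent.
V_i <- Z_i %*% D %*% t(Z_i) + cov_e
```

Entry ((k-1) n_i + s, (k'-1) n_i + s') of cov_e equals Sigma[k, k'] * S_i[s, s'], so two different samples have uncorrelated errors while two outcomes measured on the same sample are correlated through Sigma.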

The missing-data model can be written as

\textrm{Pr}\left(r_{ik}=1|\mathbf{y}_{ik}\right)= \mathrm{exp}\left(\phi_{0} + \phi_{1}/n_{i}\cdot \mathbf{1}^{'}\mathbf{y}_{ik} + \phi_{2}/n_{i}\cdot \mathbf{1}^{'}\mathbf{c}_{i} \right),

where r_{ik} is the missing indicator for the k-th outcome in the i-th cluster. If r_{ik}=1, the k-th outcome in the i-th cluster, \mathbf{y}_{ik}, is missing altogether. The estimation is implemented within an EM algorithm framework. Parameters in the missing-data model can be specified via the arguments miss_y and cov_miss. If miss_y = TRUE, the missingness depends on the outcome values. If cov_miss is specified, the missingness can (additionally) depend on the specified covariates.
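
The missing-data formula above can be evaluated directly for given parameter values. The helper below is a hypothetical illustration (miss_prob is not a function of the package), and the phi values are made up. Note that with the exponential form the result is a valid probability only when the linear predictor is non-positive.

```r
# Sketch of the missing-data probability from the formula above.
# phi = c(phi0, phi1, phi2); y_ik and c_i are vectors of length n_i.
miss_prob <- function(y_ik, c_i, phi) {
  n_i <- length(y_ik)
  eta <- phi[1] + phi[2] / n_i * sum(y_ik) + phi[3] / n_i * sum(c_i)
  exp(eta)
}

# Illustrative (hypothetical) parameters: a negative phi1 means that
# lower outcome values are more likely to be missing.
phi <- c(-2, -0.5, 0)
p_high <- miss_prob(y_ik = c(1, 2, 3), c_i = c(0, 0, 0), phi = phi)   # exp(-3)
p_low  <- miss_prob(y_ik = c(-1, 0, 1), c_i = c(0, 0, 0), phi = phi)  # exp(-2)
```

Setting phi[2] to zero corresponds to miss_y = FALSE, and setting phi[3] to zero corresponds to cov_miss = NULL, in line with the description of the phi output.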

The model also works for fully observed data if miss_y = FALSE and cov_miss = NULL. It also works for a univariate outcome with potential missing values, if the outcome Y is a matrix with one column.

Value

A list containing

beta

the estimated fixed-effects.

stat

the parametric Wald statistics for testing non-zero fixed-effects. It is used in permutation tests.

Sigma

the estimated error covariance matrix for the outcomes.

sigma2

the estimated sample error variance(s). If sigma_diff is TRUE, it returns a vector of two elements: the variance for the first sample and the variance for the remaining samples within each cluster.

D

the estimated covariance matrix for the random-effects.

phi

the estimated parameters for the missing-data mechanism. Check the Details for the missing-data model. A zero value implies that parameter is ignored via the specification of miss_y and cov_miss.

loglikelihood

the observed-data log-likelihood values.

iter

the number of EM iterations performed before convergence.

AIC

the Akaike information criterion (AIC) calculated for selecting the tuning parameter lambda of the graphical lasso penalty.
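
Since lambda can be selected by AIC, one possible workflow is to fit the model over a grid of lambda values and keep the fit with the smallest AIC. The helper select_lambda below is hypothetical and not part of the package. It assumes the returned AIC is a single number and that smaller values are preferred. A stand-in fitting function is used so the sketch runs without the package.

```r
# Hypothetical helper: fit over a grid of lambda values and keep the
# fit with the smallest AIC. `fit_fun` takes one lambda and must return
# a list with an `AIC` element, as mvMISE_e is documented to do.
select_lambda <- function(lambdas, fit_fun) {
  fits <- lapply(lambdas, fit_fun)
  aic <- vapply(fits, function(f) f$AIC, numeric(1))
  best <- which.min(aic)
  list(lambda = lambdas[best], AIC = aic[best], fit = fits[[best]])
}

# Stand-in fitting function so the sketch runs without the package.
toy_fit <- function(lambda) list(AIC = (lambda - 0.2)^2 + 10)
res <- select_lambda(c(0.05, 0.1, 0.2, 0.4), toy_fit)

# With the package loaded, the call might look like:
# res <- select_lambda(c(0.05, 0.1, 0.2, 0.4), function(l)
#   mvMISE_e(Y = sim_dat$Y, X = sim_dat$X, id = sim_dat$id, lambda = l))
```

The grid values here are arbitrary. In practice a grid centred on the documented default sqrt(log(ncol(Y))/nrow(Y)) may be a reasonable starting point.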

References

Jiebiao Wang, Pei Wang, Donald Hedeker, and Lin S. Chen. Using multivariate mixed-effects selection models for analyzing batch-processed proteomics data with non-ignorable missingness. Biostatistics. doi:10.1093/biostatistics/kxy022

Examples

data(sim_dat)

# Covariates X common across outcomes with common coefficients

fit0 = mvMISE_e(Y = sim_dat$Y, X = sim_dat$X, id = sim_dat$id)



# The example below shows how to estimate outcome-specific
# coefficients for a common covariate. The second column of the
# sim_dat$X matrix is a common covariate, but it has different
# effects/coefficients on different outcomes.

nY = ncol(sim_dat$Y)
# stack X across outcomes
X_mat = sim_dat$X[rep(1:nrow(sim_dat$X), nY), ]
# Y_ind is the indicator matrix corresponding to different outcomes
Y_ind = kronecker(diag(nY), rep(1, nrow(sim_dat$Y)))
# generate outcome-specific covariates
cidx = 2  # the index for the covariate with outcome-specific coefficient
X_mat = cbind(1, X_mat[, cidx] * Y_ind)

# X_mat is a 460 (92*5) by 6 matrix: the first column is the
# intercept and the next 5 columns hold the covariate for each outcome

fit1 = mvMISE_e(Y = sim_dat$Y, X = X_mat, id = sim_dat$id)


# A covariate only specific to the first outcome

X_mat1 = X_mat[, 1:2]

fit2 = mvMISE_e(Y = sim_dat$Y, X = X_mat1, id = sim_dat$id)


## An example to allow missingness to depend on both a covariate and
## the outcome

fit3 = mvMISE_e(Y = sim_dat$Y, X = sim_dat$X, id = sim_dat$id, 
    cov_miss = sim_dat$X[, 2])



[Package mvMISE version 1.0 Index]