ctassoc {omicwas}R Documentation

Cell-Type-Specific Association Testing

Description

Cell-Type-Specific Association Testing

Usage

ctassoc(
  X,
  W,
  Y,
  C = NULL,
  test = "full",
  regularize = FALSE,
  num.cores = 1,
  chunk.size = 1000,
  seed = 123
)

Arguments

X

Matrix (or vector) of traits; samples x traits.

W

Matrix of cell type composition; samples x cell types.

Y

Matrix (or vector) of bulk omics measurements; markers x samples.

C

Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric.

test

Statistical test to apply; either "full", "marginal", "nls.identity", "nls.log", "nls.logit", "propdiff.identity", "propdiff.log", "propdiff.logit" or "reducedrankridge".

regularize

Whether to apply Tikhonov (ie ridge) regularization to \beta_{h j k}. The regularization parameter is chosen automatically according to an unbiased version of (Lawless & Wang, 1976). Effective for nls.* and propdiff.* tests.

num.cores

Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored.

chunk.size

The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size.

seed

Seed for random number generation.

Details

Let the indexes be h for cell type, i for sample, j for marker (CpG site or gene), k for each trait that has cell-type-specific effect, and l for each trait that has a uniform effect across cell types. The input data are X_{i k}, C_{i l}, W_{i h} and Y_{j i}, where C_{i l} can be omitted. X_{i k} and C_{i l} are the values for two types of traits, showing effects that are cell-type-specific or not, respectively. Thus, calling X_{i k} and C_{i l} as "traits" and "covariates" gives a rough idea, but is not strictly correct. W_{i h} represents the cell type composition and Y_{j i} represents the marker level, such as methylation or gene expression. For each tissue sample, the cell type proportion W_{i h} is the proportion of each cell type in the bulk tissue, which is measured or imputed beforehand. The marker level Y_{j i} in bulk tissue is measured and provided as input.

The parameters we estimate are the cell-type-specific trait effect \beta_{h j k}, the tissue-uniform trait effect \gamma_{j l}, and the basal marker level \alpha_{h j} in each cell type.

We first describe the conventional linear regression models. For marker j in sample i, the maker level specific to cell type h is

\alpha_{h j} + \sum_k \beta_{h j k} * X_{i k}.

This is a representative value rather than a mean, because we do not model a probability distribution for cell-type-specific expression. The bulk tissue marker level is the average weighted by W_{i h},

\mu_{j i} = \sum_h W_{i h} [ \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} ] + \sum_l \gamma_{j l} C_{i l}.

The statistical model is

Y_{j i} = \mu_{j i} + \epsilon_{j i},

\epsilon_{j i} ~ N(0, \sigma^2_j).

The error of the marker level is is noramlly distributed with variance \sigma^2_j, independently among samples.

The full model is the linear regression

Y_{j i} = (\sum_h \alpha_{h j} * W_{i h}) + (\sum_{h k} \beta_{h j k} * W_{i h} * X_{i k}) + (\sum_l \gamma_{j l} * C_{i l}) + error.

The marginal model tests the trait association only in one cell type h, under the linear regression,

Y_{j i} = (\sum_{h'} \alpha_{h' j} * W_{i h'}) + (\sum_k \beta_{h j k} * W_{i h} * X_{i k}) + (\sum_l \gamma_{j l} * C_{i l}) + error.

The nonlinear model simultaneously analyze cell type composition in linear scale and differential expression/methylation in log/logit scale. The normalizing function is the natural logarithm f = log for gene expression, and f = logit for methylation. Conventional linear regression can be formulated by defining f as the identity function. The three models are named nls.log, nls.logit and nls.identity. We denote the inverse function of f by g; g = exp for gene expression, and g = logistic for methylation. The mean normalized marker level of marker j in sample i becomes

\mu_{j i} = f(\sum_h W_{i h} g( \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} )) + \sum_l \gamma_{j l} C_{i l}.

The statistical model is

f(Y_{j i}) = \mu_{j i} + \epsilon_{j i},

\epsilon_{j i} ~ N(0, \sigma^2_j).

The error of the marker level is is noramlly distributed with variance \sigma^2_j, independently among samples.

The ridge regression aims to cope with multicollinearity of the interacting terms W_{i h} * X_{i k}. Ridge regression is fit by minimizing the residual sum of squares (RSS) plus \lambda \sum_{h k} \beta_{h j k}^2, where \lambda > 0 is the regularization parameter.

The propdiff tests try to cope with multicollinearity by, roughly speaking, using mean-centered W_{i h}. We obtain, instead of \beta_{h j k}, the deviation of \beta_{h j k} from the average across cell types. Accordingly, the null hypothesis changes. The original null hypothesis was \beta_{h j k} = 0. The null hypothesis when centered is \beta_{h j k} - (\sum_{i h'} W_{i h'} \beta_{h' j k}) / (\sum_{i h'} W_{i h'}) = 0. It becomes difficult to detect a signal for a major cell type, because \beta_{h j k} would be close to the average across cell types. The tests propdiff.log and propdiff.logit include an additional preprocessing step that converts Y_{j i} to f(Y_{j i}). Apart from the preprocessing, the computations are performed in linear scale. As the preprocessing distorts the linearity between the dependent variable and (the centered) W_{i h}, I actually think propdiff.identity is better.

Value

A list with one element, which is named "coefficients". The element gives the estimate, statistic, p.value in tibble format. In order to transform the estimate for \alpha_{h j} to the original scale, apply plogis for test = nls.logit and exp for test = nls.log. The estimate for \beta_{h j k} by test = nls.log is the natural logarithm of fold-change, not the log2. If numerical convergence fails, NA is returned for that marker.

References

Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353

See Also

ctcisQTL

Examples


data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients



[Package omicwas version 0.8.0 Index]