ctassoc {omicwas} | R Documentation |
Cell-Type-Specific Association Testing
Description
Cell-Type-Specific Association Testing
Usage
ctassoc(
X,
W,
Y,
C = NULL,
test = "full",
regularize = FALSE,
num.cores = 1,
chunk.size = 1000,
seed = 123
)
Arguments
X |
Matrix (or vector) of traits; samples x traits. |
W |
Matrix of cell type composition; samples x cell types. |
Y |
Matrix (or vector) of bulk omics measurements; markers x samples. |
C |
Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric. |
test |
Statistical test to apply; either |
regularize |
Whether to apply Tikhonov (ie ridge) regularization
to |
num.cores |
Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored. |
chunk.size |
The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size. |
seed |
Seed for random number generation. |
Details
Let the indexes be
h
for cell type, i
for sample,
j
for marker (CpG site or gene),
k
for each trait that has cell-type-specific effect,
and l
for each trait that has a uniform effect across cell types.
The input data are X_{i k}
, C_{i l}
, W_{i h}
and Y_{j i}
,
where C_{i l}
can be omitted.
X_{i k}
and C_{i l}
are the values for two types of traits,
showing effects that are cell-type-specific or not, respectively.
Thus, calling X_{i k}
and C_{i l}
as "traits" and "covariates"
gives a rough idea, but is not strictly correct.
W_{i h}
represents the cell type composition and
Y_{j i}
represents the marker level,
such as methylation or gene expression.
For each tissue sample, the cell type proportion W_{i h}
is the proportion of each cell type in the bulk tissue,
which is measured or imputed beforehand.
The marker level Y_{j i}
in bulk tissue is measured and provided as input.
The parameters we estimate are
the cell-type-specific trait effect \beta_{h j k}
,
the tissue-uniform trait effect \gamma_{j l}
,
and the basal marker level \alpha_{h j}
in each cell type.
We first describe the conventional linear regression models.
For marker j
in sample i
,
the maker level specific to cell type h
is
\alpha_{h j} + \sum_k \beta_{h j k} * X_{i k}.
This is a representative value rather than a mean, because we do not model
a probability distribution for cell-type-specific expression.
The bulk tissue marker level is the average weighted by W_{i h}
,
\mu_{j i} = \sum_h W_{i h} [ \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} ] +
\sum_l \gamma_{j l} C_{i l}.
The statistical model is
Y_{j i} = \mu_{j i} + \epsilon_{j i},
\epsilon_{j i} ~ N(0, \sigma^2_j).
The error of the marker level is is noramlly distributed with variance
\sigma^2_j
, independently among samples.
The full
model is the linear regression
Y_{j i} = (\sum_h \alpha_{h j} * W_{i h}) +
(\sum_{h k} \beta_{h j k} * W_{i h} * X_{i k}) +
(\sum_l \gamma_{j l} * C_{i l}) +
error.
The marginal
model tests the trait association only in one
cell type h
, under the linear regression,
Y_{j i} = (\sum_{h'} \alpha_{h' j} * W_{i h'}) +
(\sum_k \beta_{h j k} * W_{i h} * X_{i k}) +
(\sum_l \gamma_{j l} * C_{i l}) +
error.
The nonlinear model simultaneously analyze cell type composition in
linear scale and differential expression/methylation in log/logit scale.
The normalizing function is the natural logarithm f
= log for gene
expression, and f
= logit for methylation. Conventional linear regression
can be formulated by defining f
as the identity function. The three models
are named nls.log
, nls.logit
and nls.identity
.
We denote the inverse function of f
by g
; g
= exp for
gene expression, and g
= logistic for methylation.
The mean normalized marker level of marker j
in sample i
becomes
\mu_{j i} = f(\sum_h W_{i h} g( \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} )) +
\sum_l \gamma_{j l} C_{i l}.
The statistical model is
f(Y_{j i}) = \mu_{j i} + \epsilon_{j i},
\epsilon_{j i} ~ N(0, \sigma^2_j).
The error of the marker level is is noramlly distributed with variance
\sigma^2_j
, independently among samples.
The ridge regression aims to cope with multicollinearity of
the interacting terms W_{i h} * X_{i k}
.
Ridge regression is fit by minimizing the residual sum of squares (RSS) plus
\lambda \sum_{h k} \beta_{h j k}^2
, where \lambda > 0
is the
regularization parameter.
The propdiff tests try to cope with multicollinearity by, roughly speaking,
using mean-centered W_{i h}
.
We obtain, instead of \beta_{h j k}
, the deviation of
\beta_{h j k}
from the average across cell types.
Accordingly, the null hypothesis changes.
The original null hypothesis was \beta_{h j k} = 0
.
The null hypothesis when centered is
\beta_{h j k} - (\sum_{i h'} W_{i h'} \beta_{h' j k}) / (\sum_{i h'} W_{i h'}) = 0
.
It becomes difficult to detect a signal for a major cell type,
because \beta_{h j k}
would be close to the average across cell types.
The tests propdiff.log
and propdiff.logit
include
an additional preprocessing step that converts Y_{j i}
to f(Y_{j i})
.
Apart from the preprocessing, the computations are performed in linear scale.
As the preprocessing distorts the linearity between the dependent variable
and (the centered) W_{i h}
,
I actually think propdiff.identity
is better.
Value
A list with one element, which is named "coefficients".
The element gives the estimate, statistic, p.value in tibble format.
In order to transform the estimate for \alpha_{h j}
to the original scale,
apply plogis
for test = nls.logit
and
exp
for test = nls.log
.
The estimate for \beta_{h j k}
by test = nls.log
is
the natural logarithm of fold-change, not the log2.
If numerical convergence fails, NA
is returned for that marker.
References
Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353
See Also
ctcisQTL
Examples
data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients