ctassoc {omicwas} | R Documentation |
Cell-Type-Specific Association Testing
Description
Cell-Type-Specific Association Testing
Usage
ctassoc(
X,
W,
Y,
C = NULL,
test = "full",
regularize = FALSE,
num.cores = 1,
chunk.size = 1000,
seed = 123
)
Arguments
X |
Matrix (or vector) of traits; samples x traits. |
W |
Matrix of cell type composition; samples x cell types. |
Y |
Matrix (or vector) of bulk omics measurements; markers x samples. |
C |
Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric. |
test |
Statistical test to apply; either |
regularize |
Whether to apply Tikhonov (ie ridge) regularization
to |
num.cores |
Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored. |
chunk.size |
The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size. |
seed |
Seed for random number generation. |
Details
Let the indexes be
for cell type,
for sample,
for marker (CpG site or gene),
for each trait that has cell-type-specific effect,
and
for each trait that has a uniform effect across cell types.
The input data are
,
,
and
,
where
can be omitted.
and
are the values for two types of traits,
showing effects that are cell-type-specific or not, respectively.
Thus, calling
and
as "traits" and "covariates"
gives a rough idea, but is not strictly correct.
represents the cell type composition and
represents the marker level,
such as methylation or gene expression.
For each tissue sample, the cell type proportion
is the proportion of each cell type in the bulk tissue,
which is measured or imputed beforehand.
The marker level
in bulk tissue is measured and provided as input.
The parameters we estimate are
the cell-type-specific trait effect ,
the tissue-uniform trait effect
,
and the basal marker level
in each cell type.
We first describe the conventional linear regression models.
For marker in sample
,
the maker level specific to cell type
is
This is a representative value rather than a mean, because we do not model
a probability distribution for cell-type-specific expression.
The bulk tissue marker level is the average weighted by ,
The statistical model is
The error of the marker level is is noramlly distributed with variance
, independently among samples.
The full
model is the linear regression
The marginal
model tests the trait association only in one
cell type , under the linear regression,
The nonlinear model simultaneously analyze cell type composition in
linear scale and differential expression/methylation in log/logit scale.
The normalizing function is the natural logarithm = log for gene
expression, and
= logit for methylation. Conventional linear regression
can be formulated by defining
as the identity function. The three models
are named
nls.log
, nls.logit
and nls.identity
.
We denote the inverse function of by
;
= exp for
gene expression, and
= logistic for methylation.
The mean normalized marker level of marker
in sample
becomes
The statistical model is
The error of the marker level is is noramlly distributed with variance
, independently among samples.
The ridge regression aims to cope with multicollinearity of
the interacting terms .
Ridge regression is fit by minimizing the residual sum of squares (RSS) plus
, where
is the
regularization parameter.
The propdiff tests try to cope with multicollinearity by, roughly speaking,
using mean-centered .
We obtain, instead of
, the deviation of
from the average across cell types.
Accordingly, the null hypothesis changes.
The original null hypothesis was
.
The null hypothesis when centered is
.
It becomes difficult to detect a signal for a major cell type,
because
would be close to the average across cell types.
The tests
propdiff.log
and propdiff.logit
include
an additional preprocessing step that converts to
.
Apart from the preprocessing, the computations are performed in linear scale.
As the preprocessing distorts the linearity between the dependent variable
and (the centered)
,
I actually think
propdiff.identity
is better.
Value
A list with one element, which is named "coefficients".
The element gives the estimate, statistic, p.value in tibble format.
In order to transform the estimate for to the original scale,
apply
plogis
for test = nls.logit
and
exp
for test = nls.log
.
The estimate for by
test = nls.log
is
the natural logarithm of fold-change, not the log2.
If numerical convergence fails, NA
is returned for that marker.
References
Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353
See Also
ctcisQTL
Examples
data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients