| ctassoc {omicwas} | R Documentation |
Cell-Type-Specific Association Testing
Description
Cell-Type-Specific Association Testing
Usage
ctassoc(
X,
W,
Y,
C = NULL,
test = "full",
regularize = FALSE,
num.cores = 1,
chunk.size = 1000,
seed = 123
)
Arguments
X |
Matrix (or vector) of traits; samples x traits. |
W |
Matrix of cell type composition; samples x cell types. |
Y |
Matrix (or vector) of bulk omics measurements; markers x samples. |
C |
Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric. |
test |
Statistical test to apply; either |
regularize |
Whether to apply Tikhonov (ie ridge) regularization
to |
num.cores |
Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored. |
chunk.size |
The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size. |
seed |
Seed for random number generation. |
Details
Let the indexes be
h for cell type, i for sample,
j for marker (CpG site or gene),
k for each trait that has cell-type-specific effect,
and l for each trait that has a uniform effect across cell types.
The input data are X_{i k}, C_{i l}, W_{i h} and Y_{j i},
where C_{i l} can be omitted.
X_{i k} and C_{i l} are the values for two types of traits,
showing effects that are cell-type-specific or not, respectively.
Thus, calling X_{i k} and C_{i l} as "traits" and "covariates"
gives a rough idea, but is not strictly correct.
W_{i h} represents the cell type composition and
Y_{j i} represents the marker level,
such as methylation or gene expression.
For each tissue sample, the cell type proportion W_{i h}
is the proportion of each cell type in the bulk tissue,
which is measured or imputed beforehand.
The marker level Y_{j i} in bulk tissue is measured and provided as input.
The parameters we estimate are
the cell-type-specific trait effect \beta_{h j k},
the tissue-uniform trait effect \gamma_{j l},
and the basal marker level \alpha_{h j} in each cell type.
We first describe the conventional linear regression models.
For marker j in sample i,
the maker level specific to cell type h is
\alpha_{h j} + \sum_k \beta_{h j k} * X_{i k}.
This is a representative value rather than a mean, because we do not model
a probability distribution for cell-type-specific expression.
The bulk tissue marker level is the average weighted by W_{i h},
\mu_{j i} = \sum_h W_{i h} [ \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} ] +
\sum_l \gamma_{j l} C_{i l}.
The statistical model is
Y_{j i} = \mu_{j i} + \epsilon_{j i},
\epsilon_{j i} ~ N(0, \sigma^2_j).
The error of the marker level is is noramlly distributed with variance
\sigma^2_j, independently among samples.
The full model is the linear regression
Y_{j i} = (\sum_h \alpha_{h j} * W_{i h}) +
(\sum_{h k} \beta_{h j k} * W_{i h} * X_{i k}) +
(\sum_l \gamma_{j l} * C_{i l}) +
error.
The marginal model tests the trait association only in one
cell type h, under the linear regression,
Y_{j i} = (\sum_{h'} \alpha_{h' j} * W_{i h'}) +
(\sum_k \beta_{h j k} * W_{i h} * X_{i k}) +
(\sum_l \gamma_{j l} * C_{i l}) +
error.
The nonlinear model simultaneously analyze cell type composition in
linear scale and differential expression/methylation in log/logit scale.
The normalizing function is the natural logarithm f = log for gene
expression, and f = logit for methylation. Conventional linear regression
can be formulated by defining f as the identity function. The three models
are named nls.log, nls.logit and nls.identity.
We denote the inverse function of f by g; g = exp for
gene expression, and g = logistic for methylation.
The mean normalized marker level of marker j in sample i becomes
\mu_{j i} = f(\sum_h W_{i h} g( \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} )) +
\sum_l \gamma_{j l} C_{i l}.
The statistical model is
f(Y_{j i}) = \mu_{j i} + \epsilon_{j i},
\epsilon_{j i} ~ N(0, \sigma^2_j).
The error of the marker level is is noramlly distributed with variance
\sigma^2_j, independently among samples.
The ridge regression aims to cope with multicollinearity of
the interacting terms W_{i h} * X_{i k}.
Ridge regression is fit by minimizing the residual sum of squares (RSS) plus
\lambda \sum_{h k} \beta_{h j k}^2, where \lambda > 0 is the
regularization parameter.
The propdiff tests try to cope with multicollinearity by, roughly speaking,
using mean-centered W_{i h}.
We obtain, instead of \beta_{h j k}, the deviation of
\beta_{h j k} from the average across cell types.
Accordingly, the null hypothesis changes.
The original null hypothesis was \beta_{h j k} = 0.
The null hypothesis when centered is
\beta_{h j k} - (\sum_{i h'} W_{i h'} \beta_{h' j k}) / (\sum_{i h'} W_{i h'}) = 0.
It becomes difficult to detect a signal for a major cell type,
because \beta_{h j k} would be close to the average across cell types.
The tests propdiff.log and propdiff.logit include
an additional preprocessing step that converts Y_{j i} to f(Y_{j i}).
Apart from the preprocessing, the computations are performed in linear scale.
As the preprocessing distorts the linearity between the dependent variable
and (the centered) W_{i h},
I actually think propdiff.identity is better.
Value
A list with one element, which is named "coefficients".
The element gives the estimate, statistic, p.value in tibble format.
In order to transform the estimate for \alpha_{h j} to the original scale,
apply plogis for test = nls.logit and
exp for test = nls.log.
The estimate for \beta_{h j k} by test = nls.log is
the natural logarithm of fold-change, not the log2.
If numerical convergence fails, NA is returned for that marker.
References
Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353
See Also
ctcisQTL
Examples
data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients