tca {TCA} | R Documentation |
Fitting the TCA model
Description
Fits the TCA model for an input matrix of features by observations that come from a mixture of k sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). This function further allows statistical testing of the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca models the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. In addition, it allows statistical testing of the effects of covariates and phenotypes on methylation at the cell-type level.
Usage
tca(
X,
W,
C1 = NULL,
C1.map = NULL,
C2 = NULL,
refit_W = FALSE,
refit_W.features = NULL,
refit_W.sparsity = 500,
refit_W.sd_threshold = 0.02,
tau = NULL,
vars.mle = FALSE,
constrain_mu = FALSE,
parallel = FALSE,
num_cores = NULL,
max_iters = 10,
log_file = "TCA.log",
debug = FALSE,
verbose = TRUE
)
Arguments
X |
An |
W |
An |
C1 |
An |
C1.map |
An |
C2 |
An |
refit_W |
A logical value indicating whether to re-estimate the input |
refit_W.features |
A vector with the names of the features in |
refit_W.sparsity |
A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating |
refit_W.sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
tau |
A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If |
vars.mle |
A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If |
constrain_mu |
A logical value indicating whether to constrain the estimates of the mean parameters (i.e. |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
max_iters |
A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model ( |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
Details
The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by Z_{hj}^i the source-specific value of observation i in feature j and source h. The TCA model assumes:
Z_{hj}^i \sim N(\mu_{hj},\sigma_{hj}^2)
where \mu_{hj} and \sigma_{hj} are the mean and standard deviation that are specific to feature j and source h. The model further assumes that the observed value of observation i in feature j is a mixture of k different sources:
X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + \epsilon_{ji}
where W_{ih} is the non-negative proportion of source h in the mixture of observation i, such that \sum_{h=1}^k W_{ih} = 1, and \epsilon_{ji} \sim N(0,\tau^2) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in W are, in general, unique to each observation; therefore, each entry in X comes from its own distribution (i.e. a different mean and a different variance).
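The generative model above can be sketched in base R. This is an illustration only, not code from the TCA package: all variable names and parameter values are hypothetical, and the per-entry variance formula additionally assumes that the source-specific values are independent across sources.

```r
# Sketch: simulate data from the mixture model (base R only; names are hypothetical).
set.seed(1)
n <- 200; m <- 5; k <- 3   # observations, features, sources
tau <- 0.01                # SD of the measurement noise

# Mixture proportions: n x k, non-negative, each row sums to 1.
W <- matrix(rexp(n * k), n, k)
W <- W / rowSums(W)

# Source-specific means and SDs: m x k (feature j, source h).
mus    <- matrix(runif(m * k, 0.2, 0.8), m, k)
sigmas <- matrix(runif(m * k, 0.01, 0.05), m, k)

# X is m x n (features by observations).
X <- matrix(0, m, n)
for (h in seq_len(k)) {
  # Z_h[j, i] ~ N(mus[j, h], sigmas[j, h]^2), drawn separately for every observation.
  Z_h <- matrix(rnorm(m * n, mean = mus[, h], sd = sigmas[, h]), m, n)
  # Weight column i (observation i) by W[i, h].
  X <- X + Z_h * rep(W[, h], each = m)
}
X <- X + matrix(rnorm(m * n, sd = tau), m, n)

# Implied mean and variance of every entry of X (assuming independent sources).
X_mean <- mus %*% t(W)                  # m x n
X_var  <- sigmas^2 %*% t(W^2) + tau^2   # m x n
```

Because every row of W differs, X_mean and X_var differ from entry to entry, which is the sense in which each entry of X has its own distribution.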
In cases where the true W is unknown, tca can be provided with noisy estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can be provided with a user-specified list of features to be used in the re-estimation, assuming that such a list of the features most informative for estimating W exists (see argument refit_W.features).
Covariates that systematically affect the source-specific values Z_{hj}^i can be further considered (see argument C1). In that case, we assume:
Z_{hj}^i \sim N(\mu_{hj}+c^{(1)}_i \gamma_j^h,\sigma_{hj}^2)
where c^{(1)}_i and \gamma_j^h correspond to the p_1 covariate values of observation i (i.e. a row vector from C1) and their effect sizes, respectively.
Covariates that systematically affect the mixture values X_{ji}, such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:
X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ji}
where c^{(2)}_i and \delta_j correspond to the p_2 covariate values of observation i (i.e. a row vector from C2) and their effect sizes, respectively.
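To make the roles of the two kinds of covariates concrete, the following base R sketch simulates data in which C1 shifts the source-specific means and C2 shifts the observed mixture directly. It is an illustration under assumed, hypothetical names and parameter values (gammas, deltas and so on are not objects of the TCA package).

```r
# Sketch: simulate the model with C1 and C2 covariates (base R; names are hypothetical).
set.seed(2)
n <- 100; m <- 4; k <- 2; p1 <- 1; p2 <- 1
tau <- 0.01

W  <- matrix(rexp(n * k), n, k); W <- W / rowSums(W)  # n x k proportions
C1 <- matrix(rnorm(n * p1), n, p1)                    # affect source-specific values
C2 <- matrix(rnorm(n * p2), n, p2)                    # affect the mixture directly

mus    <- matrix(runif(m * k, 0.2, 0.8), m, k)
sigmas <- matrix(0.02, m, k)
gammas <- array(rnorm(m * k * p1, sd = 0.05), dim = c(m, k, p1))  # gammas[j, h, ] = gamma_j^h
deltas <- matrix(rnorm(m * p2, sd = 0.05), m, p2)                 # deltas[j, ] = delta_j

X      <- matrix(0, m, n)
X_mean <- matrix(0, m, n)
for (h in seq_len(k)) {
  # Mean of Z_{hj}^i is mu_{hj} + c1_i %*% gamma_j^h (an m x n matrix).
  Z_mean <- mus[, h] + matrix(gammas[, h, ], m, p1) %*% t(C1)
  Z_h    <- Z_mean + sigmas[, h] * matrix(rnorm(m * n), m, n)
  w_h    <- rep(W[, h], each = m)   # weights column i by W[i, h]
  X      <- X + Z_h * w_h
  X_mean <- X_mean + Z_mean * w_h
}
# C2 effects and measurement noise are added at the level of the mixture.
X      <- X + deltas %*% t(C2) + matrix(rnorm(m * n, sd = tau), m, n)
X_mean <- X_mean + deltas %*% t(C2)
```

Note that the C1 effects enter the mixture weighted by W, whereas the C2 effects do not depend on W.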
Since the standard deviation of X_{ji} is specific to observation i and feature j, we can obtain p-values for the estimates of \gamma_j^h and \delta_j by dividing each observed data point x_{ji} by its estimated standard deviation and calculating t-statistics under a standard linear regression framework.
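One way to read the standardization step is as a weighted regression for a single feature j. The base R sketch below illustrates the idea and is not the package's internal code: it uses the true per-observation standard deviations rather than estimates, assumes independent sources, and all names and values are hypothetical.

```r
# Sketch: t-statistics for one feature after dividing by per-observation SDs.
set.seed(3)
n <- 300; k <- 2
W  <- matrix(rexp(n * k), n, k); W <- W / rowSums(W)
c1 <- rnorm(n)                     # a single C1 covariate
c2 <- rnorm(n)                     # a single C2 covariate
mu <- c(0.3, 0.7); sigma <- c(0.02, 0.05); tau <- 0.01
gamma <- c(0.05, 0)                # C1 affects source 1 only
delta <- 0.02

# SD and mean of x_{ji} for each observation i (feature j is fixed).
s      <- sqrt(as.vector(W^2 %*% sigma^2) + tau^2)
x_mean <- as.vector(W %*% mu) + as.vector(W %*% gamma) * c1 + delta * c2
x      <- rnorm(n, mean = x_mean, sd = s)

# The mean of x is linear in (mu, gamma, delta): design columns W, W * c1 and c2.
D <- cbind(W, W * c1, c2)
colnames(D) <- c("mu1", "mu2", "gamma1", "gamma2", "delta")

# Divide the response and every design row by s, then fit without an intercept.
ys  <- x / s
Ds  <- D / s
fit <- lm(ys ~ 0 + Ds)
coefs <- summary(fit)$coefficients
rownames(coefs) <- colnames(D)
coefs[, c("Estimate", "t value", "Pr(>|t|)")]
```

After the division, every standardized observation has unit variance, so the usual t-tests of ordinary least squares apply to each coefficient.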
Value
A list with the estimated parameters of the model. This list can then be used as the input to other functions, such as tcareg.
W |
An |
mus_hat |
An |
sigmas_hat |
An |
tau_hat |
An estimate of the standard deviation of the i.i.d. component of variation in |
gammas_hat |
An |
deltas_hat |
An |
gammas_hat_pvals |
An |
gammas_hat_pvals.joint |
An |
deltas_hat_pvals |
An |
References
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
Examples
data <- test_data(100, 20, 3, 1, 1, 0.01)
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)