tca {TCA} | R Documentation |
Fitting the TCA model
Description
Fits the TCA model for an input matrix of features by observations that come from a mixture of k sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). The function can also statistically test the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca models the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. It can also statistically test the effects of covariates and phenotypes on methylation at the cell-type level.
Usage
tca(
  X,
  W,
  C1 = NULL,
  C1.map = NULL,
  C2 = NULL,
  refit_W = FALSE,
  refit_W.features = NULL,
  refit_W.sparsity = 500,
  refit_W.sd_threshold = 0.02,
  tau = NULL,
  vars.mle = FALSE,
  constrain_mu = FALSE,
  parallel = FALSE,
  num_cores = NULL,
  max_iters = 10,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)
Arguments
X |
An |
W |
An |
C1 |
An |
C1.map |
An |
C2 |
An |
refit_W |
A logical value indicating whether to re-estimate the input |
refit_W.features |
A vector with the names of the features in |
refit_W.sparsity |
A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating |
refit_W.sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
tau |
A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If |
vars.mle |
A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If |
constrain_mu |
A logical value indicating whether to constrain the estimates of the mean parameters (i.e. |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
max_iters |
A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model ( |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
Details
The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by $Z_{hj}^i$ the source-specific value of observation $i$ in feature $j$, source $h$; the TCA model assumes:
$$Z_{hj}^i \sim N(\mu_{hj}, \sigma_{hj}^2)$$
where $\mu_{hj}, \sigma_{hj}$ represent the mean and standard deviation that are specific to feature $j$, source $h$. The model further assumes that the observed value of observation $i$ in feature $j$ is a mixture of $k$ different sources:
$$X_{ji} = \sum_{h=1}^k W_{hi} Z_{hj}^i + \epsilon_{ji}$$
where $W_{hi}$ is the non-negative proportion of source $h$ in the mixture of observation $i$ such that $\sum_{h=1}^k W_{hi} = 1$, and $\epsilon_{ji} \sim N(0, \tau^2)$ is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in $W$ are, in general, unique for each individual; therefore each entry in $X$ comes from a unique distribution (i.e. a different mean and a different variance).
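The generative model described above can be simulated in a few lines of base R. This is a minimal sketch rather than code from the TCA package: it assumes normally distributed source-specific values and noise, stores W with observations in rows, and all dimensions, parameter ranges, and variable names (`mus`, `sigmas`, `Z`) are chosen only for illustration.

```r
set.seed(1)
m <- 5; n <- 8; k <- 3   # features, observations, sources
tau <- 0.01              # standard deviation of the measurement noise

# Mixture proportions, stored here as n by k: non-negative, each row sums to 1
W <- matrix(runif(n * k), nrow = n)
W <- W / rowSums(W)

# Feature- and source-specific means and standard deviations (m by k)
mus    <- matrix(runif(m * k, 0.2, 0.8), nrow = m)
sigmas <- matrix(runif(m * k, 0.01, 0.05), nrow = m)

# Z[j, h, i]: hidden value of source h in feature j for observation i
Z <- array(rnorm(m * k * n, mean = rep(mus, n), sd = rep(sigmas, n)),
           dim = c(m, k, n))

# Observed mixture: X[j, i] = sum over h of W[i, h] * Z[j, h, i], plus noise
X <- sapply(seq_len(n), function(i) Z[, , i] %*% W[i, ])
X <- X + matrix(rnorm(m * n, sd = tau), nrow = m)
```

Because every observation has its own row of W, each entry of X is drawn with its own mean and variance, which is the point made in the paragraph above.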
In cases where the true W is unknown, tca can be provided with noisy estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but should capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can be provided with a user-specified list of features to be used in the re-estimation, assuming that such a list of the features that are most informative for estimating W exists (see argument refit_W.features).
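As a rough illustration of what excluding low-variance features before re-estimating W can look like, the sketch below drops the rows of a feature-by-observation matrix whose standard deviation falls below a threshold. It is written in base R on made-up data and is not the package's internal procedure; the helper name `filter_low_variance` is hypothetical, and the default threshold simply mirrors the default of refit_W.sd_threshold shown in the usage.

```r
# Hypothetical helper: keep only the features (rows) whose standard deviation
# across observations is at least `sd_threshold`.
filter_low_variance <- function(X, sd_threshold = 0.02) {
  feature_sds <- apply(X, 1, sd)
  X[feature_sds >= sd_threshold, , drop = FALSE]
}

set.seed(2)
X <- rbind(
  matrix(rnorm(3 * 50, mean = 0.5, sd = 0.10), nrow = 3),   # variable features
  matrix(rnorm(2 * 50, mean = 0.5, sd = 0.001), nrow = 2)   # nearly constant features
)
rownames(X) <- paste0("feature", 1:5)

# Only the three variable features survive the filter
X_kept <- filter_low_variance(X)
rownames(X_kept)
```

The names of the retained rows could then be passed on as a user-specified feature list, assuming those features are also informative for estimating W.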
Covariates that systematically affect the source-specific values $Z_{hj}^i$ can be further considered (see argument C1). In that case, we assume:
$$Z_{hj}^i \sim N(\mu_{hj} + c^{(1)}_i \gamma_j^h, \sigma_{hj}^2)$$
where $c^{(1)}_i$ and $\gamma_j^h$ correspond to the $p_1$ covariate values of observation $i$ (i.e. a row vector from C1) and their effect sizes, respectively.
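To make the role of C1 concrete, the following base R sketch simulates the source-specific values of a single feature when covariates shift their means. It assumes normally distributed values; the effect sizes, dimensions, and variable names (`gammas`, `Z_mean`) are illustrative assumptions, not values or objects from the package.

```r
set.seed(3)
n <- 200; k <- 3; p1 <- 2

C1    <- matrix(rnorm(n * p1), nrow = n)   # n by p1 covariate values
mu    <- c(0.2, 0.5, 0.8)                  # source-specific means for one feature
sigma <- c(0.02, 0.02, 0.02)               # source-specific standard deviations

# p1 by k effect sizes: only covariate 1 affects source 1
gammas <- matrix(0, nrow = p1, ncol = k)
gammas[1, 1] <- 0.05

# Mean of source h for observation i: mu[h] + C1[i, ] %*% gammas[, h]
Z_mean <- sweep(C1 %*% gammas, 2, mu, "+")                     # n by k
Z      <- Z_mean + matrix(rnorm(n * k), nrow = n) %*% diag(sigma)
```

In this simulation the first covariate is visible only in the first source, which is the kind of source-specific effect the C1 argument is meant to capture.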
Covariates that systematically affect the mixture values $X_{ji}$, such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:
$$X_{ji} = \sum_{h=1}^k W_{hi} Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ji}$$
where $c^{(2)}_i$ and $\delta_j$ correspond to the $p_2$ covariate values of observation $i$ (i.e. a row vector from C2) and their effect sizes, respectively.
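The effect of C2 can be sketched in the same way: the covariate enters the observed mixture directly rather than through any particular source. The base R example below simulates one feature under that assumption; it assumes normal distributions, stores W with observations in rows, and the value of `delta` and the other names are hypothetical choices for illustration.

```r
set.seed(4)
n <- 200; k <- 3; p2 <- 1
tau <- 0.01

# Mixture proportions (n by k here), each row sums to 1
W <- matrix(runif(n * k), nrow = n)
W <- W / rowSums(W)

# Source-specific values of one feature for every observation (n by k)
Z <- matrix(rnorm(n * k, mean = rep(c(0.2, 0.5, 0.8), each = n), sd = 0.02),
            nrow = n)

C2    <- matrix(rnorm(n * p2), nrow = n)   # n by p2 covariate values
delta <- 0.03                              # effect of the covariate on the mixture

# Observed value: weighted sum of sources, plus the global covariate effect, plus noise
x <- rowSums(W * Z) + drop(C2 %*% delta) + rnorm(n, sd = tau)
```

A covariate of this kind shifts the observed value by the same amount regardless of the mixture proportions, which is why it suits technical variables such as batch.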
Since the standard deviation of $X_{ji}$ is specific to observation $i$ and feature $j$, we can obtain p-values for the estimates of $\gamma_j^h$ and $\delta_j$ by dividing each observed data point $x_{ji}$ by its estimated standard deviation and calculating T-statistics under a standard linear regression framework.
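The scaling idea in the last paragraph can be sketched for a single feature in base R. This is an illustration under simplifying assumptions, not the package's implementation: W and the variance parameters are treated as known (tca estimates them), values are normally distributed, and the design matrix `D` and all parameter values are made up. Under the model, the variance of an observation is the sum over sources of the squared proportion times the source variance, plus the noise variance.

```r
set.seed(5)
n <- 500; k <- 2
tau <- 0.01

W <- matrix(runif(n * k), nrow = n)
W <- W / rowSums(W)                        # n by k, rows sum to 1
c1 <- rnorm(n)                             # covariate acting on the sources
c2 <- rnorm(n)                             # covariate acting on the mixture
mu <- c(0.3, 0.7); sigma <- c(0.02, 0.05)
gamma <- c(0.04, 0)                        # c1 affects source 1 only
delta <- 0.02

Z <- sapply(1:k, function(h) rnorm(n, mean = mu[h] + c1 * gamma[h], sd = sigma[h]))
x <- rowSums(W * Z) + c2 * delta + rnorm(n, sd = tau)

# Observation-specific standard deviation implied by the model
sd_x <- sqrt(as.vector(W^2 %*% sigma^2) + tau^2)

# Design: source means (W), source-specific covariate effects (W * c1), global effect (c2)
D <- cbind(W, W * c1, c2)
colnames(D) <- c("mu1", "mu2", "gamma1", "gamma2", "delta")

# Dividing row i of the response and the design by sd_x[i] equalizes the noise
# variance, so ordinary least squares T-statistics apply
y_scaled <- x / sd_x
D_scaled <- D / sd_x
fit   <- lm(y_scaled ~ 0 + D_scaled)
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
```

The entries of `pvals` for the `gamma` and `delta` columns play the role of the p-values described above; in practice the standard deviations would be estimates rather than known quantities.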
Value
A list with the estimated parameters of the model. This list can then be used as the input to other functions such as tcareg.
W |
An |
mus_hat |
An |
sigmas_hat |
An |
tau_hat |
An estimate of the standard deviation of the i.i.d. component of variation in |
gammas_hat |
An |
deltas_hat |
An |
gammas_hat_pvals |
An |
gammas_hat_pvals.joint |
An |
deltas_hat_pvals |
An |
References
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
Examples
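# Simulate a small data set with the package's test_data function
data <- test_data(100, 20, 3, 1, 1, 0.01)
# Fit the TCA model, accounting for both kinds of covariates (C1 and C2)
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)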