R: Fitting the TCA model

tca {TCA}

R Documentation

Fitting the TCA model

Description

Fits the TCA model for an input matrix of features by observations that are coming from a mixture of k sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). This function further allows to statistically test the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca allows to model the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. In addition, it allows to statistically test the effects of covariates and phenotypes on methylation at the cell-type level.

Usage

tca(
  X,
  W,
  C1 = NULL,
  C1.map = NULL,
  C2 = NULL,
  refit_W = FALSE,
  refit_W.features = NULL,
  refit_W.sparsity = 500,
  refit_W.sd_threshold = 0.02,
  tau = NULL,
  vars.mle = FALSE,
  constrain_mu = FALSE,
  parallel = FALSE,
  num_cores = NULL,
  max_iters = 10,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)

Arguments

`X`	An `m` by `n` matrix of measurements of `m` features for `n` observations. Each column in `X` is assumed to be a mixture of `k` sources. Note that `X` must include row names and column names and that NA values are currently not supported. `X` should not include features that are constant across all observations.
`W`	An `n` by `k` matrix of weights - the weights of `k` sources for each of the `n` mixtures (observations). All the weights must be positive and each row - corresponding to the weights of a single observation - must sum up to 1. Note that `W` must include row names and column names and that NA values are currently not supported. In cases where only noisy estimates of `W` are available, `tca` can be set to re-estimate `W` (see `refit_W`).
`C1`	An `n` by `p1` design matrix of covariates that may affect the hidden source-specific values (possibly a different effect size in each source). Note that `C1` must include row names and column names and should not include an intercept term. NA values are currently not supported. Note that each covariate in `C1` results in `k` additional parameters in the model of each feature, therefore, in order to alleviate the possibility of model overfitting, it is advised to be mindful of the balance between the size of `C1` and the sample size in `X`.
`C1.map`	An `p1` by `k` matrix of 0/1 values, indicating for each of the `p1` covariates in `C1` whether to consider its potential effects on the values of each of the `k` sources (e.g., if position `i,j` in `C1.map` is 1 then the potential effect of the `i`-th covariate in `C1` on the j-th source will be considered). If `C1.map == NULL` then effects for all covariates in `C1` will be considered in each of the sources. Note that `C1.map` is available only if `constrain_mu == TRUE`.
`C2`	An `n` by `p2` design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that `C2` must include row names and column names and should not include an intercept term. NA values are currently not supported.
`refit_W`	A logical value indicating whether to re-estimate the input `W` under the TCA model.
`refit_W.features`	A vector with the names of the features in `X` to consider when re-estimating `W` (i.e. when `refit_W == TRUE`). This is useful since in general only a subset of the features in `X` are expected to be highly informative for estimating `W`. If `refit_W.features == NULL` then the ReFACTor algorithm will be used for performing feature selection (see also `refit_W.sparsity, refit_W.sd_threshold`).
`refit_W.sparsity`	A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating `W` (activated only if `refit_W == TRUE` and `refit_W.features == NULL`). Note that `refit_W.sparsity` must be lower or equal to the number of features in `X`. For more information, see the argument `sparsity` in refactor.
`refit_W.sd_threshold`	A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in `X` (activated only if `refit_W == TRUE` and `refit_W.features == NULL`). For more information, see the argument `sd_threshold` in refactor.
`tau`	A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If `tau == NULL` then `tca` will estimate `tau`.
`vars.mle`	A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If `vars.mle == FALSE` then `tca` will use a non-negative least-squares optimization for learning the variances. In practice, both approaches appear to provide highly correlated models, however, setting `vars.mle == FALSE`allows a substantial speedup.
`constrain_mu`	A logical value indicating whether to constrain the estimates of the mean parameters (i.e. `\{\mu_{hj}\}`; see details below), in which case they will be constrained to the range of the values in `X`. Note that if `constrain_mu == TRUE` then `tca` will not output p-values for the effect sizes of the covariates in `C1` and in `C2`.
`parallel`	A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
`num_cores`	A numeric value indicating the number of cores to use (activated only if `parallel == TRUE`). If `num_cores == NULL` then all available cores except for one will be used.
`max_iters`	A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model (`max_iters` iterations will be used as long as the optimization does not converge earlier).
`log_file`	A path to an output log file. Note that if the file `log_file` already exists then logs will be appended to the end of the file. Set `log_file` to `NULL` to prevent output from being saved into a file; note that if `verbose == FALSE` then no output file will be generated regardless of the value of `log_file`.
`debug`	A logical value indicating whether to set the logger to a more detailed debug level; set `debug` to `TRUE` before reporting issues.
`verbose`	A logical value indicating whether to print logs.

Details

The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by Z_{hj}^i the source-specific value of observation i in feature j source h, the TCA model assumes:

Z_{hj}^i \sim N(\mu_{hj},\sigma_{hj}^2)

where \mu_{hj},\sigma_{hj} represent the mean and standard deviation that are specific to feature j, source h. The model further assumes that the observed value of observation i in feature j is a mixture of k different sources:

X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + \epsilon_{ji}

where W_{ih} is the non-negative proportion of source h in the mixture of observation i such that \sum_{h=1}^kW_{ih} = 1, and \epsilon_{ji} \sim N(0,\tau^2) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in W are, in general, unique for each individual, therefore each entry in X is coming from a unique distribution (i.e. a different mean and a different variance).

In cases where the true W is unknown, tca can be provided with noisy estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation, assuming that such list of features that are most informative for estimating W exist (see argument refit_W.features).

Covariates that systematically affect the source-specific values Z_{hj}^i can be further considered (see argument C1). In that case, we assume:

Z_{hj}^i \sim N(\mu_{hj}+c^{(1)}_i \gamma_j^h,\sigma_{hj}^2)

where c^{(1)}_i and \gamma_j^h correspond to the p_1 covariate values of observation i (i.e. a row vector from C1) and their effect sizes, respectively.

Covariates that systematically affect the mixture values X_{ji}, such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:

X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ij}

where c^{(2)}_i and \delta_j correspond to the p_2 covariate values of observation i (i.e. a row vector from C2) and their effect sizes, respectively.

Since the standard deviation of X_{ji} is specific to observation i and feature j, we can obtain p-values for the estimates of \gamma_j^h and \delta_j by dividing each observed data point x_{ji} by its estimated standard deviation and calculating T-statistics under a standard linear regression framework.

Value

A list with the estimated parameters of the model. This list can be then used as the input to other functions such as tcareg.

`W`	An `n` by `k` matrix of weights. If `refit_W == TRUE` then this is the re-estimated `W`, and otherwise this is the input `W`
`mus_hat`	An `m` by `k` matrix of estimates for the mean of each source in each feature.
`sigmas_hat`	An `m` by `k` matrix of estimates for the standard deviation of each source in each feature.
`tau_hat`	An estimate of the standard deviation of the i.i.d. component of variation in `X`. If an input value was provided for `tau` (i.e. `tau != NULL`) then `tau_hat == tau`.
`gammas_hat`	An `m` by `k*p1` matrix of the estimated effects of the `p1` covariates in `C1` on each of the `m` features in `X`, where the first `p1` columns are the source-specific effects of the `p1` covariates on the first source, the following `p1` columns are the source-specific effects on the second source and so on.
`deltas_hat`	An `m` by `p2` matrix of the estimated effects of the `p2` covariates in `C2` on the mixture values of each of the `m` features in `X`.
`gammas_hat_pvals`	An `m` by `k*p1` matrix of p-values for the estimates in `gammas_hat` (based on a T-test). Note that `gammas_hat_pvals` is set to NULL if `constrain_mu == TRUE`.
`gammas_hat_pvals.joint`	An `m` by `p1` matrix of p-values for the joint effects (i.e. across all `k` sources) of each of the `p1` covariates in `C1` on each of the `m` features in `X` (based on a partial F-test). In other words, these are p-values for the combined statistical effects of each one of the `p1` covariates on each of the `m` features under the TCA model. Note that `gammas_hat_pvals.joint` is set to NULL if `constrain_mu == TRUE`.
`deltas_hat_pvals`	An `m` by `p2` matrix of p-values for the estimates in `deltas_hat` (based on a T-test). Note that `deltas_hat_pvals` is set to NULL if `constrain_mu == TRUE`.

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Examples

data <- test_data(100, 20, 3, 1, 1, 0.01)
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)

[Package TCA version 1.2.1 Index]