tca {TCA}R Documentation

Fitting the TCA model

Description

Fits the TCA model for an input matrix of features by observations that are coming from a mixture of k sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). This function further allows to statistically test the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca allows to model the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. In addition, it allows to statistically test the effects of covariates and phenotypes on methylation at the cell-type level.

Usage

tca(
  X,
  W,
  C1 = NULL,
  C1.map = NULL,
  C2 = NULL,
  refit_W = FALSE,
  refit_W.features = NULL,
  refit_W.sparsity = 500,
  refit_W.sd_threshold = 0.02,
  tau = NULL,
  vars.mle = FALSE,
  constrain_mu = FALSE,
  parallel = FALSE,
  num_cores = NULL,
  max_iters = 10,
  log_file = "TCA.log",
  debug = FALSE,
  verbose = TRUE
)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations.

W

An n by k matrix of weights - the weights of k sources for each of the n mixtures (observations). All the weights must be positive and each row - corresponding to the weights of a single observation - must sum up to 1. Note that W must include row names and column names and that NA values are currently not supported. In cases where only noisy estimates of W are available, tca can be set to re-estimate W (see refit_W).

C1

An n by p1 design matrix of covariates that may affect the hidden source-specific values (possibly a different effect size in each source). Note that C1 must include row names and column names and should not include an intercept term. NA values are currently not supported. Note that each covariate in C1 results in k additional parameters in the model of each feature, therefore, in order to alleviate the possibility of model overfitting, it is advised to be mindful of the balance between the size of C1 and the sample size in X.

C1.map

An p1 by k matrix of 0/1 values, indicating for each of the p1 covariates in C1 whether to consider its potential effects on the values of each of the k sources (e.g., if position i,j in C1.map is 1 then the potential effect of the i-th covariate in C1 on the j-th source will be considered). If C1.map == NULL then effects for all covariates in C1 will be considered in each of the sources. Note that C1.map is available only if constrain_mu == TRUE.

C2

An n by p2 design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that C2 must include row names and column names and should not include an intercept term. NA values are currently not supported.

refit_W

A logical value indicating whether to re-estimate the input W under the TCA model.

refit_W.features

A vector with the names of the features in X to consider when re-estimating W (i.e. when refit_W == TRUE). This is useful since in general only a subset of the features in X are expected to be highly informative for estimating W. If refit_W.features == NULL then the ReFACTor algorithm will be used for performing feature selection (see also refit_W.sparsity, refit_W.sd_threshold).

refit_W.sparsity

A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating W (activated only if refit_W == TRUE and refit_W.features == NULL). Note that refit_W.sparsity must be lower or equal to the number of features in X. For more information, see the argument sparsity in refactor.

refit_W.sd_threshold

A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in X (activated only if refit_W == TRUE and refit_W.features == NULL). For more information, see the argument sd_threshold in refactor.

tau

A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If tau == NULL then tca will estimate tau.

vars.mle

A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If vars.mle == FALSE then tca will use a non-negative least-squares optimization for learning the variances. In practice, both approaches appear to provide highly correlated models, however, setting vars.mle == FALSEallows a substantial speedup.

constrain_mu

A logical value indicating whether to constrain the estimates of the mean parameters (i.e. \{\mu_{hj}\}; see details below), in which case they will be constrained to the range of the values in X. Note that if constrain_mu == TRUE then tca will not output p-values for the effect sizes of the covariates in C1 and in C2.

parallel

A logical value indicating whether to use parallel computing (possible when using a multi-core machine).

num_cores

A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.

max_iters

A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model (max_iters iterations will be used as long as the optimization does not converge earlier).

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == FALSE then no output file will be generated regardless of the value of log_file.

debug

A logical value indicating whether to set the logger to a more detailed debug level; set debug to TRUE before reporting issues.

verbose

A logical value indicating whether to print logs.

Details

The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by Z_{hj}^i the source-specific value of observation i in feature j source h, the TCA model assumes:

Z_{hj}^i \sim N(\mu_{hj},\sigma_{hj}^2)

where \mu_{hj},\sigma_{hj} represent the mean and standard deviation that are specific to feature j, source h. The model further assumes that the observed value of observation i in feature j is a mixture of k different sources:

X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + \epsilon_{ji}

where W_{ih} is the non-negative proportion of source h in the mixture of observation i such that \sum_{h=1}^kW_{ih} = 1, and \epsilon_{ji} \sim N(0,\tau^2) is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in W are, in general, unique for each individual, therefore each entry in X is coming from a unique distribution (i.e. a different mean and a different variance).

In cases where the true W is unknown, tca can be provided with noisy estimates of W and then re-estimate W as part of the optimization procedure (see argument refit_W). These initial estimates should not be random but rather capture the information in W to some extent. When the argument refit_W is used, it is typically the case that only a subset of the features should be used for re-estimating W. Therefore, when re-estimating W, tca performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation, assuming that such list of features that are most informative for estimating W exist (see argument refit_W.features).

Covariates that systematically affect the source-specific values Z_{hj}^i can be further considered (see argument C1). In that case, we assume:

Z_{hj}^i \sim N(\mu_{hj}+c^{(1)}_i \gamma_j^h,\sigma_{hj}^2)

where c^{(1)}_i and \gamma_j^h correspond to the p_1 covariate values of observation i (i.e. a row vector from C1) and their effect sizes, respectively.

Covariates that systematically affect the mixture values X_{ji}, such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument C2). In that case, we assume:

X_{ji} = \sum_{h=1}^k W_{ih}Z_{hj}^i + c^{(2)}_i \delta_j + \epsilon_{ij}

where c^{(2)}_i and \delta_j correspond to the p_2 covariate values of observation i (i.e. a row vector from C2) and their effect sizes, respectively.

Since the standard deviation of X_{ji} is specific to observation i and feature j, we can obtain p-values for the estimates of \gamma_j^h and \delta_j by dividing each observed data point x_{ji} by its estimated standard deviation and calculating T-statistics under a standard linear regression framework.

Value

A list with the estimated parameters of the model. This list can be then used as the input to other functions such as tcareg.

W

An n by k matrix of weights. If refit_W == TRUE then this is the re-estimated W, and otherwise this is the input W

mus_hat

An m by k matrix of estimates for the mean of each source in each feature.

sigmas_hat

An m by k matrix of estimates for the standard deviation of each source in each feature.

tau_hat

An estimate of the standard deviation of the i.i.d. component of variation in X. If an input value was provided for tau (i.e. tau != NULL) then tau_hat == tau.

gammas_hat

An m by k*p1 matrix of the estimated effects of the p1 covariates in C1 on each of the m features in X, where the first p1 columns are the source-specific effects of the p1 covariates on the first source, the following p1 columns are the source-specific effects on the second source and so on.

deltas_hat

An m by p2 matrix of the estimated effects of the p2 covariates in C2 on the mixture values of each of the m features in X.

gammas_hat_pvals

An m by k*p1 matrix of p-values for the estimates in gammas_hat (based on a T-test). Note that gammas_hat_pvals is set to NULL if constrain_mu == TRUE.

gammas_hat_pvals.joint

An m by p1 matrix of p-values for the joint effects (i.e. across all k sources) of each of the p1 covariates in C1 on each of the m features in X (based on a partial F-test). In other words, these are p-values for the combined statistical effects of each one of the p1 covariates on each of the m features under the TCA model. Note that gammas_hat_pvals.joint is set to NULL if constrain_mu == TRUE.

deltas_hat_pvals

An m by p2 matrix of p-values for the estimates in deltas_hat (based on a T-test). Note that deltas_hat_pvals is set to NULL if constrain_mu == TRUE.

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.

Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.

Examples

data <- test_data(100, 20, 3, 1, 1, 0.01)
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)


[Package TCA version 1.2.1 Index]