tcareg {TCA}    R Documentation

Fitting a TCA regression model

Description

TCA regression tests, under several types of statistical tests, the effects of source-specific values on an outcome of interest (or on mediating components thereof). For example, for tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tcareg can test for cell-type-specific effects of methylation on outcomes of interest (or on mediating components thereof).

Usage

tcareg(
  X,
  tca.mdl,
  y,
  C3 = NULL,
  test = "marginal_conditional",
  null_model = NULL,
  alternative_model = NULL,
  save_results = FALSE,
  fast_mode = TRUE,
  output = "TCA",
  sort_results = FALSE,
  parallel = FALSE,
  num_cores = NULL,
  log_file = "TCA.log",
  features_metadata = NULL,
  debug = FALSE,
  verbose = TRUE
)

Arguments

X

An m by n matrix of measurements of m features for n observations. Each column in X is assumed to be a mixture of k sources. Note that X must include row names and column names and that NA values are currently not supported. X should not include features that are constant across all observations.
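The requirements on X can be checked before fitting. The following is a minimal base-R sketch; check_X is a hypothetical helper written for illustration and is not part of the TCA package, and the toy matrix is made up.

```r
# Hypothetical helper (not part of TCA): validate X and drop constant features.
check_X <- function(X) {
  stopifnot(is.matrix(X), is.numeric(X))
  # X must carry both row names (features) and column names (observations).
  stopifnot(!is.null(rownames(X)), !is.null(colnames(X)))
  # NA values are not supported.
  stopifnot(!anyNA(X))
  # Identify features (rows) that are constant across all observations.
  is_constant <- apply(X, 1, function(r) max(r) == min(r))
  X[!is_constant, , drop = FALSE]
}

set.seed(1)
X <- matrix(runif(4 * 5), nrow = 4,
            dimnames = list(paste0("site", 1:4), paste0("ind", 1:5)))
X["site2", ] <- 0.5   # make one feature constant on purpose
X_clean <- check_X(X)
rownames(X_clean)
```

Here the constant feature is dropped silently; raising an error instead may be preferable if a constant feature indicates a problem upstream.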

tca.mdl

The value returned by applying tca to X.

y

An n by 1 matrix of an outcome of interest for each of the n observations in X. Note that y must include row names and column names and that NA values are currently not supported.

C3

An n by p3 design matrix of covariates that may affect y. Note that C3 must include row names and column names and should not include an intercept term. NA values are currently not supported.
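One way to prepare y and C3 in the expected shape is sketched below using base R only. The covariate names (age, sex), the outcome name, and the observation identifiers are made up for illustration; model.matrix is used here as one convenient option, not as something the package requires.

```r
set.seed(1)
n <- 6
obs_ids <- paste0("ind", 1:n)

# Hypothetical covariates for the n observations.
covars <- data.frame(age = round(rnorm(n, mean = 50, sd = 10)),
                     sex = factor(c("F", "M", "F", "M", "F", "M")),
                     row.names = obs_ids)

# model.matrix adds an "(Intercept)" column as its first column;
# drop it, since C3 should not include an intercept term.
C3 <- model.matrix(~ age + sex, data = covars)[, -1, drop = FALSE]

# y as an n by 1 matrix with both row names and a column name.
y <- matrix(rexp(n, rate = 0.1), ncol = 1,
            dimnames = list(obs_ids, "outcome"))
```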

test

A character string with the type of test to perform on each of the features in X; one of the following options: 'marginal', 'marginal_conditional', 'joint', 'single_effect', or 'custom'. Setting 'marginal' or 'marginal_conditional' corresponds to testing each feature in X for a statistical relation between y and each of the k sources separately; for any particular source under test, the 'marginal_conditional' option further accounts for possible effects of the remaining k-1 sources ('marginal' will therefore tend to be more powerful in discovering truly related features, but at the same time more prone to misidentifying which sources are truly related if sources are highly correlated). Setting 'joint' or 'single_effect' corresponds to testing each feature for an overall statistical relation with y, while modeling source-specific effects; the latter option further assumes that the source-specific effects are the same within each feature ('single_effect' uses only one degree of freedom and will therefore be more powerful when the assumption of a single effect within a feature holds). Finally, 'custom' corresponds to testing each feature in X for a statistical relation with y under a user-specified alternative model (alternative_model) with respect to a user-specified null model (null_model); for example, to test for the combined (potentially different) effects of sources 1 and 2 while accounting for the (potentially different) effects of sources 3 and 4, set the null model to be sources 3, 4 and the alternative model to be sources 1, 2, 3, 4. To indicate that the null model assumes no effect for any of the sources, set null_model to NULL.

null_model

A vector with a subset of the names of the sources in tca.mdl$W to be used as a null model (activated only if test == 'custom'). Note that the null model must be nested within the alternative model; set null_model to be NULL for indicating no effect for any of the sources under the null model.

alternative_model

A vector with a subset (or all) of the names of the sources in tca.mdl$W to be used as an alternative model (activated only if test == 'custom').
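The nesting requirement between null_model and alternative_model can be verified up front. The sketch below uses a hypothetical helper, is_nested, that is not part of the TCA package; the vector sources stands in for the source names in tca.mdl$W (assumed here to be "1", "2", "3", as in the examples on this page).

```r
# Hypothetical helper (not part of TCA): check that both models use known
# source names and that the null model is nested within the alternative model.
is_nested <- function(null_model, alternative_model, sources) {
  all(alternative_model %in% sources) &&
    all(null_model %in% alternative_model)
}

sources <- c("1", "2", "3")  # stand-in for the source names in tca.mdl$W

ok_nested   <- is_nested(c("3"), c("1", "2", "3"), sources)
# A NULL null model (no effect for any source) is nested in any alternative:
# NULL %in% x is logical(0), and all(logical(0)) is TRUE.
ok_null     <- is_nested(NULL, c("1", "2"), sources)
# Source "3" appears in the null model but not in the alternative model.
bad_nesting <- is_nested(c("3"), c("1", "2"), sources)
```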

save_results

A logical value indicating whether to save the returned results in a file. If test == 'marginal' or (fast_mode == TRUE and test == 'marginal_conditional') then k files will be saved (one for the results of each source).

fast_mode

A logical value indicating whether to use a fast version of TCA regression, in which source-specific-values are first estimated using the tensor function and then tested under a standard regression framework (see more details below).

output

Prefix for output files (activated only if save_results == TRUE).

sort_results

A logical value indicating whether to sort the results by their p-value (i.e. features with lower p-value will appear first in the results). This option is not available if fast_mode == TRUE and test == "marginal_conditional".

parallel

A logical value indicating whether to use parallel computing (possible when using a multi-core machine).

num_cores

A numeric value indicating the number of cores to use (activated only if parallel == TRUE). If num_cores == NULL then all available cores except for one will be used.
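The default of "all available cores except for one" can be reproduced explicitly, which is useful when a value needs to be logged or capped. This is a sketch using the parallel package that ships with R; whether tcareg computes its default in exactly this way is an assumption.

```r
library(parallel)

# detectCores() may return NA on some platforms; fall back to one core.
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free, but never go below one.
num_cores <- max(1L, available - 1L)
```

The resulting value could then be passed as num_cores together with parallel = TRUE.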

log_file

A path to an output log file. Note that if the file log_file already exists then logs will be appended to the end of the file. Set log_file to NULL to prevent output from being saved into a file; note that if verbose == FALSE then no output file will be generated regardless of the value of log_file.

features_metadata

A path to a csv file containing metadata about the features in X that will be added to the output files (activated only if save_results == TRUE). Each row in the metadata file should correspond to one feature in X (with the row name being the feature identifier, as it appears in the rows of X) and each column should correspond to one metadata descriptor (with an appropriate column name). Features that do not exist in X will be ignored and features in X with missing metadata information will show missing values.
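A metadata file in the described layout can be written with base R as sketched below. The feature identifiers, column names, and values are made up for illustration, and the alignment step at the end only mimics the described behavior (missing metadata shows up as missing values); how tcareg reads and merges the file internally is not shown in this page.

```r
# Hypothetical metadata: one row per feature, one column per descriptor.
metadata <- data.frame(chromosome = c("chr1", "chr7"),
                       gene = c("GENE_A", "GENE_B"),
                       row.names = c("site1", "site3"))

path <- tempfile(fileext = ".csv")
write.csv(metadata, file = path)  # row names are written as the first column

# Read the file back, using the first column as feature identifiers.
read_back <- read.csv(path, row.names = 1, stringsAsFactors = FALSE)

# Align to the features in X; "site2" has no metadata and yields NA values.
features_in_X <- c("site1", "site2", "site3")
aligned <- read_back[features_in_X, , drop = FALSE]
rownames(aligned) <- features_in_X
```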

debug

A logical value indicating whether to set the logger to a more detailed debug level; set debug to TRUE before reporting issues.

verbose

A logical value indicating whether to print logs.

Details

TCA models Z_{hj}^i as the source-specific value of observation i in feature j coming from source h (see tca for more details). A TCA regression model tests an outcome Y for a linear statistical relation with the source-specific values of a feature j by assuming:

Y_i = \alpha_{j,0} + \sum_{h=1}^k\beta_{hj} Z_{hj}^i + c_i^{(3)}\alpha_{j} + e_i

where \alpha_{j,0} is an intercept term, \beta_{hj} is the effect of source h, c_i^{(3)} and \alpha_j correspond to the p_3 covariate values of observation i (i.e. a row vector from C3) and their effect sizes, respectively, and e_i \sim N(0,\phi^2). In practice, if fast_mode == FALSE then tcareg fits this model using the conditional distribution Y|X, which, effectively, integrates over the random Z_{hj}^i. Statistical significance is then calculated using a likelihood ratio test (LRT). Alternatively, if fast_mode == TRUE, the above model is fitted by first learning point estimates for Z_{hj}^i using the tensor function and then assessing statistical significance using T-tests and partial F-tests under a standard regression framework. This alternative provides a substantial boost in speed.
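The standard regression framework used when fast_mode == TRUE can be illustrated for a single feature j with base R. In this sketch the matrix Z is simulated and merely stands in for point estimates of the source-specific values (in TCA these would come from the tensor function); the variable names and effect sizes are made up, and this is not the package's internal code.

```r
set.seed(1)
n <- 200
k <- 3

# Stand-in for point estimates of Z_{hj}^i for one feature j (n by k).
Z <- matrix(rnorm(n * k), n, k, dimnames = list(NULL, paste0("Z", 1:k)))
# One covariate, playing the role of C3.
C3 <- matrix(rnorm(n), n, 1, dimnames = list(NULL, "c1"))
# Simulated outcome: only source 1 has a non-zero effect.
y <- 1 + 0.8 * Z[, "Z1"] + 0.5 * C3[, "c1"] + rnorm(n)

dat <- data.frame(y = y, Z, C3)

# Alternative model: all k sources plus the covariate.
alt <- lm(y ~ Z1 + Z2 + Z3 + c1, data = dat)
# Null model: covariate only (no source-specific effects).
null <- lm(y ~ c1, data = dat)

# T-tests for each source, conditional on the other sources.
t_tab <- summary(alt)$coefficients[c("Z1", "Z2", "Z3"),
                                   c("t value", "Pr(>|t|)")]

# Partial F-test for a joint effect of all k sources.
f_tab <- anova(null, alt)
```

Repeating this for every feature in X gives one row of statistics per feature, which matches the shape of the stats and pvals values described below.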

Note that the null and alternative models will be set automatically, except when test == 'custom', in which case they will be set according to the user-specified null and alternative hypotheses.

Under the TCA regression model, several statistical tests can be performed by setting the argument test according to one of the following options:

1. If test == 'marginal', tcareg will perform the following for each source l. For each feature j, \beta_{lj} will be estimated and tested for a non-zero effect, while assuming \beta_{hj}=0 for all other sources h\neq l.

2. If test == 'marginal_conditional', tcareg will perform the following for each source l. For each feature j, \beta_{lj} will be estimated and tested for a non-zero effect, while also estimating the effect sizes \beta_{hj} for all other sources h\neq l (thus accounting for covariances between the estimated effects of different sources).

3. If test == 'joint', tcareg will estimate for each feature j the effect sizes of all k sources \beta_{1j},…,\beta_{kj} and then test the set of k estimates of each feature j for a joint effect.

4. If test == 'single_effect', tcareg will estimate for each feature j the effect sizes of all k sources \beta_{1j},…,\beta_{kj}, under the assumption that \beta_{1j} = … = \beta_{kj}, and then test the set of k estimates of each feature j for a joint effect.

5. If test == 'custom', tcareg will estimate for each feature j the effect sizes of a predefined set of sources (defined by a user-specified alternative model) and then test their estimates for a joint effect, while accounting for a nested predefined set of sources (defined by a user-specified null model).
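The five options above differ only in which regression terms appear under the null and alternative models. The following base-R sketch expresses each option as a model comparison for a single feature, using simulated values s1, s2, s3 as stand-ins for the source-specific values of k = 3 sources. It mirrors the fast_mode == TRUE framework (T-tests and partial F-tests) and is an illustration under these assumptions, not the package's implementation.

```r
set.seed(2)
n <- 300
k <- 3

# Stand-ins for source-specific values of one feature (n by k).
Z <- matrix(rnorm(n * k), n, k, dimnames = list(NULL, c("s1", "s2", "s3")))
y <- 0.6 * Z[, "s1"] + rnorm(n)   # only source 1 has an effect
dat <- data.frame(y = y, Z)
# Under beta_1 = ... = beta_k the sources enter only through their sum.
dat$s_sum <- rowSums(Z)

intercept_only <- lm(y ~ 1, data = dat)
full <- lm(y ~ s1 + s2 + s3, data = dat)

# 1. marginal: test source 1 alone, assuming no effect of the other sources.
p_marginal <- summary(lm(y ~ s1, data = dat))$coefficients["s1", "Pr(>|t|)"]

# 2. marginal_conditional: test source 1 while estimating the other effects.
p_marg_cond <- summary(full)$coefficients["s1", "Pr(>|t|)"]

# 3. joint: test all k effects together (k degrees of freedom).
joint_tab <- anova(intercept_only, full)
p_joint <- joint_tab[2, "Pr(>F)"]

# 4. single_effect: one shared effect for all sources (1 degree of freedom).
single_tab <- anova(intercept_only, lm(y ~ s_sum, data = dat))
p_single <- single_tab[2, "Pr(>F)"]

# 5. custom: alternative {s1, s2, s3} against the nested null {s3}.
custom_tab <- anova(lm(y ~ s3, data = dat), full)
p_custom <- custom_tab[2, "Pr(>F)"]
```

For sources 2 and 3, options 1 and 2 would be repeated with s2 and s3 in place of s1, which is why these two options produce results per source.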

Value

A list with the results of applying the TCA regression model to each of the features in X. If test == 'marginal' or (test == 'marginal_conditional' and fast_mode == FALSE) then a list of k such lists of results is returned, one for the results of each source.

phi

An estimate of the standard deviation of the i.i.d. component of variation in the TCA regression model.

beta

A matrix of effect size estimates for the source-specific effects, such that each row corresponds to the estimated effect sizes of one feature in X. The number of columns corresponds to the number of estimated effects (e.g., if test == 'marginal' then beta will include a single column; if test == 'joint' then beta will include k columns).

intercept

An m by 1 matrix of estimates for the intercept term of each feature.

alpha

An m by p3 matrix of effect size estimates for the p3 covariates in C3, such that each row corresponds to the estimated effect sizes of one feature in X.

null_ll

An m by 1 matrix of the log-likelihood of the model under the null hypothesis. Returned only if fast_mode == FALSE.

alternative_ll

An m by 1 matrix of the log-likelihood of the model under the alternative hypothesis.

stats

An m by k matrix of T statistics for each source in each feature in X assuming test == "marginal_conditional" and fast_mode == TRUE; otherwise, an m by 1 matrix of a (partial) F statistic (if fast_mode == TRUE) or a likelihood-ratio test statistic (if fast_mode == FALSE) for each feature in X.

df

The degrees of freedom for deriving p-values.

pvals

An m by k matrix of p-values for each source in each feature in X assuming test == "marginal_conditional" and fast_mode == TRUE; otherwise, an m by 1 matrix of the p-value for each feature in X.

qvals

An m by k matrix of q-values (FDR-adjusted p-values) for each source in each feature in X assuming test == "marginal_conditional" and fast_mode == TRUE; otherwise, an m by 1 matrix of the q-value for each feature in X. Note that if test == "marginal_conditional" and fast_mode == TRUE then q-values are calculated for each source separately.
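Calculating q-values for each source separately amounts to adjusting each column of the p-value matrix on its own. The sketch below uses base R's p.adjust with the Benjamini-Hochberg method; that this is the exact FDR procedure used by the package is an assumption, and the p-values are simulated.

```r
set.seed(3)
m <- 5
k <- 3

# Simulated m by k matrix of p-values (features by sources).
pvals <- matrix(runif(m * k), m, k,
                dimnames = list(paste0("site", 1:m), paste0("source", 1:k)))

# Adjust within each source (column), not across the whole matrix.
qvals <- apply(pvals, 2, p.adjust, method = "BH")
```

Adjusting the pooled vector as.vector(pvals) instead would correct for m * k tests at once and generally gives different q-values.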

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.

Examples

# simulate data with n observations, m features and k sources
n <- 50
m <- 10
k <- 3
p1 <- 1
p2 <- 1
data <- test_data(n, m, k, p1, p2, 0.01)
# fit the TCA model
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)
# simulate an outcome for the n observations
y <- matrix(rexp(n, rate = 0.1), ncol = 1)
rownames(y) <- rownames(data$W)
# marginal conditional test (the default):
res0 <- tcareg(data$X, tca.mdl, y)
# joint test:
res1 <- tcareg(data$X, tca.mdl, y, test = "joint")
# custom test, testing for a joint effect of sources 1,2 while accounting for source 3
res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"),
               alternative_model = c("1", "2", "3"))
# custom test, testing for a joint effect of sources 1,2 assuming no effects under the null
res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL,
               alternative_model = c("1", "2"))


[Package TCA version 1.2.1 Index]