R: Fitting a TCA regression model

tcareg {TCA}

R Documentation

Fitting a TCA regression model

Description

TCA regression tests, under several types of statistical tests, the effects of source-specific values on an outcome of interest (or on mediating components thereof). For example, for tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tcareg tests for cell-type-specific effects of methylation on outcomes of interest (or on mediating components thereof).

Usage

tcareg(
  X,
  tca.mdl,
  y,
  C3 = NULL,
  test = "marginal_conditional",
  null_model = NULL,
  alternative_model = NULL,
  save_results = FALSE,
  fast_mode = TRUE,
  output = "TCA",
  sort_results = FALSE,
  parallel = FALSE,
  num_cores = NULL,
  log_file = "TCA.log",
  features_metadata = NULL,
  debug = FALSE,
  verbose = TRUE
)

Arguments

`X`	An `m` by `n` matrix of measurements of `m` features for `n` observations. Each column in `X` is assumed to be a mixture of `k` sources. Note that `X` must include row names and column names and that NA values are currently not supported. `X` should not include features that are constant across all observations.
`tca.mdl`	The value returned by applying tca to `X`.
`y`	An `n` by 1 matrix of an outcome of interest for each of the `n` observations in `X`. Note that `y` must include row names and column names and that NA values are currently not supported.
`C3`	An `n` by `p3` design matrix of covariates that may affect `y`. Note that `C3` must include row names and column names and should not include an intercept term. NA values are currently not supported.
`test`	A character vector with the type of test to perform on each of the features in `X`; one of the following options: `'marginal'`, `'marginal_conditional'`, `'joint'`, `'single_effect'`, or `'custom'`. Setting `'marginal'` or `'marginal_conditional'` corresponds to testing each feature in `X` for a statistical relation between `y` and each of the `k` sources separately; for any particular source under test, the `'marginal_conditional'` option further accounts for possible effects of the remaining `k-1` sources (`'marginal'` will therefore tend to be more powerful in discovering truly related features, but also more prone to attributing the relation to the wrong sources if sources are highly correlated). Setting `'joint'` or `'single_effect'` corresponds to testing each feature for an overall statistical relation with `y`, while modeling source-specific effects; the latter option further assumes that the source-specific effects are the same within each feature (`'single_effect'` uses only one degree of freedom and will therefore be more powerful when the assumption of a single effect within a feature holds). Finally, `'custom'` corresponds to testing each feature in `X` for a statistical relation with `y` under a user-specified alternative model with respect to a user-specified null model; for example, to test for a relation with the combined (potentially different) effects of sources 1 and 2 while accounting for the (potentially different) effects of sources 3 and 4, set the null model to be sources 3, 4 and the alternative model to be sources 1, 2, 3, 4. To indicate that the null model assumes no effect for any of the sources, set `null_model` to `NULL`.
`null_model`	A vector with a subset of the names of the sources in `tca.mdl$W` to be used as a null model (activated only if `test == 'custom'`). Note that the null model must be nested within the alternative model; set `null_model` to be `NULL` for indicating no effect for any of the sources under the null model.
`alternative_model`	A vector with a subset (or all) of the names of the sources in `tca.mdl$W` to be used as an alternative model (activated only if `test == 'custom'`).
`save_results`	A logical value indicating whether to save the returned results in a file. If `test == 'marginal'` or (`fast_mode == TRUE` and `test == 'marginal_conditional'`) then `k` files will be saved (one for the results of each source).
`fast_mode`	A logical value indicating whether to use a fast version of TCA regression, in which source-specific values are first estimated using the `tensor` function and then tested under a standard regression framework (see more details below).
`output`	Prefix for output files (activated only if `save_results == TRUE`).
`sort_results`	A logical value indicating whether to sort the results by their p-value (i.e. features with lower p-value will appear first in the results). This option is not available if `fast_mode == TRUE` and `test == "marginal_conditional"`.
`parallel`	A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
`num_cores`	A numeric value indicating the number of cores to use (activated only if `parallel == TRUE`). If `num_cores == NULL` then all available cores except for one will be used.
`log_file`	A path to an output log file. Note that if the file `log_file` already exists then logs will be appended to the end of the file. Set `log_file` to `NULL` to prevent output from being saved into a file; note that if `verbose == FALSE` then no output file will be generated regardless of the value of `log_file`.
`features_metadata`	A path to a csv file containing metadata about the features in `X` that will be added to the output files (activated only if `save_results == TRUE`). Each row in the metadata file should correspond to one feature in `X` (with the row name being the feature identifier, as it appears in the rows of `X`) and each column should correspond to one metadata descriptor (with an appropriate column name). Features that do not exist in `X` will be ignored and features in `X` with missing metadata information will show missing values.
`debug`	A logical value indicating whether to set the logger to a more detailed debug level; set `debug` to `TRUE` before reporting issues.
`verbose`	A logical value indicating whether to print logs.
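
The input requirements listed above can be checked before calling tcareg. The following is a minimal sketch, not part of the TCA package: `check_tcareg_inputs` is a hypothetical helper, and treating any constant column of `C3` as an intercept term is an assumption made for illustration.

```r
# Hypothetical helper (not part of the TCA package): checks the documented
# input requirements for X, y and C3 before calling tcareg.
check_tcareg_inputs <- function(X, y, C3 = NULL) {
  has_names <- function(M) !is.null(rownames(M)) && !is.null(colnames(M))
  stopifnot(
    is.matrix(X), has_names(X), !anyNA(X),
    is.matrix(y), ncol(y) == 1, has_names(y), !anyNA(y),
    nrow(y) == ncol(X)
  )
  # Features (rows of X) must not be constant across observations.
  constant <- apply(X, 1, function(r) length(unique(r)) == 1)
  if (any(constant)) {
    stop("constant features: ", paste(rownames(X)[constant], collapse = ", "))
  }
  if (!is.null(C3)) {
    stopifnot(is.matrix(C3), has_names(C3), !anyNA(C3), nrow(C3) == ncol(X))
    # Assumption: a constant column in C3 is treated as an intercept term.
    if (any(apply(C3, 2, function(col) length(unique(col)) == 1))) {
      stop("C3 appears to contain an intercept (constant) column")
    }
  }
  invisible(TRUE)
}

# Small simulated inputs: 4 features by 5 observations.
set.seed(1)
X <- matrix(runif(20), nrow = 4,
            dimnames = list(paste0("f", 1:4), paste0("s", 1:5)))
y <- matrix(rnorm(5), ncol = 1, dimnames = list(colnames(X), "outcome"))
check_tcareg_inputs(X, y)
```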

Details

TCA models Z_{hj}^i as the source-specific value of observation i in feature j coming from source h (see tca for more details). A TCA regression model tests an outcome Y for a linear statistical relation with the source-specific values of a feature j by assuming:

Y_i = \alpha_{j,0} + \sum_{h=1}^k\beta_{hj} Z_{hj}^i + c_i^{(3)}\alpha_{j} + e_i

where \alpha_{j,0} is an intercept term, \beta_{hj} is the effect of source h, c_i^{(3)} and \alpha_j correspond to the p_3 covariate values of observation i (i.e. a row vector from C3) and their effect sizes, respectively, and e_i \sim N(0,\phi^2). In practice, if fast_mode == FALSE then tcareg fits this model using the conditional distribution Y|X, which effectively integrates over the random Z_{hj}^i. Statistical significance is then calculated using a likelihood ratio test (LRT). Alternatively, if fast_mode == TRUE, the above model is fitted by first learning point estimates for Z_{hj}^i using the tensor function and then assessing statistical significance using T-tests and partial F-tests under a standard regression framework. This alternative is substantially faster.
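
The two-step idea behind the fast mode can be illustrated with base R alone. The sketch below is not the package's implementation: `Z_hat` is a simulated, hypothetical n by k matrix standing in for point estimates of the source-specific values of a single feature (in TCA these would come from the tensor function), and `c3` is a single simulated covariate. The regression model above is then fitted with `lm`, with T-tests read from the coefficient table and a partial F-test obtained from `anova`.

```r
set.seed(42)
n <- 200
k <- 3
# Hypothetical stand-in for point estimates of Z_{hj}^i for one feature j.
Z_hat <- matrix(rnorm(n * k), nrow = n,
                dimnames = list(NULL, paste0("Z", 1:k)))
c3 <- rnorm(n)  # a single covariate (p3 = 1)
# Simulated outcome in which only source 1 has an effect.
y <- 1 + 0.8 * Z_hat[, "Z1"] + 0.5 * c3 + rnorm(n)
d <- data.frame(y = y, Z_hat, c3 = c3)

full <- lm(y ~ Z1 + Z2 + Z3 + c3, data = d)  # all sources plus covariate
null <- lm(y ~ c3, data = d)                 # covariate only

# T-tests: one per source, each adjusted for the other sources and c3.
t_table <- summary(full)$coefficients[c("Z1", "Z2", "Z3"),
                                      c("t value", "Pr(>|t|)")]
# Partial F-test: do the k sources jointly improve on the null model?
f_test <- anova(null, full)
print(t_table)
print(f_test)
```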

Note that the null and alternative models will be set automatically, except when test == 'custom', in which case they will be set according to the user-specified null and alternative hypotheses.

Under the TCA regression model, several statistical tests can be performed by setting the argument test according to one of the following options:

1. If test == 'marginal', tcareg will perform the following for each source l. For each feature j, \beta_{lj} will be estimated and tested for a non-zero effect, while assuming \beta_{hj}=0 for all other sources h\neq l.

2. If test == 'marginal_conditional', tcareg will perform the following for each source l. For each feature j, \beta_{lj} will be estimated and tested for a non-zero effect, while also estimating the effect sizes \beta_{hj} for all other sources h\neq l (thus accounting for covariances between the estimated effects of different sources).

3. If test == 'joint', tcareg will estimate for each feature j the effect sizes of all k sources \beta_{1j},…,\beta_{kj} and then test the set of k estimates of each feature j for a joint effect.

4. If test == 'single_effect', tcareg will estimate for each feature j the effect sizes of all k sources \beta_{1j},…,\beta_{kj}, under the assumption that \beta_{1j} = … = \beta_{kj}, and then test the set of k estimates of each feature j for a joint effect.

5. If test == 'custom', tcareg will estimate for each feature j the effect sizes of a predefined set of sources (defined by a user-specified alternative model) and then test their estimates for a joint effect, while accounting for a nested predefined set of sources (defined by a user-specified null model).
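
In a standard regression setting, the five options above can be read as different choices of null and alternative models. The sketch below is illustrative only and does not call the TCA package: `Z1`, `Z2`, `Z3` are simulated stand-ins for the source-specific values of one feature with k = 3, `nested_p` is a hypothetical helper, and the marginal options are shown for source 1 only.

```r
set.seed(7)
n <- 300
d <- data.frame(Z1 = rnorm(n), Z2 = rnorm(n), Z3 = rnorm(n))
d$y <- 0.6 * d$Z1 + 0.6 * d$Z2 + rnorm(n)  # sources 1 and 2 carry an effect

# Hypothetical helper: p-value of a nested comparison via a partial F-test.
nested_p <- function(null_f, alt_f) {
  anova(lm(null_f, data = d), lm(alt_f, data = d))[2, "Pr(>F)"]
}

p <- c(
  # marginal, source 1: the other sources are assumed to have no effect
  marginal_1 = nested_p(y ~ 1, y ~ Z1),
  # marginal_conditional, source 1: the other sources appear in both models
  marginal_conditional_1 = nested_p(y ~ Z2 + Z3, y ~ Z1 + Z2 + Z3),
  # joint: all k effects tested together (k degrees of freedom)
  joint = nested_p(y ~ 1, y ~ Z1 + Z2 + Z3),
  # single_effect: one shared coefficient for all sources (1 degree of freedom)
  single_effect = nested_p(y ~ 1, y ~ I(Z1 + Z2 + Z3)),
  # custom: sources 1 and 2 tested while accounting for source 3
  custom = nested_p(y ~ Z3, y ~ Z1 + Z2 + Z3)
)
print(signif(p, 3))
```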

Value

A list with the results of applying the TCA regression model to each of the features in X. If test == 'marginal' or (test == 'marginal_conditional' and fast_mode == FALSE) then a list of k such lists of results is returned, one for the results of each source.

`phi`	An estimate of the standard deviation of the i.i.d. component of variation in the TCA regression model.
`beta`	A matrix of effect size estimates for the source-specific effects, such that each row corresponds to the estimated effect sizes of one feature in `X`. The number of columns corresponds to the number of estimated effects (e.g., if `test` is set to `marginal` then `beta` will include a single column, if `test` is set to `joint` then `beta` will include `k` columns etc.).
`intercept`	An `m` by `1` matrix of estimates for the intercept term of each feature.
`alpha`	An `m` by `p3` matrix of effect size estimates for the `p3` covariates in `C3`, such that each row corresponds to the estimated effect sizes of one feature in `X`.
`null_ll`	An `m` by `1` matrix of the log-likelihood of the model under the null hypothesis. Returned only if `fast_mode == FALSE`.
`alternative_ll`	An `m` by `1` matrix of the log-likelihood of the model under the alternative hypothesis.
`stats`	An `m` by `k` matrix of T statistics for each source in each feature in `X` assuming `test == "marginal_conditional"` and `fast_mode == TRUE`; otherwise, an `m` by `1` matrix of a (partial) F statistic (if `fast_mode == TRUE`) or a likelihood-ratio test statistic (if `fast_mode == FALSE`) for each feature in `X`.
`df`	The degrees of freedom for deriving p-values.
`pvals`	An `m` by `k` matrix of p-values for each source in each feature in `X` assuming `test == "marginal_conditional"` and `fast_mode == TRUE`; otherwise, an `m` by `1` matrix of the p-value for each feature in `X`.
`qvals`	An `m` by `k` matrix of q-values (FDR-adjusted p-values) for each source in each feature in `X` assuming `test == "marginal_conditional"` and `fast_mode == TRUE`; otherwise, an `m` by `1` matrix of the q-value for each feature in `X`. Note that if `test == "marginal_conditional"` and `fast_mode == TRUE` then q-values are calculated for each source separately.
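
Calculating q-values for each source separately amounts to adjusting each column of an `m` by `k` matrix of p-values on its own. The sketch below uses a simulated, hypothetical p-value matrix and base R's `p.adjust`; the Benjamini-Hochberg method is an assumption here, since the text above only states that q-values are FDR-adjusted p-values.

```r
set.seed(3)
m <- 8
k <- 3
# Hypothetical m by k matrix of p-values (features by sources).
pvals <- matrix(runif(m * k), nrow = m,
                dimnames = list(paste0("feature", 1:m), paste0("source", 1:k)))
# Adjust each column (source) separately. "BH" is an assumption; the
# documentation only says that q-values are FDR-adjusted p-values.
qvals <- apply(pvals, 2, p.adjust, method = "BH")
print(round(qvals, 3))
```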

References

Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.

Examples

# simulate data for n observations, m features and k sources
n <- 50
m <- 10
k <- 3
p1 <- 1
p2 <- 1
data <- test_data(n, m, k, p1, p2, 0.01)
# fit the TCA model that tcareg takes as input
tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)
# simulate an outcome; y is documented as requiring row and column names
y <- matrix(rexp(n, rate = 0.1), ncol = 1)
rownames(y) <- rownames(data$W)
colnames(y) <- "y"
# marginal conditional test (the default):
res0 <- tcareg(data$X, tca.mdl, y)
# joint test:
res1 <- tcareg(data$X, tca.mdl, y, test = "joint")
# custom test, testing for a joint effect of sources 1,2 while accounting for source 3
res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"),
               alternative_model = c("1", "2", "3"))
# custom test, testing for a joint effect of sources 1,2 assuming no effects under the null
res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL,
               alternative_model = c("1", "2"))

[Package TCA version 1.2.1 Index]