association_parametric {Unico}

R Documentation

Performs parametric statistical testing

Description

Performs parametric statistical testing (T-test for marginal effects, partial F-test for joint effects) on (1) the marginal effect of each covariate in C1 at the source-specific level, (2) the joint effect across all sources for each covariate in C1, and (3) the non-source-specific effect of each covariate in C2. In the context of bulk genomic data containing a mixture of cell types, these correspond to the marginal effect of each covariate in C1 (potentially including the phenotype of interest) in each cell type, the joint tissue-level effect of each covariate in C1, and the tissue-level effect of each covariate in C2.

Usage

association_parametric(
  X,
  Unico.mdl,
  slot_name = "parametric",
  diag_only = FALSE,
  intercept = TRUE,
  X_max_stds = 2,
  Q_max_stds = Inf,
  XQ_max_stds = Inf,
  parallel = TRUE,
  num_cores = NULL,
  log_file = "Unico.log",
  verbose = FALSE,
  debug = FALSE
)

Arguments

`X`	An `m` by `n` matrix of measurements of `m` features for `n` observations. Each column in `X` is assumed to be a mixture of `k` sources. Note that `X` must include row names and column names and that NA values are currently not supported. `X` should not include features that are constant across all observations. Note that `X` must be the same `X` used to learn `Unico.mdl` (i.e. the original observed 2D mixture used to fit the model).
`Unico.mdl`	The entire set of model parameters estimated by Unico on the 2D mixture matrix (i.e. the list returned by applying function `Unico` to `X`).
`slot_name`	A string indicating the key under which the results are stored in `Unico.mdl`.
`diag_only`	A logical value indicating whether to use only the estimated source-level variances (thus ignoring the estimated covariances) when controlling for heterogeneity in the observed mixture. If set to FALSE, Unico instead estimates the observation- and feature-specific variance in the mixture by leveraging the entire `k` by `k` variance-covariance matrix.
`intercept`	A logical value indicating whether to fit the intercept term when performing the statistical testing.
`X_max_stds`	A non-negative numeric value that determines, for each feature, which observations are treated as outliers based on their observed mixture value. Only samples whose observed mixture value falls within `X_max_stds` standard deviations of the mean are used for the statistical testing of a given feature.
`Q_max_stds`	A non-negative numeric value that determines, for each feature, which observations are treated as outliers based on their estimated mixture variance. Only samples whose estimated mixture variance falls within `Q_max_stds` standard deviations of the mean are used for the statistical testing of a given feature.
`XQ_max_stds`	A non-negative numeric value that determines, for each feature, which observations are treated as outliers based on their weighted mixture value. Only samples whose weighted mixture value falls within `XQ_max_stds` standard deviations of the mean are used for the statistical testing of a given feature.
`parallel`	A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
`num_cores`	A numeric value indicating the number of cores to use (activated only if `parallel == TRUE`). If `num_cores == NULL` then all available cores except for one will be used.
`log_file`	A path to an output log file. Note that if the file `log_file` already exists then logs will be appended to the end of the file. Set `log_file` to `NULL` to prevent output from being saved into a file; note that if `verbose == FALSE` then no output file will be generated regardless of the value of `log_file`.
`verbose`	A logical value indicating whether to print logs.
`debug`	A logical value indicating whether to set the logger to a more detailed debug level; set `debug` to `TRUE` before reporting issues.
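
The three `*_max_stds` arguments describe the same per-feature filtering rule applied to different quantities. The following is a minimal base R sketch of that rule for the observed mixture values, not Unico's internal code: the helper name `within_stds_mask` is hypothetical, and the use of each row's sample mean and sample standard deviation is an assumption.

```r
# Hypothetical helper: for each feature (row), flag the observations whose
# value lies within `max_stds` standard deviations of that row's mean.
within_stds_mask <- function(X, max_stds) {
  row_means <- rowMeans(X)
  row_sds <- apply(X, 1, sd)
  # Center each row; the comparison below recycles row_sds down the rows
  centered <- sweep(X, 1, row_means, "-")
  abs(centered) <= max_stds * row_sds
}

set.seed(1)
X <- matrix(rnorm(2 * 50), nrow = 2,
            dimnames = list(paste0("feature", 1:2), paste0("sample", 1:50)))
X[1, 1] <- 25  # plant an obvious outlier in the first feature

mask <- within_stds_mask(X, max_stds = 2)       # analogous to X_max_stds = 2
mask_inf <- within_stds_mask(X, max_stds = Inf) # Inf keeps every observation
```

Under this reading, a threshold of `Inf` (the default for `Q_max_stds` and `XQ_max_stds`) disables the corresponding filter.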

Details

If we assume that source-specific values Z_{ijh} are normally distributed, under the Unico model, we have the following:

Z_{ijh} \sim \mathcal{N}\left(\mu_{jh} + (c_i^{(1)})^T \gamma_{jh}, \sigma_{jh}^2 \right)

X_{ij} \sim \mathcal{N}\left(\sum_{h=1}^{k} w_{ih} \left(\mu_{jh} + (c_i^{(1)})^T \gamma_{jh}\right) + (c_i^{(2)})^T \beta_j, \text{Sum}\left((w_i w_i^T ) \odot \Sigma_j\right) + \tau_j^2\right)

For a given feature j under test, the above equation corresponds to a heteroskedastic regression problem with X_{ij} as the dependent variable and \{\{w_i\}, \{w_i c_i^{(1)}\}, \{c_i^{(2)}\}\} as the set of independent variables. This view allows us to perform parametric statistical testing (T-test for marginal effects and partial F-test for joint effects) by solving a generalized least squares problem with sample i scaled by the inverse of its estimated standard deviation.
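
The regression view above can be sketched for a single feature in base R. This is an illustration under stated assumptions, not the package's implementation: the data are simulated, the true variance parameters are plugged in where Unico would use its estimates, all variable names are made up for the example, and no separate intercept is fitted because the rows of the simulated `W` sum to one. Passing `weights = 1 / q` to `lm` is equivalent to scaling each sample by the inverse of its standard deviation.

```r
set.seed(42)
n <- 500; k <- 2

# Simulated inputs (all values are illustrative assumptions)
W <- matrix(runif(n * k), n, k)
W <- W / rowSums(W)                    # source weights, each row sums to 1
c1 <- rnorm(n)                         # one source-level covariate (p1 = 1)
c2 <- rnorm(n)                         # one mixture-level covariate (p2 = 1)
mu <- c(1, 3)                          # source means
gamma <- c(2, 0)                       # source-specific effects of c1
beta <- 0.5                            # mixture-level effect of c2
Sigma <- matrix(c(1.0, 0.3, 0.3, 2.0), k, k)  # source covariance matrix
tau <- 0.1                             # i.i.d. noise standard deviation

# Draw source-level values Z_i ~ N(mu + gamma * c1_i, Sigma) and mix them
Z <- matrix(rnorm(n * k), n, k) %*% chol(Sigma)
Z <- Z + matrix(mu, n, k, byrow = TRUE) + outer(c1, gamma)
x <- rowSums(W * Z) + beta * c2 + rnorm(n, sd = tau)

# Per-sample mixture variance: Sum((w_i w_i^T) * Sigma) + tau^2
q <- apply(W, 1, function(w) sum(outer(w, w) * Sigma)) + tau^2

# Design: {w_i}, {w_i c_i^(1)}, {c_i^(2)}
D_w <- W;        colnames(D_w) <- paste0("w", 1:k)
D_wc1 <- W * c1; colnames(D_wc1) <- paste0("w", 1:k, ".c1")
full <- lm(x ~ 0 + D_w + D_wc1 + c2, weights = 1 / q)
reduced <- lm(x ~ 0 + D_w + c2, weights = 1 / q)

marginal <- summary(full)$coefficients   # T-test per coefficient
joint <- anova(reduced, full)            # partial F-test on both w * c1 terms
```

Rows 3 and 4 of `marginal` hold the source-specific effects of `c1`, and the second row of `joint` holds the joint test of those two effects against the reduced model.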

Value

An updated Unico.mdl object with the following list of effect size and p-value estimates stored in an additional key specified by slot_name.

`gammas_hat`	An `m` by `k*p1` matrix of the estimated effects of the `p1` covariates in `C1` on each of the `m` features in `X`, where the first `p1` columns are the source-specific effects of the `p1` covariates on the first source, the following `p1` columns are the source-specific effects on the second source and so on.
`betas_hat`	An `m` by `p2` matrix of the estimated effects of the `p2` covariates in `C2` on the mixture values of each of the `m` features in `X`.
`gammas_hat_pvals`	An `m` by `k*p1` matrix of p-values for the estimates in `gammas_hat` (based on a T-test).
`betas_hat_pvals`	An `m` by `p2` matrix of p-values for the estimates in `betas_hat` (based on a T-test).
`gammas_hat_pvals.joint`	An `m` by `p1` matrix of p-values for the joint effects (i.e. across all `k` sources) of each of the `p1` covariates in `C1` on each of the `m` features in `X` (based on a partial F-test). In other words, these are p-values for the combined statistical effects (across all sources) of each one of the `p1` covariates on each of the `m` features under the Unico model.
`Q`	An `m` by `n` matrix of weights used for controlling the heterogeneity of each observation at each feature (activated only if `debug == TRUE`).
`masks`	An `m` by `n` matrix of logical values indicating whether each observation participated in the statistical testing at each feature (activated only if `debug == TRUE`).
`phi_hat`	An `m` by `k+p1*k+p2` matrix containing all estimated effect sizes (including those on source weights) for each feature (activated only if `debug == TRUE`).
`phi_se`	An `m` by `k+p1*k+p2` matrix containing the estimated standard errors associated with `phi_hat` for each feature (activated only if `debug == TRUE`).
`phi_hat_pvals`	An `m` by `k+p1*k+p2` matrix containing the p-values associated with `phi_hat` for each feature (activated only if `debug == TRUE`).
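
The column layout described for `gammas_hat` and `gammas_hat_pvals` (all `p1` covariates for source 1, then all `p1` covariates for source 2, and so on) can be indexed as sketched below. The helper `source_cols` is hypothetical and the matrix is a mock stand-in; whether the real matrices carry column names that make this indexing unnecessary is not assumed here.

```r
# Hypothetical helper: columns of a k*p1-wide matrix that belong to source h,
# following the layout "p1 columns for source 1, then p1 for source 2, ...".
source_cols <- function(h, p1) {
  ((h - 1) * p1 + 1):(h * p1)
}

m <- 4; k <- 3; p1 <- 2
# Mock stand-in for gammas_hat; real values would come from the fitted model
gammas_hat <- matrix(seq_len(m * k * p1), nrow = m, ncol = k * p1)

# Effects of both covariates on the second source, for all m features
gammas_source2 <- gammas_hat[, source_cols(2, p1), drop = FALSE]

# Effects of the first covariate on every source (one column per source)
gammas_cov1 <- gammas_hat[, seq(1, k * p1, by = p1), drop = FALSE]
```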

Examples

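# Simulate a small dataset: 100 observations, 2 features, 3 sources
data = simulate_data(n=100, m=2, k=3, p1=1, p2=1, taus_std=0, log_file=NULL)
res = list()
# Fit the Unico model to the observed mixture
res$params.hat = Unico(data$X, data$W, data$C1, data$C2, parallel=FALSE, log_file=NULL)
# Run the parametric tests; results are stored under the "parametric" key
res$params.hat = association_parametric(data$X, res$params.hat, parallel=FALSE, log_file=NULL)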
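
As a sketch of how the stored results might be read back, the snippet below repeats the calls above and then inspects the list stored under the default `slot_name`. It assumes the Unico package is installed and does nothing otherwise; the variable names `mdl`, `tests` and `hits` and the 0.05 threshold are illustrative choices, not part of the package.

```r
# Guarded so the snippet is a no-op when the Unico package is not installed
if (requireNamespace("Unico", quietly = TRUE)) {
  library(Unico)
  data <- simulate_data(n = 100, m = 2, k = 3, p1 = 1, p2 = 1,
                        taus_std = 0, log_file = NULL)
  mdl <- Unico(data$X, data$W, data$C1, data$C2,
               parallel = FALSE, log_file = NULL)
  mdl <- association_parametric(data$X, mdl, parallel = FALSE, log_file = NULL)

  tests <- mdl[["parametric"]]              # default slot_name
  print(dim(tests$gammas_hat_pvals))        # documented as m by k*p1
  print(dim(tests$gammas_hat_pvals.joint))  # documented as m by p1

  # Features whose joint effect for the first C1 covariate passes an
  # illustrative 0.05 threshold (no multiple-testing correction applied)
  hits <- which(tests$gammas_hat_pvals.joint[, 1] < 0.05)
}
```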

[Package Unico version 0.1.0 Index]