R: Fitting the Unico model

Unico {Unico}

R Documentation

Fitting the Unico model

Description

Fits the Unico model for an input matrix of features by observations, where each observation is assumed to be a mixture of k sources with unique (unobserved) source-specific values in each feature. Specifically, for each feature, it standardizes the data and learns the source-specific means and the full k by k variance-covariance matrix.

Usage

Unico(
  X,
  W,
  C1,
  C2,
  fit_tau = FALSE,
  mean_penalty = 0,
  var_penalty = 0.01,
  covar_penalty = 0.01,
  mean_max_iterations = 2,
  var_max_iterations = 3,
  nloptr_opts_algorithm = "NLOPT_LN_COBYLA",
  max_stds = 2,
  init_weight = "default",
  max_u = 1,
  max_v = 1,
  parallel = TRUE,
  num_cores = NULL,
  log_file = "Unico.log",
  verbose = FALSE,
  debug = FALSE
)

Arguments

`X`	An `m` by `n` matrix of measurements of `m` features for `n` observations. Each column in `X` is assumed to be a mixture of `k` sources. Note that `X` must include row names and column names and that NA values are currently not supported. `X` should not include features that are constant across all observations.
`W`	An `n` by `k` matrix of weights - the weights of `k` sources for each of the `n` mixtures (observations). All the weights must be positive and each row - corresponding to the weights of a single observation - must sum up to 1. Note that `W` must include row names and column names and that NA values are currently not supported.
`C1`	An `n` by `p1` design matrix of covariates that may affect the hidden source-specific values (possibly a different effect size in each source). Note that `C1` must include row names and column names and should not include an intercept term. NA values are currently not supported. Note that each covariate in `C1` results in `k` additional parameters in the model of each feature, therefore, in order to alleviate the possibility of model overfitting, it is advised to be mindful of the balance between the size of `C1` and the sample size in `X`.
`C2`	An `n` by `p2` design matrix of covariates that may affect the mixture (i.e. rather than directly the sources of the mixture; for example, variables that capture biases in the collection of the measurements). Note that `C2` must include row names and column names and should not include an intercept term. NA values are currently not supported.
`fit_tau`	A logical value indicating whether to fit the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model denoted as `\tau`).
`mean_penalty`	A non-negative numeric value indicating the regularization strength on the source-specific mean estimates.
`var_penalty`	A non-negative numeric value indicating the regularization strength on the diagonal entries of the full `k` by `k` variance-covariance matrix.
`covar_penalty`	A non-negative numeric value indicating the regularization strength on the off diagonal entries of the full `k` by `k` variance-covariance matrix.
`mean_max_iterations`	A non-negative numeric value indicating the number of iterative updates performed on the mean estimates.
`var_max_iterations`	A non-negative numeric value indicating the number of iterative updates performed on the variance-covariance matrix.
`nloptr_opts_algorithm`	A string indicating the optimization algorithm to use.
`max_stds`	A non-negative numeric value that defines, for each feature, which samples are treated as outliers: only samples within `max_stds` standard deviations from the mean are used for the moments estimation of that feature.
`init_weight`	A string indicating the initial weights on the samples to start the iterative optimization.
`max_u`	A non-negative numeric value indicating the maximum weight (influence) a sample can have on the mean estimates.
`max_v`	A non-negative numeric value indicating the maximum weight (influence) a sample can have on the variance-covariance estimates.
`parallel`	A logical value indicating whether to use parallel computing (possible when using a multi-core machine).
`num_cores`	A numeric value indicating the number of cores to use (activated only if `parallel == TRUE`). If `num_cores == NULL` then all available cores except for one will be used.
`log_file`	A path to an output log file. Note that if the file `log_file` already exists then logs will be appended to the end of the file. Set `log_file` to `NULL` to prevent output from being saved into a file; note that if `verbose == FALSE` then no output file will be generated regardless of the value of `log_file`.
`verbose`	A logical value indicating whether to print logs.
`debug`	A logical value indicating whether to set the logger to a more detailed debug level; set `debug` to `TRUE` before reporting issues.
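
The requirements listed above for `X`, `W`, `C1` and `C2` can be checked before fitting. The following is a minimal sketch in base R, not part of the Unico package: `check_unico_inputs` and its `tol` argument are hypothetical names introduced here for illustration, and the sketch assumes that `C1` and `C2` each have at least one column.

```r
# Hypothetical helper (not a Unico function): checks the documented input requirements.
check_unico_inputs <- function(X, W, C1, C2, tol = 1e-8) {
  stopifnot(is.matrix(X), is.matrix(W), is.matrix(C1), is.matrix(C2))
  n <- ncol(X)  # X is m features by n observations
  # W, C1 and C2 each have one row per observation
  stopifnot(nrow(W) == n, nrow(C1) == n, nrow(C2) == n)
  # Row names and column names are required
  for (M in list(X, W, C1, C2)) {
    stopifnot(!is.null(rownames(M)), !is.null(colnames(M)))
  }
  # NA values are not supported
  stopifnot(!anyNA(X), !anyNA(W), !anyNA(C1), !anyNA(C2))
  # Weights must be positive and each row must sum to 1 (up to tol)
  stopifnot(all(W > 0), all(abs(rowSums(W) - 1) < tol))
  # X should not include features that are constant across observations
  stopifnot(all(apply(X, 1, function(x) length(unique(x)) > 1)))
  invisible(TRUE)
}

# Toy inputs with made-up names and random values
set.seed(1)
n <- 6; m <- 3; k <- 2
obs <- paste0("s", 1:n)
X  <- matrix(rnorm(m * n), m, n, dimnames = list(paste0("f", 1:m), obs))
W  <- matrix(runif(n * k), n, k, dimnames = list(obs, paste0("src", 1:k)))
W  <- W / rowSums(W)  # normalize each row to sum to 1
C1 <- matrix(rnorm(n), n, 1, dimnames = list(obs, "cov1"))
C2 <- matrix(rnorm(n), n, 1, dimnames = list(obs, "cov2"))
ok <- check_unico_inputs(X, W, C1, C2)
```

The helper stops with an error at the first violated requirement and returns `TRUE` invisibly otherwise.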

Details

Unico assumes the following model:

X_{ij} = w_{i}^T Z_{ij} +(c_i^{(2)})^T \beta_j+ e_{ij}

The mixture value at sample i and feature j, X_{ij}, is modeled as a weighted linear combination, specified by the weights w_i = (w_{i1},...,w_{ik}), of k source-specific levels Z_{ij}=(Z_{ij1},...,Z_{ijk}). In addition, we consider global-level covariates c_i^{(2)} that systematically affect the observed mixture values, with effect sizes \beta_j. e_{ij} denotes the i.i.d. measurement noise with variance \tau across all samples. Weights have to be non-negative and sum up to 1 across all sources for each sample. In practice, we assume that the weights are fixed and estimated by external methods.
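
As a small worked instance of the mixture equation, the sketch below evaluates X_{ij} for a single sample and feature in base R. All numbers and variable names are illustrative assumptions, not values from the package, and the noise term is set to zero so the arithmetic can be followed by hand.

```r
# Illustrative values only (not taken from the package or any data set)
w_i    <- c(0.5, 0.3, 0.2)  # weights of k = 3 sources; non-negative, sum to 1
Z_ij   <- c(1, 2, 3)        # hidden source-specific levels for sample i, feature j
c2_i   <- c(1)              # one global covariate for sample i
beta_j <- c(0.5)            # effect size of that covariate on feature j
e_ij   <- 0                 # measurement noise, set to zero for clarity

# X_ij = w_i^T Z_ij + (c_i^(2))^T beta_j + e_ij
X_ij <- sum(w_i * Z_ij) + sum(c2_i * beta_j) + e_ij
print(X_ij)  # 0.5*1 + 0.3*2 + 0.2*3 + 1*0.5 = 2.2
```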

Source-specific profiles are further modeled as:

Z_{ijh} = \mu_{jh} + (c_i^{(1)})^T \gamma_{jh} + \epsilon_{ijh}

\mu_{jh} denotes the population level mean of feature j at source h. We also consider covariates c_i^{(1)} that systematically affect the source-specific values and their effect sizes \gamma_{jh} on each source. Finally, we actively model the k by k covariance structure of a given feature j across all k sources Var[\vec{\epsilon_{ij}}] = \Sigma_{j} \in \mathbf{R}^{k \times k}.
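
To make the source-level model concrete, the following base R sketch draws Z_{ij} for one feature j from chosen values of \mu_j, \gamma_j and \Sigma_j. It is a simulation under stated assumptions rather than the package's own code: the parameter values are made up, and the Gaussian noise is an assumption of this sketch that the text above does not state.

```r
set.seed(42)
n <- 5000; k <- 3; p1 <- 1

mu    <- c(1, 2, 3)                        # mu_{jh}: mean of feature j in each source
gamma <- matrix(c(0.5, 0, -0.5), p1, k)    # gamma_{jh}: one column per source
Sigma <- matrix(c(1.0, 0.3, 0.0,
                  0.3, 1.0, 0.2,
                  0.0, 0.2, 1.0), k, k)    # Sigma_j: k by k covariance across sources

C1 <- matrix(rnorm(n * p1), n, p1)         # source-specific covariates c_i^(1)

# chol() returns the upper-triangular factor R with t(R) %*% R == Sigma,
# so rows of (standard normal draws) %*% R have covariance Sigma.
eps <- matrix(rnorm(n * k), n, k) %*% chol(Sigma)

# Z_{ijh} = mu_{jh} + (c_i^(1))^T gamma_{jh} + eps_{ijh}, one row per sample
Z <- matrix(mu, n, k, byrow = TRUE) + C1 %*% gamma + eps

# The sample covariance of the noise should be close to Sigma
round(cov(eps), 2)
```

With a few thousand samples, the empirical covariance of `eps` is close to `Sigma`, which is the quantity that `sigmas_hat` estimates per feature.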

Value

A list with the estimated parameters of the model. This list can then be used as the input to other functions such as tensor.

`W`	An `n` by `k` matrix of weights. This is the same as `W` from input.
`C1`	An `n` by `p1` design matrix of source-specific covariates. This is the same as `C1` from input.
`C2`	An `n` by `p2` design matrix of covariates that are not source-specific. This is the same as `C2` from input.
`mus_hat`	An `m` by `k` matrix of estimates for the mean of each source in each feature.
`gammas_hat`	An `m` by `k*p1` matrix of the estimated effects of the `p1` covariates in `C1` on each of the `m` features in `X`, where the first `p1` columns are the source-specific effects of the `p1` covariates on the first source, the following `p1` columns are the source-specific effects on the second source and so on.
`betas_hat`	An `m` by `p2` matrix of the estimated effects of the `p2` covariates in `C2` on the mixture values of each of the `m` features in `X`.
`sigmas_hat`	An `m` by `k` by `k` tensor of estimates for the cross source `k` by `k` variance-covariance matrix in each feature.
`taus_hat`	An `m` by `1` matrix of estimates for the variance of the measurement noise.
`scale.factor`	An `m` by `1` matrix of scaling factors for standardizing each feature.
`config`	A list with the hyper-parameters used for fitting the model and the configurations for the optimization algorithm.
`Us_hat_list`	A list tracking, for each feature, the sample weights used for each iteration of the mean optimization (activated only if `debug == TRUE`).
`Vs_hat_list`	A list tracking, for each feature, the sample weights used for each iteration of the variance-covariance optimization (activated only if `debug == TRUE`).
`Ls_hat_list`	A list tracking, for each feature, the computed estimates of the upper triangular Cholesky decomposition of the variance-covariance matrix at each iteration of the variance-covariance optimization (activated only if `debug == TRUE`).
`sigmas_hat_list`	A list tracking, for each feature, the computed estimates of the variance-covariance matrix at each iteration of the variance-covariance optimization (activated only if `debug == TRUE`).
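
The column layout of `gammas_hat` described above (blocks of `p1` columns, one block per source) can be indexed as in the sketch below. The matrix here is a placeholder filled with integers, not real output, and `source_cols` is a hypothetical helper introduced for illustration.

```r
m <- 2; k <- 3; p1 <- 2

# Placeholder standing in for gammas_hat (m by k*p1); the values are arbitrary
gammas_hat <- matrix(seq_len(m * k * p1), m, k * p1)

# Hypothetical helper: columns holding the p1 covariate effects for source h
source_cols <- function(h, p1) ((h - 1) * p1 + 1):(h * p1)

# Effects of the p1 covariates on the second source, for all m features
gammas_source2 <- gammas_hat[, source_cols(2, p1), drop = FALSE]

# Equivalent view as an m by p1 by k array: gammas_arr[j, q, h] is the effect
# of covariate q on source h in feature j
gammas_arr <- array(gammas_hat, dim = c(m, p1, k))
```

The reshape works because R stores matrices column by column, so consecutive blocks of `p1` columns map onto the third array dimension.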

Examples

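library(Unico)
# Simulate n=100 observations of m=2 features mixed from k=3 sources, with one covariate each in C1 and C2
data = simulate_data(n=100, m=2, k=3, p1=1, p2=1, taus_std=0, log_file=NULL)
res = list()
# Fit the model on a single core, without writing a log file
res$params.hat = Unico(data$X, data$W, data$C1, data$C2, parallel=FALSE, log_file=NULL)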

[Package Unico version 0.1.0 Index]