bgsmtr {bgsmtr}R Documentation

Bayesian Group Sparse Multi-Task Regression for Imaging Genetics

Description

Runs the the Gibbs sampling algorithm to fit a Bayesian group sparse multi-task regression model. Tuning parameters can be chosen using either the MCMC samples and the WAIC (multiple runs) or using an approximation to the posterior mode and five-fold cross-validation (single run).

Usage

bgsmtr(
  X,
  Y,
  group,
  tuning = "CV.mode",
  lam_1_fixed = NULL,
  lam_2_fixed = NULL,
  iter_num = 10000,
  burn_in = 5001
)

Arguments

X

A d-by-n matrix; d is the number of SNPs and n is the number of subjects. Each row of X should correspond to a particular SNP and each column should correspond to a particular subject. Each element of X should give the number of minor alleles for the corresponding SNP and subject. The function will center each row of X to have mean zero prior to running the Gibbs sampling algorithm.

Y

A c-by-n matrix; c is the number of phenotypes (brain imaging measures) and n is the number of subjects. Each row of Y should correspond to a particular phenotype and each column should correspond to a particular subject. Each element of Y should give the measured value for the corresponding phentoype and subject. The function will center and scale each row of Y to have mean zero and unit variance prior to running the Gibbs sampling algorithm.

group

A vector of length d; d is the number of SNPs. Each element of this vector is a string representing a gene or group label associated with each SNP. The SNPs represented by this vector should be ordered according to the rows of X.

tuning

A string, either 'WAIC' or 'CV.mode'. If 'WAIC', the Gibbs sampler is run with fixed values of the tuning parameters specified by the arguments lam_1_fixed and lam_2_fixed and the WAIC is computed based on the sampling output. This can then be used to choose optimal values for lam_1_fixed and lam_2_fixed based on multiple runs with each run using different values of lam_1_fixed and lam_2_fixed. This option is best suited for either comparing a small set of tuning parameter values or for computation on a high performance computing cluster where different nodes can be used to run the function with different values of lam_1_fixed and lam_2_fixed. Posterior inference is then based on the run that produces the lowest value for the WAIC. The option 'CV.mode', which is the default, is best suited for computation using just a single processor. In this case the tuning parameters are chosen based on five-fold cross-validation over a grid of possible values with out-of-sample prediction based on an approximate posterior mode. The Gibbs sampler is then run using the chosen values of the tuning parameters. When tuning = 'CV.mode' the values for the arguments lam_1_fixed and lam_2_fixed are not required.

lam_1_fixed

Only required if tuning = 'WAIC'. A positive number giving the value for the gene-specific tuning parameter. Larger values lead to a larger degree of shrinkage to zero of estimated regression coefficients at the gene level (across all SNPs and phenotypes).

lam_2_fixed

Only required if tuning = 'WAIC'. A positive number giving the value for the SNP-specific tuning parameter. Larger values lead to a larger degree of shrinkage to zero of estimated regression coefficients at the SNP level (across all phenotypes).

iter_num

Positive integer representing the total number of iterations to run the Gibbs sampler. Defaults to 10,000.

burn_in

Nonnegative integer representing the number of MCMC samples to discard as burn-in. Defaults to 5001.

Value

A list with the elements

WAIC

If tuning = 'WAIC' this is the value of the WAIC computed from the MCMC output. If tuning = 'CV.mode' this component is excluded.

Gibbs_setup

A list providing values for the input parameters of the function.

Gibbs_W_summaries

A list with five components, each component being a d-by-c matrix giving some posterior summary of the regression parameter matrix W, where the ij-th element of W represents the association between the i-th SNP and j-th phenotype.

-Gibbs_W_summaries$W_post_mean is a d-by-c matrix giving the posterior mean of W.

-Gibbs_W_summaries$W_post_mode is a d-by-c matrix giving the posterior mode of W.

-Gibbs_W_summaries$W_post_sd is a d-by-c matrix giving the posterior standard deviation for each element of W.

-Gibbs_W_summaries$W_2.5_quantile is a d-by-c matrix giving the posterior 2.5 percent quantile for each element of W.

-Gibbs_W_summaries$W_97.5_quantile is a d-by-c matrix giving the posterior 97.5 percent quantile for each element of W.'

Author(s)

Farouk S. Nathoo, nathoo@uvic.ca

Keelin Greenlaw, keelingreenlaw@gmail.com

Mary Lesperance, mlespera@uvic.ca

References

Greenlaw, Keelin, Elena Szefer, Jinko Graham, Mary Lesperance, and Farouk S. Nathoo. "A Bayesian Group Sparse Multi-Task Regression Model for Imaging Genetics." arXiv preprint arXiv:1605.02234 (2016).

Nathoo, Farouk S., Keelin Greenlaw, and Mary Lesperance. "Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics." arXiv preprint arXiv:1603.08163 (2016).

Examples

data(bgsmtr_example_data)
names(bgsmtr_example_data)


## Not run: 
## test run the sampler for 100 iterations with fixed tunning parameters and compute WAIC
## we recomend at least 5,000 iterations for actual use
fit = bgsmtr(X = bgsmtr_example_data$SNP_data, Y = bgsmtr_example_data$BrainMeasures,
group = bgsmtr_example_data$SNP_groups, tuning = 'WAIC', lam_1_fixed = 2, lam_2_fixed = 2,
iter_num = 100, burn_in = 50)
## posterior mean for regression parameter relating 100th SNP to 14th phenotype
fit$Gibbs_W_summaries$W_post_mean[100,14]
## posterior mode for regression parameter relating 100th SNP to 14th phenotype
fit$Gibbs_W_summaries$W_post_mode[100,14]
## posterior standard deviation for regression parameter relating 100th SNP to 14th phenotype
fit$Gibbs_W_summaries$W_post_sd[100,14]
## 95
c(fit$Gibbs_W_summaries$W_2.5_quantile[100,14],fit$Gibbs_W_summaries$W_97.5_quantile[100,14])

## End(Not run)

## Not run: 
## run the sampler for 10,000 iterations with tuning parameters set using cross-validation
## On a standard computer with a small numer of cores this is the recomended option
fit = bgsmtr(X = bgsmtr_example_data$SNP_data, Y = bgsmtr_example_data$BrainMeasures,
group = bgsmtr_example_data$SNP_groups, tuning = 'CV.mode',iter_num = 10000, burn_in = 5000)

## End(Not run)


[Package bgsmtr version 0.7 Index]