R: Sample From Posterior Distributions

ptycho {ptycho}

R Documentation

Sample From Posterior Distributions

Description

Generate MCMC samples from posterior distribution. Two interfaces are provided: ptycho generates samples for one design matrix and response matrix while ptycho.all runs in batch an object generated by createData.

Usage

ptycho(X, y, initStates, groups = NULL,
       tau.min = 0.01, tau.max = 10, tau.sd = (tau.max - tau.min)/4,
       doGPrior = TRUE, doDetPrior = FALSE, prob.varadd = 0.5,
       isOmegaFixed = FALSE, omega = NULL, omega.grp = NULL,
       probs.grp = NULL, rho.alpha = 10, rho.lambda = rho.alpha,
       only.means = FALSE, nburn = 0, nthin = 1, nSavePerChain,
       parallel.chains=FALSE, random.seed=NULL)
ptycho.all(data, across=c("none","traits","sites"), doGrpIndicator,
           dir.out, nreplicates=NULL, parallel.replicates=FALSE,
           doSetSeed=TRUE, ncolumns=NULL, ...)

Arguments

`X`	`n`-by-`p` design matrix
`y`	`n`-by-`q` matrix containing response(s)
`initStates`	List containing initial states for chains. Each state is a list with components: `indic.var` `p`-by-`q` logical matrix. If `(j,k)` entry is `TRUE`, then covariate `j` is initially in the model for response `k`. `tau` Scalar `indic.grp` Logical vector of length equal to the number of groups; analogous to `indic.var`; `NULL` to use priors that do not incorporate a second-level indicator variable
`groups`	To combine information across variants, list containing `var2group` Integer vector of length `p`, with entry `j` being the index of the group containing covariate `j` `group2var` List of length `G`, each entry of which is an integer vector containing the indices of the covariates belonging to that group `sizes` Vector of length `G` containing the number of covariates in each group Otherwise, `NULL`.
`tau.min`, `tau.max`	Endpoints of uniform prior distribution on `tau`
`tau.sd`	Standard deviation of the Metropolis-Hastings proposal distribution for `tau`
`doGPrior`	Logical indicating whether to use the g-prior for effect sizes
`doDetPrior`	Unsupported; use default value
`prob.varadd`	If `initStates[[1]]]$indic.grp` is `NULL`, the probability that the Metropolis-Hastings proposal changes one entry of `indic.var` from `FALSE` to `TRUE`. Otherwise, the probability of this event given that the proposal does not change `indic.grp`.
`isOmegaFixed`	Logical indicating whether `omega` is known
`omega`	If `isOmegaFixed` is `TRUE`, a `p`-by-`q` matrix containing the known probabilities. Otherwise, a matrix containing the parameters for the Beta prior distribution on `omega`. Such a matrix has columns “A” and “B”; the number of rows should be: 1 if `q=1` and `initStates[[1]]$indic.grp` is `NULL`, `length(groups$group2var)` if that is nonzero, or `p` otherwise. If `omega` is `NULL` and `isOmegaFixed` is `FALSE`, defaults to uniform priors.
`omega.grp`	If `isOmegaFixed` is `TRUE`, the known probability that entries in `indic.grp` are `TRUE`. Otherwise, a vector with names “A” and “B” containing the parameters for the Beta prior distribution on `omega.grp`. If `NULL`, defaults to uniform priors. Unused if `initStates[[1]]$indic.grp` is `NULL`.
`probs.grp`	Vector containing the probabilities that the Metropolis-Hastings proposal will add, leave unchanged, or remove, respectively, a group. If `NULL`, defaults to `c(0.25,0.5,0.25)`. Unused if `initStates[[1]]$indic.grp` is `NULL`.
`rho.alpha`, `rho.lambda`	Parameters for the Gamma prior distribution on `\rho`, which is the precision of the noise. Here, the Gamma`(\alpha,\lambda)` distribution has density function proportional to `x^\alpha e^{-\lambda x}`.
`only.means`	If logical, specifies whether to return samples or the running means of the samples. Can also be a vector containing the iterations (after the burn-in interval) at which to save the means.
`nburn`	Number of MCMC samples to make before starting to save samples or to compute means
`nthin`	Interval between saved samples; default value 1 saves all samples. Unused if `only.means` is `TRUE` or a vector.
`nSavePerChain`	If `only.means` is `FALSE`, number of MCMC samples to return from each chain, which means a total of `nthin * nSavePerChain + nburn` samples are drawn per chain. If `only.means` is `TRUE`, then `nSavePerChain + nburn` samples are drawn, and only the averages of the last `nSavePerChain` samples are returned. Unused if `only.means` is not a logical.
`parallel.chains`	Logical indicating whether to run chains in parallel; see Details.
`random.seed`	Random seed to pass to chain iterator; if `NULL`, the random seed is not set. See Details.
`data`	Data in format output by `createData`
`across`	Whether to combine information across traits, sites, or neither
`doGrpIndicator`	Whether to use priors that incorporate `indic.grp`
`dir.out`	Directory to which to `save` samples or means
`nreplicates`	Vector of replicates to run; if `NULL`, all will be run
`parallel.replicates`	Logical indicating whether to run replicates in parallel; see Details.
`doSetSeed`	If `TRUE`, call `set.seed(n.repl)` before running samples.
`ncolumns`	Scalar. If `across` is “none” or “sites”, each of the first `ncolumns` of `repl$y` will be used in turn, running all columns by default. Ignored if `across` is “sites”.
`...`	Additional arguments passed to `ptycho`

Details

These functions run MCMC sampling from the posterior of the linear regression models using hierarchical priors described in Stell and Sabatti (2015). The function ptycho.all is a wrapper of ptycho to simplify running the simulation experiments over many replicates. These functions determine which priors to use as follows:

Standard spike and slab priors that do not combine information (basic)
For ptycho.all, argument across is “none” and doGrpIndicator is FALSE.
For ptycho, argument y has one column, groups is NULL, and indic.grp is NULL or missing in each entry of initStates.
Combine information across traits (Across Traits)
For ptycho.all, argument across is “traits” and doGrpIndicator is TRUE.
For ptycho, argument y has p > 1 columns, groups is NULL, and indic.grp is a logical vector of length p in each entry of initStates.
Combine information across variants (Across Sites)
For ptycho.all, argument across is “sites” and doGrpIndicator is TRUE.
For ptycho, argument y has one column, groups specifies how to combine information, and indic.grp in each entry of initStates is a logical vector of the same length as groups$group2var.
Combine information across traits incorrectly (Unadjusted)
For ptycho.all, argument across is “traits” and doGrpIndicator is FALSE.
For ptycho, argument y has p > 1 columns, groups is NULL, and indic.grp is NULL in each entry of initStates.
This prior does not properly correct for multiple hypothesis testing and is only included because it is needed to reproduce results in Stell and Sabatti (2015).

Combining information across both phenotypes and variants is planned for a future release. These functions perform some checks for compatibility of X, y, groups, and initStates; but invalid input could lead to unpredictable behavior. Singular X can result in an error; even strongly correlated covariates can cause difficulties as described by Stell and Sabatti (2015).

The simplest way to run the simulations in Stell and Sabatti (2015) is, for example,

  data <- createPubData("pleiotropy")
  ptycho.all(data=data, across="traits", doGrpIndicator=TRUE,
             dir.out="/path/to/output/dir/",
             only.means=50000*(1:10), nburn=10000)
  ptycho.all(data=data, across="sites", doGrpIndicator=TRUE,
             dir.out="/path/to/another/dir/",
             groups=createGroupsSim(G=10, ncol(data$X)),
             only.means=50000*(1:10), nburn=10000)

With these calls, the replicates run sequentially and so do the chains of the MCMC sampler; the results will be reproducible because the random seed is set for each replicate.

Parallelization is implemented via the foreach package. The user must not only have it installed but also an appropriate parallel backend, which must be registered. To run chains in parallel using the doMC, for example,

  data(ptychoIn)
  G <- 2; p <- ncol(ptychoIn$X)
  groups <- createGroupsSim(G, p)
  state <- list(list(indic.grp=rep(FALSE,G),
                     indic.var=matrix(FALSE,nrow=p,ncol=1), tau=1),
                list(indic.grp=rep(TRUE,G),
                     indic.var=matrix(TRUE,nrow=p,ncol=1), tau=1))
  require(doMC)
  registerDoMC(length(state))
  ptychoOut <- ptycho(X=ptychoIn$X, y=ptychoIn$replicates[[1]]$y,
                      groups=groups, initStates=state,
                      only.means=100*seq_len(5), parallel.chains=TRUE)

The results would not be reproducible, however, even if one set the random seed before calling ptycho. For reproducible results, pass the random seed in the call to ptycho, which requires that the doRNG package is also installed. Running the chains in parallel when calling ptycho.all also requires the option parallel.chains=TRUE, which uses doRNG unless doSetSeed=FALSE. By default, one of the chains starts with all variants in the model, so that chain takes much longer to run than do the other chains. Consequently, when running multiple replicates via ptycho.all, much greater time savings can be achieved by running the replicates in parallel with, for example,

  data <- createPubData("pleiotropy")
  require(doMC)
  registerDoMC(8)
  ptycho.all(data=data, across="traits", doGrpIndicator=TRUE,
             dir.out="/path/to/output/dir/",
             only.means=50000*(1:10), nburn=10000)

In this case, the default behavior of reproducible results does not require doRNG because the seed is set after each parallel worker is created.

We conclude this description with a discussion of the running time of the MCMC sampler. Our actual data has 5335 subjects, 764 variants and three traits. An mcmc.list containing 50,000 samples for each of four chains can take about 5~GB. Running chains in parallel, it takes less than an hour (on a Linux computer with 2.6 GHz processors) to perform 510,000 samples per chain. The run time depends primarily on the number of entries that are TRUE in the sampled indic.var matrices; increasing this will increase run times. A chain that initially has all entries of indic.var set to TRUE will take longer than one where the model is initially empty. Priors that inflate the posterior expectation of indic.var[j,k] (such as combining information across responses without using indic.grp) will also take longer.

Value

The results of ptycho.all are written to files by save. For priors that use only one response, the output for replicate r and column c will be written to ‘rpl<r>col<c>.Rdata’ in the directory specified by dir.out. For priors that use multiple responses, ptycho is called only once for each replicate, and the file name will be ‘rpl<r>col1.Rdata’. The object in each such file has the name smpl and is the value of a call to ptycho. The format of these objects depends upon the argument only.means. In all cases, however, it has attribute params set to a list containing most of the arguments in the call to ptycho.

If only.means is FALSE, then ptycho returns an mcmc.list whose length is the same as the length of initStates. Each entry in this list is an mcmc object with nSavePerChain rows and a column for each entry of indic.var and indic.grp plus a column for tau.

Otherwise, ptycho returns an object of class ptycho, which is actually a matrix. The matrix has a column for each sampled indicator variable, for tau and its square (so that its variance can be computed), and for the chain and iteration numbers. If only.means is TRUE, then each row contains the means of the samples in one chain and there will be length(initStates) * nSavePerChain rows. If only.means is a vector, then there will be length(initStates) * length(only.means) rows.

Author(s)

Laurel Stell and Chiara Sabatti
Maintainer: Laurel Stell <lstell@stanford.edu>

References

Stell, L. and Sabatti, C. (2015) Genetic variant selection: learning across traits and sites, arXiv:1504.00946.

Examples

data(tinysim)
# Use replicate 4.
X <- tinysim$X; p <- ncol(X); nr <- 4
# COMBINE INFORMATION ACROSS RESPONSES
Y <- tinysim$replicates[[nr]]$y; q <- ncol(Y)
# Run 2 chains.
state <- list(list(indic.grp=rep(FALSE,p),
                   indic.var=matrix(FALSE,nrow=p,ncol=q), tau=1),
              list(indic.grp=rep(TRUE,p),
                   indic.var=matrix(TRUE,nrow=p,ncol=q), tau=1))
# In each chain, discard first 10 burn-in samples, then generate
# 100 samples and save running means after every 20 samples.
smpl.ph <- ptycho(X=X, y=Y, initStates=state, only.means=20*(1:5),
                  nburn=10)
# COMBINE INFORMATION ACROSS VARIANTS
# Use two groups of variants.
G <- 2; groups <- createGroupsSim(G, p)
# Run 2 chains.
state <- list(list(indic.grp=rep(FALSE,G),
                   indic.var=matrix(FALSE,nrow=p,ncol=1), tau=1),
              list(indic.grp=rep(TRUE,G),
                   indic.var=matrix(TRUE,nrow=p,ncol=1), tau=1))
# Use response 3.
y <- tinysim$replicates[[nr]]$y[,3,drop=FALSE]
smpl.var <- ptycho(X=X, y=y, groups=groups, initStates=state,
                   only.means=c(20*(1:5)), nburn=10, nthin=1)

[Package ptycho version 1.1-4 Index]