R: Markov chain Monte Carlo for Bayesian variable selection...

geomc.vs {geommc}

R Documentation

Markov chain Monte Carlo for Bayesian variable selection using a geometric MH algorithm.

Description

geomc.vs uses a geometric approach to MCMC for performing Bayesian variable selection. It produces MCMC samples from the posterior density of a Bayesian hierarchical feature selection model.

Usage

geomc.vs(
  X,
  y,
  initial = NULL,
  n.iter = 50,
  burnin = 1,
  eps = 0.5,
  symm = TRUE,
  move.prob = c(0.4, 0.4, 0.2),
  model.threshold = 0.5,
  lam = nrow(X)/ncol(X)^2,
  w = sqrt(nrow(X))/ncol(X)
)

Arguments

`X`	The `n\times p` covariate matrix without intercept. The following classes are supported: `matrix` and `dgCMatrix`. No need to center or scale this matrix manually. Scaling is performed implicitly and regression coefficients are returned on the original scale.
`y`	The response vector of length `n`. No need to center or scale.
`initial`	is the initial model (the set of active variables). Default: Null model.
`n.iter`	is the no. of samples needed. Default: 50.
`burnin`	is the value of burnin used to compute the median probability model. Default: 1.
`eps`	is the value for epsilon perturbation. Default: 0.5.
`symm`	indicates if the base density is of symmetric RW-MH. Default: True.
`move.prob`	is the vector of ('addition', 'deletion', 'swap') move probabilities. Default: (0.4,0.4,0.2). move.prob is used only when symm is set to False.
`model.threshold`	The threshold probability to select the covariates for the median model (median.model) and the weighted average model (wam). A covariate will be included in median.model (wam) if its marginal inclusion probability (weighted marginal inclusion probability) is greater than the threshold. Default: 0.5.
`lam`	The slab precision parameter. Default: `n/p^2` as suggested by the theoretical results of Li, Dutta, Roy (2023).
`w`	The prior inclusion probability of each variable. Default: `\sqrt{n}/p`.

Details

geomc.vs provides MCMC samples using the geometric MH algorithm of Roy (2024) for variable selection based on a hierarchical Gaussian linear model with priors placed on the regression coefficients as well as on the model space as follows:

y | X, \beta_0,\beta,\gamma,\sigma^2,w,\lambda \sim N(\beta_01 + X_\gamma\beta_\gamma,\sigma^2I_n)

\beta_i|\beta_0,\gamma,\sigma^2,w,\lambda \stackrel{indep.}{\sim} N(0, \gamma_i\sigma^2/\lambda),~i=1,\ldots,p,

(\beta_0,\sigma^2)|\gamma,w,p \sim p(\beta_0,\sigma^2) \propto 1/\sigma^2

\gamma_i|w,\lambda \stackrel{iid}{\sim} Bernoulli(w)

where X_\gamma is the n \times |\gamma| submatrix of X consisting of those columns of X for which \gamma_i=1 and similarly, \beta_\gamma is the |\gamma| subvector of \beta corresponding to \gamma. geomc.vs provides MCMC samples from the posterior pmf of the models P(\gamma|y), which is available up to a normalizing constant. geomc.vs also returns the marginal inclusion probabilities (mip) computed by the Monte Carlo average as well as the weighted marginal inclusion probabilities (wmip) computed with weights

w_i = P(\gamma^{(i)}|y)/\sum_{k=1}^K P(\gamma^{(k)}|y), i=1,2,...,K

where K is the number of distinct models sampled. Thus, based on the samples \gamma^{(k)}, k=1,2,...,n.iter mip for the jth variable is

mip_j = \sum_{k=1}^{n.iter} I(\gamma^{(k)}_j = 1)/n.iter

and wmip is as

wmip_j = \sum_{k=1}^K w_k I(\gamma^{(k)}_j = 1).

The median.model is the model containing variables j with mip_j > \code{model.threshold} and the wam is the model containing variables j with wmip_j > \code{model.threshold}. The conditional posterior mean of \beta given a model is available in closed form. geomc.vs returns the posterior means of \beta conditional on the median.model and the wam.

Value

A list with components

`samples`	MCMC samples from `P(\gamma\|y)` returned as `p \times`n.iter sparse `dgCMatrix`.
`acceptance.rate`	The acceptance rate based on all samples.
`mip`	The `p` vector of marginal inclusion probabilities of all variables based on post burnin samples.
`median.model`	The median probability model based on post burnin samples.
`beta.med`	The posterior mean of `\beta` (the `p+1` vector c(intercept, regression coefficients)) conditional on the median.model based on post burnin samples.
`wmip`	The `p` vector of weighted marginal inclusion probabilities of all variables based on post burnin samples.
`wam`	The weighted average model based on post burnin samples.
`beta.wam`	The posterior mean of `\beta` (the `p+1` vector c(intercept, regression coefficients)) conditional on the wam based on post burnin samples.
`log.post`	The n.iter vector of log of the unnormalized marginal posterior pmf `P(\gamma\|y)` evaluated at the samples.

Author(s)

Vivekananda Roy

References

Roy, V.(2024) A geometric approach to informative MCMC sampling https://arxiv.org/abs/2406.09010

Li, D., Dutta, S., Roy, V.(2023) Model Based Screening Embedded Bayesian Variable Selection for Ultra-high Dimensional Settings, Journal of Computational and Graphical Statistics, 32, 61-73

Examples

n=50; p=100; nonzero = 3
trueidx <- 1:3
nonzero.value <- 4
TrueBeta <- numeric(p)
TrueBeta[trueidx] <- nonzero.value
rho <- 0.5
xone <- matrix(rnorm(n*p), n, p)
X <- sqrt(1-rho)*xone + sqrt(rho)*rnorm(n)
y <- 0.5 + X %*% TrueBeta + rnorm(n)
result <- geomc.vs(X=X, y=y)
result$samples # the MCMC samples
result$acceptance.rate #the acceptance.rate
result$mip #marginal inclusion probabilities
result$wmip #weighted marginal inclusion probabilities
result$median.model #the median.model
result$wam #the weighted average model
result$beta.med #the posterior mean of regression coefficients for the median.model
result$beta.wam #the posterior mean of regression coefficients for the wam
result$log.post #the log (unnormalized) posterior probabilities of the MCMC samples.

[Package geommc version 0.0.1 Index]