R: Bayesian Variable Selection with Factors for linear...

GibbsBvsF {BayesVarSel}

R Documentation

Bayesian Variable Selection with Factors for linear regression models using Gibbs sampling.

Description

Numerical and factor variable selection from a Bayesian perspective. The posterior distribution is approximated with Gibbs sampling

Usage

GibbsBvsF(
  formula,
  data,
  null.model = paste(as.formula(formula)[[2]], " ~ 1", sep = ""),
  prior.betas = "Robust",
  prior.models = "SBSB",
  n.iter = 10000,
  init.model = "Full",
  n.burnin = 500,
  n.thin = 1,
  time.test = TRUE,
  seed = runif(1, 0, 16091956)
)

Arguments

`formula`	Formula defining the most complex linear model in the analysis. See details.
`data`	data frame containing the data.
`null.model`	A formula defining which is the simplest (null) model. It should be nested in the full model. It is compulsory that the null model contains the intercept and by default, the null model is defined to be the one with just the intercept
`prior.betas`	Prior distribution for regression parameters within each model (to be literally specified). Possible choices include "Robust", "Robust.G", "Liangetal", "gZellner", "ZellnerSiow" (see details in `Bvs`).
`prior.models`	Prior distribution over the model space (to be literally specified). Possible choices (see details) are "Const", "SB", "ConstConst", "SBConst" and "SBSB" (the default).
`n.iter`	The total number of iterations performed after the burn in process.
`init.model`	The model at which the simulation process starts. Options include "Null" (the model only with the covariates specified in `fixed.cov`), "Full" (the model defined by `formula`), "Random" (a randomly selected model) and a vector with (pnum+sum_j L_j) zeros and ones defining a model.
`n.burnin`	Length of burn in, i.e. number of iterations to discard at the beginning.
`n.thin`	Thinning rate. Must be a positive integer. Set 'n.thin' > 1 to save memory and computation time if 'n.iter' is large. Default is 1. This parameter jointly with `n.iter` sets the number of simulations kept and used to construct the estimates so is important to keep in mind that a large value for 'n.thin' can reduce the precision of the results
`time.test`	If TRUE and the number of variables is large (>=21) a preliminary test to estimate computational time is performed.
`seed`	A seed to initialize the random number generator

Details

In practical terms, GibbsBvsF can be understood as a version of GibbsBvs in the presence of factors. The methodology implemented in GibbsBvsF to handle variable selection problems with factors has been proposed in Garcia-Donato and Paulo (2018) leading to a method for which results do not depend on how the factors are coded (eg. via contrast).

Internally, a rank defficient representation of factors using dummies is used and the number of competing models considered is

2^(pnum+sum_j L_j),

where pnum is the number of numerical variables and L_j is the number of levels in factor j.

A main difference with Bvs and GibbsBvs (due to the presence of factors) concerns the prior probabilities on the model space:

The options prior.models="SBSB", prior.models="ConstConst" and prior.models="SBConst" acknowledge the "grouped" nature of the dummy variables representing factors through the use of two stage priors described in Garcia-Donato and Paulo (2021). In the first stage probabilities over factors and numerical variables are specified and (conditional on these) within the second stage the probablities are apportioned over the different submodels defined by the dummies. The default (and recommended, for the reasons argued in Garcia-Donato and Paulo,2021) option is "SBSB" which uses in both stages an assignment of the type Scott-Berger so inversely proportional to the number of models of the same dimension. The option "ConstConst" implements a uniform prior for both stages while "SBConst" uses a Scott-Berger prior in the first stage and it is uniform in the second stage. Within all these priors, the prior inclusion probabilities of factors and numerical variables are 1/2.

The options prior.models="Const" and prior.models="SB" do not have a staged structure and "Const" apportions the prior probabilities uniformly over all possible models (2^(pnum+sum_j L_j)) and in "SB" the probability is inversely proportional to the number of any model of the same dimension. In these cases, prior inclusion probabilities of factors and numerical variables depend on the number of levels of factors and, in general, are not 1/2.

Value

GibbsBvsF returns an object of class Bvs with the following elements:

`time`	The internal time consumed in solving the problem
`lmfull`	The `lm` class object that results when the model defined by `formula` is fitted by `lm`
`lmnull`	The `lm` class object that results when the model defined by `fixed.cov` is fitted by `lm`
`variables`	The name of all the potential explanatory variables (numerical or factors)
`n`	Number of observations
`p`	Number of explanatory variables (both numerical and factors) to select from
`k`	Number of fixed variables
`HPMbin`	The binary expression of the most probable model found.
`inclprob`	A named vector with the estimates of the inclusion probabilities of all the variables.
`jointinclprob`	A `data.frame` with the estimates of the joint inclusion probabilities of all the variables.
`postprobdim`	Estimates of posterior probabilities of the number of active variables in the true model (hence ranking from `k` to `k+p`).
`modelslogBF`	A matrix with both the binary representation of the active variables in the MCMC after the burning period and the Bayes factor (log scale) of that model to the null model.
`modelswllogBF`	A matrix with both the binary representation of the active variables (at the level of the levels in the factors) in the MCMC after the burning period and the Bayes factor (log scale) of that model to the null model.
`call`	The `call` to the function.
`C`	An estimation of the normalizing constant (C=sum Bi Pr(Mi), for Mi in the model space) using the method in George and McCulloch (1997).
`positions`	A binary matrix with `p` rows and (pnum+sum_j L_j) columns. The 1's identify, for each variable (row) the position (column) of dummies (in case of factor) or of the numerical variable grouped on that variable. (Its use is conceived for internal purposes).
`positionsx`	A `p` dimensional binary vector, stating which of the competing variables is a numerical variable. (Its use is conceived for internal purposes).
`prior.betas`	prior.betas
`prior.models`	prior.models
`method`	`gibbsWithFactors`

Author(s)

Gonzalo Garcia-Donato and Anabel Forte

References

Garcia-Donato, G. and Martinez-Beneito, M.A. (2013)<DOI:10.1080/01621459.2012.742443> On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association, 108: 340-352.

Garcia-Donato, G. and Paulo, R. (2021) Variable selection in the presence of factors: a model selection perspective. Journal of American Statistical Association, Ahead-of- print(1-11).

George E. and McCulloch R. (1997) Approaches for Bayesian variable selection. Statistica Sinica, 7, 339:372.

Examples


## Not run: 
data(diabetes, package="faraway")

#remove NA's and the column with the id of samples:
diabetes2<- na.omit(diabetes)[,-1]

#For reproducibility:
set.seed(16091956)
#Now run the main instruction
diabetesVS<- GibbsBvsF(formula= glyhb ~ ., data=diabetes2, n.iter=100000, n.burnin=5000)

summary(diabetesVS)

#A plot of the dimension of the true model,
plot(diabetesVS, option="dimension")

#A joint inclusion plot
plot(diabetesVS, option="joint")

#Now a similar exercise but with fixed variables:
diabetesVS2<- GibbsBvsF(formula= glyhb ~ ., null.model= glyhb ~ chol+stab.glu,
		                   data=diabetes2, n.iter=100000, n.burnin=5000)


#and with fixed factors:
diabetesVS3<- GibbsBvsF(formula= glyhb ~ ., null.model= glyhb ~ chol+stab.glu+location,
		                   data=diabetes2, n.iter=100000, n.burnin=5000)



## End(Not run)

[Package BayesVarSel version 2.2.5 Index]