GibbsBvsF {BayesVarSel}R Documentation

Bayesian Variable Selection with Factors for linear regression models using Gibbs sampling.

Description

Numerical and factor variable selection from a Bayesian perspective. The posterior distribution is approximated with Gibbs sampling

Usage

GibbsBvsF(
  formula,
  data,
  null.model = paste(as.formula(formula)[[2]], " ~ 1", sep = ""),
  prior.betas = "Robust",
  prior.models = "SBSB",
  n.iter = 10000,
  init.model = "Full",
  n.burnin = 500,
  n.thin = 1,
  time.test = TRUE,
  seed = runif(1, 0, 16091956)
)

Arguments

formula

Formula defining the most complex linear model in the analysis. See details.

data

data frame containing the data.

null.model

A formula defining which is the simplest (null) model. It should be nested in the full model. It is compulsory that the null model contains the intercept and by default, the null model is defined to be the one with just the intercept

prior.betas

Prior distribution for regression parameters within each model (to be literally specified). Possible choices include "Robust", "Robust.G", "Liangetal", "gZellner", "ZellnerSiow" (see details in Bvs).

prior.models

Prior distribution over the model space (to be literally specified). Possible choices (see details) are "Const", "SB", "ConstConst", "SBConst" and "SBSB" (the default).

n.iter

The total number of iterations performed after the burn in process.

init.model

The model at which the simulation process starts. Options include "Null" (the model only with the covariates specified in fixed.cov), "Full" (the model defined by formula), "Random" (a randomly selected model) and a vector with (pnum+sum_j L_j) zeros and ones defining a model.

n.burnin

Length of burn in, i.e. number of iterations to discard at the beginning.

n.thin

Thinning rate. Must be a positive integer. Set 'n.thin' > 1 to save memory and computation time if 'n.iter' is large. Default is 1. This parameter jointly with n.iter sets the number of simulations kept and used to construct the estimates so is important to keep in mind that a large value for 'n.thin' can reduce the precision of the results

time.test

If TRUE and the number of variables is large (>=21) a preliminary test to estimate computational time is performed.

seed

A seed to initialize the random number generator

Details

In practical terms, GibbsBvsF can be understood as a version of GibbsBvs in the presence of factors. The methodology implemented in GibbsBvsF to handle variable selection problems with factors has been proposed in Garcia-Donato and Paulo (2018) leading to a method for which results do not depend on how the factors are coded (eg. via contrast).

Internally, a rank defficient representation of factors using dummies is used and the number of competing models considered is

2^(pnum+sum_j L_j),

where pnum is the number of numerical variables and L_j is the number of levels in factor j.

A main difference with Bvs and GibbsBvs (due to the presence of factors) concerns the prior probabilities on the model space:

The options prior.models="SBSB", prior.models="ConstConst" and prior.models="SBConst" acknowledge the "grouped" nature of the dummy variables representing factors through the use of two stage priors described in Garcia-Donato and Paulo (2021). In the first stage probabilities over factors and numerical variables are specified and (conditional on these) within the second stage the probablities are apportioned over the different submodels defined by the dummies. The default (and recommended, for the reasons argued in Garcia-Donato and Paulo,2021) option is "SBSB" which uses in both stages an assignment of the type Scott-Berger so inversely proportional to the number of models of the same dimension. The option "ConstConst" implements a uniform prior for both stages while "SBConst" uses a Scott-Berger prior in the first stage and it is uniform in the second stage. Within all these priors, the prior inclusion probabilities of factors and numerical variables are 1/2.

The options prior.models="Const" and prior.models="SB" do not have a staged structure and "Const" apportions the prior probabilities uniformly over all possible models (2^(pnum+sum_j L_j)) and in "SB" the probability is inversely proportional to the number of any model of the same dimension. In these cases, prior inclusion probabilities of factors and numerical variables depend on the number of levels of factors and, in general, are not 1/2.

Value

GibbsBvsF returns an object of class Bvs with the following elements:

time

The internal time consumed in solving the problem

lmfull

The lm class object that results when the model defined by formula is fitted by lm

lmnull

The lm class object that results when the model defined by fixed.cov is fitted by lm

variables

The name of all the potential explanatory variables (numerical or factors)

n

Number of observations

p

Number of explanatory variables (both numerical and factors) to select from

k

Number of fixed variables

HPMbin

The binary expression of the most probable model found.

inclprob

A named vector with the estimates of the inclusion probabilities of all the variables.

jointinclprob

A data.frame with the estimates of the joint inclusion probabilities of all the variables.

postprobdim

Estimates of posterior probabilities of the number of active variables in the true model (hence ranking from k to k+p).

modelslogBF

A matrix with both the binary representation of the active variables in the MCMC after the burning period and the Bayes factor (log scale) of that model to the null model.

modelswllogBF

A matrix with both the binary representation of the active variables (at the level of the levels in the factors) in the MCMC after the burning period and the Bayes factor (log scale) of that model to the null model.

call

The call to the function.

C

An estimation of the normalizing constant (C=sum Bi Pr(Mi), for Mi in the model space) using the method in George and McCulloch (1997).

positions

A binary matrix with p rows and (pnum+sum_j L_j) columns. The 1's identify, for each variable (row) the position (column) of dummies (in case of factor) or of the numerical variable grouped on that variable. (Its use is conceived for internal purposes).

positionsx

A p dimensional binary vector, stating which of the competing variables is a numerical variable. (Its use is conceived for internal purposes).

prior.betas

prior.betas

prior.models

prior.models

method

gibbsWithFactors

Author(s)

Gonzalo Garcia-Donato and Anabel Forte

References

Garcia-Donato, G. and Martinez-Beneito, M.A. (2013)<DOI:10.1080/01621459.2012.742443> On sampling strategies in Bayesian variable selection problems with large model spaces. Journal of the American Statistical Association, 108: 340-352.

Garcia-Donato, G. and Paulo, R. (2021) Variable selection in the presence of factors: a model selection perspective. Journal of American Statistical Association, Ahead-of- print(1-11).

George E. and McCulloch R. (1997) Approaches for Bayesian variable selection. Statistica Sinica, 7, 339:372.

See Also

plot.Bvs for several plots of the result.

Under construction: BMAcoeff for obtaining model averaged simulations of regression coefficients and predict.Bvs for predictions.

See GibbsBvs and Bvs when no factors are involved.

Examples


## Not run: 
data(diabetes, package="faraway")

#remove NA's and the column with the id of samples:
diabetes2<- na.omit(diabetes)[,-1]

#For reproducibility:
set.seed(16091956)
#Now run the main instruction
diabetesVS<- GibbsBvsF(formula= glyhb ~ ., data=diabetes2, n.iter=100000, n.burnin=5000)

summary(diabetesVS)

#A plot of the dimension of the true model,
plot(diabetesVS, option="dimension")

#A joint inclusion plot
plot(diabetesVS, option="joint")

#Now a similar exercise but with fixed variables:
diabetesVS2<- GibbsBvsF(formula= glyhb ~ ., null.model= glyhb ~ chol+stab.glu,
		                   data=diabetes2, n.iter=100000, n.burnin=5000)


#and with fixed factors:
diabetesVS3<- GibbsBvsF(formula= glyhb ~ ., null.model= glyhb ~ chol+stab.glu+location,
		                   data=diabetes2, n.iter=100000, n.burnin=5000)



## End(Not run)


[Package BayesVarSel version 2.2.5 Index]