gamselBayes {gamselBayes}    R Documentation

Bayesian generalized additive model selection including a fast variational option

Description

Selection of predictors and the nature of their impact on the mean response (linear versus non-linear) is a fundamental problem in regression analysis. This function uses the generalized additive models framework for estimating predictor effects. An approximate Bayesian inference approach is used, with two options for achieving this: (1) Markov chain Monte Carlo and (2) mean field variational Bayes.

Usage

gamselBayes(y,Xlinear  = NULL,Xgeneral = NULL,method = "MCMC",lowerMakesSparser = NULL,   
            family = "gaussian",verbose = TRUE,control = gamselBayes.control())

Arguments

y

Vector containing the response data. If 'family = "gaussian"' then the response data are modelled as being continuous with a Gaussian distribution. If 'family = "binomial"' then the response data must be binary with 0/1 coding.
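A response stored as a factor, logical or character vector needs recoding before it can be used with 'family = "binomial"'. The following is a minimal base R sketch; the variable name yRaw and the "yes"/"no" labels are hypothetical.

```r
# Hypothetical raw response stored as a factor with labels "no"/"yes".
yRaw <- factor(c("yes", "no", "no", "yes", "yes"))

# Recode to the 0/1 coding required when family = "binomial".
y <- as.numeric(yRaw == "yes")

# Sanity check: only the values 0 and 1 should be present.
stopifnot(all(y %in% c(0, 1)))
```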

Xlinear

A data frame with number of rows equal to the length of y. Each column contains data for a predictor which potentially has a linear or zero effect, but not a nonlinear effect. Binary predictors must be inputted through this data frame.

Xgeneral

A data frame with number of rows equal to the length of y. Each column contains data for a predictor which potentially has a linear, nonlinear or zero effect. Binary predictors cannot be inputted through this data frame.
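Because binary predictors are only permitted in Xlinear, it can help to split a single predictor data frame automatically. The sketch below uses base R only; the data frame X and its column names are hypothetical, and treating "at most two distinct values" as the definition of binary is an assumption made for illustration.

```r
# Hypothetical predictor data: one binary and two continuous columns.
set.seed(1)
X <- data.frame(smoker = rbinom(50, 1, 0.5),
                age    = runif(50, 20, 70),
                bmi    = rnorm(50, 25, 4))

# Flag columns that take at most two distinct values.
isBinary <- sapply(X, function(x) length(unique(x)) <= 2)

# Binary predictors go to Xlinear; the rest are candidates for Xgeneral.
Xlinear  <- X[, isBinary, drop = FALSE]
Xgeneral <- X[, !isBinary, drop = FALSE]
```

Continuous predictors that should be restricted to a linear or zero effect can also be moved into Xlinear by hand.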

method

Character string for specifying the method to be used:
"MCMC" = Markov chain Monte Carlo,
"MFVB" = mean field variational Bayes.
The default is "MCMC".

lowerMakesSparser

A threshold parameter between 0 and 1, which is such that lower values lead to sparser fits.

family

Character string for specifying the response family:
"gaussian" = response assumed to be Gaussian with constant variance,
"binomial" = response assumed to be binary.
The default is "gaussian".

verbose

Boolean variable for specifying whether or not progress messages are printed to the console. The default is TRUE.

control

Function for controlling the spline bases, Markov chain Monte Carlo sampling, mean field variational Bayes and other specifications.

Details

Generalized additive model selection via approximate Bayesian inference is provided. Bayesian mixed model-based penalized splines with spike-and-slab-type coefficient prior distributions are used to facilitate fitting and selection. The approximate Bayesian inference engine options are: (1) Markov chain Monte Carlo and (2) mean field variational Bayes. Markov chain Monte Carlo has better Bayesian inferential accuracy, but requires a longer run-time. Mean field variational Bayes is faster, but less accurate. The methodology is described in He and Wand (2021) <arXiv:2201.00412>.
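The idea behind a spike-and-slab-type coefficient prior can be illustrated by simulation. In the sketch below, an auxiliary indicator switches each coefficient between exactly zero (the spike) and a continuous draw (the slab). The hyperparameter values are illustrative assumptions rather than package defaults, and the Gaussian slab is chosen purely for simplicity; the prior actually used by the package is described in He and Wand (2021).

```r
# Illustrative hyperparameter values (assumptions, not package defaults).
set.seed(1)
nDraws    <- 10000
rhoBeta   <- 0.5   # Bernoulli probability that a coefficient is switched on
sigmaBeta <- 2     # standard deviation of the (Gaussian, illustrative) slab

# Auxiliary indicator: 1 = slab (effect present), 0 = spike (effect is zero).
gammaBeta <- rbinom(nDraws, 1, rhoBeta)

# Slab draw, before multiplication by the indicator.
betaTilde <- rnorm(nDraws, mean = 0, sd = sigmaBeta)

# Effective coefficient: exactly zero whenever the indicator is zero.
beta <- gammaBeta * betaTilde

# Proportion of exact zeros; should be close to 1 - rhoBeta.
mean(beta == 0)
```

The names gammaBeta, betaTilde, rhoBeta and sigmaBeta mirror the components of the returned object, which stores the indicator and the unmultiplied coefficient separately.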

Value

An object of class gamselBayes, which is a list with the following components:

method

the value of method.

family

the value of family.

Xlinear

the inputted design matrix containing predictors that can only have linear effects.

Xgeneral

the inputted design matrix containing predictors that potentially have non-linear effects.

rangex

the value of the control parameter rangex.

intKnots

the value of the control parameter intKnots.

truncateBasis

the value of the control parameter truncateBasis.

numBasis

the value of the control parameter numBasis.

MCMC

a list such that each component is the retained Markov chain Monte Carlo (MCMC) sample for a model parameter. The components are:
beta0 = overall intercept.
betaTilde = linear component coefficients without multiplication by the gammaBeta values.
gammaBeta = linear component coefficients spike-and-slab auxiliary indicator variables.
sigmaBeta = standard deviation of the linear component coefficients.
rhoBeta = the Bernoulli distribution probability parameter of the linear component coefficients spike-and-slab auxiliary indicator variables.
uTilde = spline basis function coefficients without multiplication by the gammaU values. The MCMC samples are stored in a list. Each list component corresponds to a predictor that is treated as potentially having a non-linear effect, and is a matrix with columns corresponding to the spline basis function coefficients for that predictor and rows corresponding to the retained MCMC samples.
gammaU = spline basis coefficients spike-and-slab auxiliary indicator variables. The MCMC samples are stored in a list. Each list component corresponds to a predictor that is treated as potentially having a non-linear effect, and is a matrix with columns corresponding to the spline basis function coefficients for that predictor and rows corresponding to the retained MCMC samples.
rhoU = the Bernoulli distribution probability parameters of the spline basis component coefficients spike-and-slab auxiliary indicator variables. The MCMC samples are stored in a matrix. Each column corresponds to a predictor that is treated as potentially having a non-linear effect. The rows of the matrix correspond to the retained MCMC samples.
sigmaEps = error standard deviation.
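Since betaTilde is stored without multiplication by gammaBeta, samples of the effective linear coefficients can be recovered as the elementwise product of the two. The sketch below uses mock samples in place of a real fit so that it runs without the package; it assumes that betaTilde and gammaBeta are matrices with rows corresponding to retained samples and columns to predictors, a layout that is stated above for uTilde and gammaU but only assumed here for the linear components.

```r
# Mock MCMC output for two linear-effect predictors (stand-in for fit$MCMC);
# rows are assumed to be retained samples, columns predictors.
set.seed(1)
nSamp <- 1000
MCMC <- list(
   betaTilde = cbind(rnorm(nSamp, 1.0, 0.1), rnorm(nSamp, 0.0, 0.1)),
   gammaBeta = cbind(rbinom(nSamp, 1, 0.95), rbinom(nSamp, 1, 0.05)))

# Effective coefficient samples: indicator times unmultiplied coefficient.
betaSamp <- MCMC$gammaBeta * MCMC$betaTilde

# Posterior inclusion probabilities and posterior means, one per predictor.
inclProb <- colMeans(MCMC$gammaBeta)
betaHat  <- colMeans(betaSamp)
round(rbind(inclProb, betaHat), 2)
```

With a real fit, the mock list would be replaced by the MCMC component of the returned object.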

MFVB

a list such that each component is the mean field variational Bayes approximate posterior density function, or q-density, parameters. The components are:
beta0 = a vector with 2 entries, consisting of the mean and variance of the Univariate Normal q-density of the overall intercept.
betaTilde = a two-component list containing the Multivariate Normal q-density parameters of linear component coefficients without multiplication by the means of the gammaBeta q-densities. The list components are: mu.q.betaTilde, the mean vector; Sigma.q.betaTilde, the covariance matrix.
gammaBeta = a vector containing the Bernoulli q-density means of the linear component coefficients spike-and-slab auxiliary indicator variables.
sigmaBeta = a vector with 2 entries, consisting of the Inverse Gamma q-density shape and rate parameters of the variance of the linear component coefficients.
rhoBeta = a vector with 2 entries, consisting of the Beta q-density shape parameters of the Bernoulli probability parameter of the linear component coefficients spike-and-slab auxiliary indicator variables.
uTilde = a two-component list containing the Multivariate Normal q-density parameters of the spline basis function coefficients without multiplication by the means of the gammaU q-densities. The list components are: mu.q.uTilde, the mean vectors for each predictor that is treated as potentially having a non-linear effect; sigsq.q.uTilde, the diagonal entries of the covariance matrices of each predictor that is treated as potentially having a non-linear effect.
gammaU = a list containing the q-density means of the spline basis coefficients spike-and-slab auxiliary indicator variables. Each list component corresponds to a predictor that is treated as potentially having a non-linear effect.
rhoU = a two-component list with components A.q.rho.u and B.q.rho.u. The A.q.rho.u list component is a vector of Beta q-density first (one plus the power of rho) shape parameters corresponding to the spline basis coefficients spike-and-slab auxiliary indicator variables for each predictor that is treated as potentially having a non-linear effect. The B.q.rho.u list component is a vector of Beta q-density second (one plus the power of 1-rho) shape parameters corresponding to the spline basis coefficients spike-and-slab auxiliary indicator variables for each predictor that is treated as potentially having a non-linear effect.
sigmaEps = a vector with 2 entries, consisting of the Inverse Gamma q-density shape and rate parameters of the error variance.
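The q-density parameters can be converted into point and interval summaries using standard distributional facts. The sketch below uses a mock list with illustrative values in place of a real fit; only the component layout follows the description above.

```r
# Mock MFVB output (stand-in for fit$MFVB); values are illustrative only.
MFVB <- list(
   beta0    = c(0.2, 0.001),   # mean and variance of the Normal q-density
   rhoBeta  = c(3, 5),         # Beta q-density shape parameters
   sigmaEps = c(501, 480))     # Inverse Gamma shape and rate (error variance)

# Approximate 95% credible interval for the overall intercept.
beta0CI <- MFVB$beta0[1] + c(-1, 1) * qnorm(0.975) * sqrt(MFVB$beta0[2])

# Mean of a Beta(A, B) q-density is A / (A + B).
rhoBetaMean <- MFVB$rhoBeta[1] / sum(MFVB$rhoBeta)

# Mean of an Inverse Gamma(shape, rate) q-density is rate / (shape - 1),
# valid for shape > 1. This summarises the error variance, not the
# error standard deviation.
sigsqEpsMean <- MFVB$sigmaEps[2] / (MFVB$sigmaEps[1] - 1)
```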

effectTypeHat

an array of character strings, with entry either "zero", "linear" or "nonlinear", signifying the estimated effect type for each candidate predictor.
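A quick way to summarise the selection result is to tabulate the effect types and list the predictors that were retained. The sketch below uses a mock vector in place of the effectTypeHat component of a real fit; the predictor names and their ordering relative to effectTypeHat are hypothetical.

```r
# Mock estimated effect types (stand-in for fit$effectTypeHat).
effectTypeHat <- c("linear", "nonlinear", "linear", "zero")
predNames     <- c("x1", "x2", "x3", "x4")   # hypothetical predictor names

# Counts of each effect type, with all three levels always shown.
table(factor(effectTypeHat, levels = c("zero", "linear", "nonlinear")))

# Names of predictors retained in the model (non-zero effect).
selected <- predNames[effectTypeHat != "zero"]
```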

meanXlinear

an array containing the sample means of each column of Xlinear.

sdXlinear

an array containing the sample standard deviations of each column of Xlinear.

meanXgeneral

an array containing the sample means of each column of Xgeneral.

sdXgeneral

an array containing the sample standard deviations of each column of Xgeneral.
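The presence of these means and standard deviations suggests that predictors are standardised before fitting, although that is an inference from the listed components rather than something stated here. Under that assumption, new observations could be put on the same scale as follows; all data and names in the sketch are hypothetical and computed in base R rather than taken from a real fit.

```r
# Hypothetical training predictors and the summaries that would be stored.
set.seed(1)
Xgeneral     <- data.frame(x2 = runif(100), x3 = runif(100, 5, 10))
meanXgeneral <- sapply(Xgeneral, mean)
sdXgeneral   <- sapply(Xgeneral, sd)

# New observations, standardised with the training means and standard
# deviations rather than with their own.
Xnew    <- data.frame(x2 = c(0.1, 0.9), x3 = c(6, 9))
XnewStd <- scale(Xnew, center = meanXgeneral, scale = sdXgeneral)
```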

Author(s)

Virginia X. He virginia.x.he@student.uts.edu.au and Matt P. Wand matt.wand@uts.edu.au

References

Chouldechova, A. and Hastie, T. (2015). Generalized additive model selection. <arXiv:1506.03850v2>.

He, V.X. and Wand, M.P. (2021). Bayesian generalized additive model selection including a fast variational option. <arXiv:2201.00412>.

Examples

library(gamselBayes) 

# Generate some simple regression-type data:

set.seed(1) ; n <- 1000 ; x1 <- rbinom(n,1,0.5) ; 
x2 <- runif(n) ; x3 <- runif(n) ; x4 <- runif(n)
y <- x1 + sin(2*pi*x2) - x3 + rnorm(n)
Xlinear <- data.frame(x1) ; Xgeneral <- data.frame(x2,x3,x4)

# Obtain a gamselBayes() fit for the data, using Markov chain Monte Carlo:

fitMCMC <- gamselBayes(y,Xlinear,Xgeneral)
summary(fitMCMC) ; plot(fitMCMC) ; checkChains(fitMCMC)

# Obtain a gamselBayes() fit for the data, using mean field variational Bayes:

fitMFVB <- gamselBayes(y,Xlinear,Xgeneral,method = "MFVB")
summary(fitMFVB) ; plot(fitMFVB)

if (require("Ecdat"))
{
   # Obtain a gamselBayes() fit for data on schools in California, U.S.A.:

   Caschool$log.avginc <- log(Caschool$avginc)
   mathScore <- Caschool$mathscr
   Xgeneral <- Caschool[,c("mealpct","elpct","calwpct","compstu","log.avginc")]

   # Obtain a gamselBayes() fit for the data, using Markov chain Monte Carlo:

   fitMCMC <- gamselBayes(y = mathScore,Xgeneral = Xgeneral)
   summary(fitMCMC) ; plot(fitMCMC) ; checkChains(fitMCMC)

   # Obtain a gamselBayes() fit for the data, using mean field variational Bayes:

   fitMFVB <- gamselBayes(y = mathScore,Xgeneral = Xgeneral,method = "MFVB")
   summary(fitMFVB) ; plot(fitMFVB)
}

[Package gamselBayes version 2.0-1 Index]