R: Bayesian variable selection for the Pogit model

pogitBvs {pogit}

R Documentation

Bayesian variable selection for the Pogit model

Description

This function performs Bayesian variable selection for a Poisson-Logistic (Pogit) model via spike and slab priors. For posterior inference, a MCMC sampling scheme is used that relies on augmenting the observed data by the unobserved counts and involves only Gibbs sampling steps.

Usage

pogitBvs(
  y,
  E = NULL,
  X,
  W = NULL,
  validation = NULL,
  method = "val",
  model = list(),
  prior = list(),
  mcmc = list(),
  start = list(),
  BVS = TRUE
)

Arguments

`y`	an integer vector of observed counts for units i = 1,...,I
`E`	an (optional) vector containing total exposure times (offset); should be `NULL` or an integer vector of length equal to the number of counts.
`X`	a design matrix in the Poisson part of the joint model
`W`	a design matrix in the logit part of the joint model (can be a subset of `X`) or `NULL`, if the same design matrix is used in both sub-models, i.e. `W` = `X`.
`validation`	a two-column data frame or list with the number of reported cases (= `v`) in the validation sample and the number of total cases (= `m`) subject to the fallible reporting process (i.e. validation sample size) for each unit (or sub-category); required if `method =` "`val`", otherwise `NULL`. The number of rows must conform with the number of rows in `W` or with the number of units I (if `X = W`), respectively.
`method`	the method to be used to obtain parameter identification: The default method "`val`" requires a small sample of validation data (see `validation`). If the information on all or some parameters of the reporting process is not provided by validation data, an informative prior distribution for the regression effects in the logit sub-model can be used (`method` = "`infprior`"). This prior information is encoded in a normal distribution instead of the spike and slab prior (see the details for `prior`).
`model`	a list specifying the structure of the model (see details)
`prior`	an (optional) list of prior settings and hyper-parameters controlling the priors (see details)
`mcmc`	an (optional) list of MCMC sampling options (see details)
`start`	an (optional) list containing starting values for the regression effects in both sub-models (see details)
`BVS`	if `TRUE` (default), Bayesian variable selection (in at least one part of the joint model) is performed to identify regressors with a non-zero effect; otherwise, an unrestricted model is estimated (without variable selection).

Details

The method provides Bayesian variable selection for regression models of count data subject to under-reporting using mixture priors with a spike and a slab component. By augmenting the observed count data with the unobserved counts, the resulting model can be factorized into a Poisson and a binomial logit model part. Hence, for this two-part model, sampling algorithms for a Poisson and a binomial logit model can be used which are described in poissonBvs and logitBvs. Bayesian variable selection is incorporated in both parts of the joint model using mixture priors with a Dirac spike and (by default) a Student-t slab. The implementation relies on the representation of the respective model as a Gaussian regression model in auxiliary variables (see again the help for the respective function). Though variable selection is primarily used to identify regressors with a non-zero effect, it can also be useful for identification of the Pogit model.

By default, identification of the Pogit model is achieved by additional information on the reporting process through validation data and incorporation of variable selection. If the information on the parameters of the reporting process is not provided by validation data, the identification of the model parameters has to be guaranteed by specifying an informative prior distribution (see arguments).

To model under-reported clustered data, a cluster-specific random intercept can be included in both model parts of the Pogit model to account for dependence within clusters. Bayesian variance selection is applied to determine whether there is within-cluster dependence in either part of the model. Note that an observation-specific random intercept in the Poisson sub-model yields an overdispersed Pogit model for unobserved heterogeneity.

For details concerning the sampling algorithm see Dvorzak and Wagner (2016).

Details for model specification (see arguments):

model:

deltaBetafix, deltaAlphafix: indicator vectors of length ncol(X)-1 and ncol(W)-1, respectively, for the Poisson and the logit sub-model, to specify which regression effects are subject to selection (i.e., 0 = subject to selection, 1 = fix in the model); defaults to vectors of zeros.
gammaBetafix, gammaAlphafix: indicators for variance selection of the random intercept term in the Poisson and the logit sub-model (i.e., 0 = with variance selection (default), 1 = no variance selection); only used if a random intercept is included in either part of the joint model (see riBeta and riAlpha, respectively).
riBeta, riAlpha: logical. If TRUE, a cluster-specific random intercept is included in the respective part of the joint model; defaults to FALSE.
clBetaID, clAlphaID: numeric vectors of length equal to the number of observations containing the cluster ID c = 1,...,C for each unit (or sub-category) in the respective sub-model (required if riBeta=TRUE or riAlpha=TRUE, respectively).
subcat: a factor variable of length equal to the number of units that specifies for which sub-category validation data are available (is required if W is a subset of X). If NULL (default), it is presumed that validation data are available for each unit (see also examples).

prior:

slabP, slabL: distribution of the slab component in the Poisson and logit sub-model, i.e. "Student" (default) or "Normal".
psi.nuP, psi.nuL: hyper-parameter of the Student-t slab in the respective sub-model (used for a Student-t slab); defaults to 5.
m0b, m0a: prior mean for the intercept parameter in the Poisson and the logit model; defaults to 0. If the argument method = "infprior", the specification of m0a is required.
M0b, M0a: prior variance for the intercept parameter in the Poisson and the logit model; defaults to 100.
bj0, aj0: a vector of prior means for the regression effects in the Poisson and the logit sub-model (which is encoded in a normal distribution, see notes); defaults to a vector of zeros. If the argument method = "infprior", the specification of aj0 is mandatory.
VP, VL: variance of the slab in the respective sub-model; defaults to 5.
wBeta, wAlpha: hyper-parameters of the Beta-prior for the mixture weights \omega_\beta and \omega_\alpha in the respective sub-model; defaults to c(wa0=1, wb0=1), i.e. a uniform distribution.
piBeta, piAlpha: hyper-parameters of the Beta-prior for the mixture weights \pi_\beta and \pi_\alpha in the respective sub-model; defaults to c(pa0=1, pb0=1), i.e. a uniform distribution.

mcmc:

M: number of MCMC iterations after the burn-in phase; defaults to 8000.
burnin: number of MCMC iterations discarded as burn-in; defaults to 2000.
thin: thinning parameter; defaults to 1.
startsel: number of MCMC iterations drawn from the unrestricted model (e.g., burnin/2); defaults to 1000.
verbose: MCMC progress report in each verbose-th iteration step; defaults to 500. If verbose=0, no output is generated.
msave: returns additional output with variable selection details (i.e. posterior samples for \omega_\beta, \omega_\alpha, \delta_\beta, \delta_\alpha, \pi_\beta, \pi_\alpha, \gamma_\beta, \gamma_\alpha); defaults to FALSE.

start:

beta: a vector of length ncol(X) containing starting values for the regression parameters \beta in the Poisson model part. By default, a Poisson glm is fitted to the observed counts.
alpha: a vector of length ncol(W) containing starting values for the regression parameters \alpha in the logit model part. By default, a binomial glm is fitted to the validation data for method = "val". If method = "infprior", starting values for \alpha are sampled from the (informative) prior distribution.
firth: logical. If TRUE, a logistic regression model applying Firth's correction to the likelihood using logistf is fitted to the validation data (only used if method = "val").

Value

The function returns an object of class "pogit" with methods print.pogit, summary.pogit and plot.pogit.

An object of class "pogit" is a list containing the following elements:

`samplesL`	a named list containing the samples from the posterior distribution of the parameters in the logit part of the joint model (see also `msave`): `alpha, thetaAlpha` regression coefficients `\alpha` and `\theta_\alpha` `pdeltaAlpha` P(`\delta_\alpha`=1) `psiAlpha` scale parameter `\psi_\alpha` of the slab component `pgammaAlpha` P(`\gamma_\alpha`=1) `ai` cluster-specific random intercept
`samplesP`	a named list containing the samples from the posterior distribution of the parameters in the Poisson part of the joint model (see also `msave`): `beta, thetaBeta` regression coefficients `\beta` and `\theta_\beta` `pdeltaBeta` P(`\delta_\beta`=1) `psiBeta` scale parameter `\psi_\beta` of the slab component `pgammaBeta` P(`\gamma_\beta`=1) `bi` cluster-specific random intercept
`data`	a list containing the data `y`, `E`, `X`, `W`, `val` and `subcat`
`model.logit`	a list containing details on the model specification in the logit sub-model, see details for `model`
`model.pois`	a list containing details on the model specification in the Poisson sub-model, see details for `model`
`mcmc`	see details for `mcmc`
`prior.logit`	see details for `prior`
`prior.pois`	see details for `prior`
`dur`	a list containing the total runtime (`total`) and the runtime after burn-in (`durM`), in seconds
`BVS`	see arguments
`method`	see arguments
`start`	a list containing starting values, see arguments
`family`	"pogit"
`call`	function call

Note

If method = "infprior", an informative prior for the regression parameters in the logit model is required to guarantee identification of the model parameters. Otherwise, identification of the Pogit model may be weak and inference will be biased.

Author(s)

Michaela Dvorzak <m.dvorzak@gmx.at>, Helga Wagner

References

Dvorzak, M. and Wagner, H. (2016). Sparse Bayesian modelling of underreported count data. Statistical Modelling, 16(1), 24 - 46, doi:10.1177/1471082x15588398.

Examples

## Not run: 
## Examples below (except for m3) should take 3-4 minutes. 

## ------ (use simul1) ------
# load simulated data set 'simul1'
data(simul1)
y <- simul1$y
E <- simul1$E
X <- as.matrix(simul1[, -c(1,2,8,9)]) # W = X
validation <- simul1[, c("m", "v"), drop = FALSE]

# function call (with specific MCMC settings)
m1 <- pogitBvs(y = y, E = E, X = X, validation = validation, 
               mcmc = list(M = 4000, thin = 5, verbose = 1000))

# print, summarize and plot results
print(m1)
summary(m1)
plot(m1)

# show traceplots disregarding burn-in and thinning
plot(m1, burnin = FALSE, thin = FALSE)
# show density plot of MCMC draws
plot(m1, type = "density")

# informative prior instead of validation data (change prior settings)
# e.g. available prior information on reporting probabilities 
p.a0 <- 0.9 
p.a  <- c(0.125, 0.5, 0.5, 0.5)
m0a_inf <- log(p.a0/(1 - p.a0))  # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a))    # prior information for alpha

prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m2 <- pogitBvs(y = y, E = E, X = X, method = "infprior", prior = prior.set, 
               mcmc = list(M = 4000, burnin = 2000, thin = 2), BVS = FALSE)
print(m2)
summary(m2)
plot(m2)
plot(m2, type = "acf", lag.max = 30)

## ------ (use simul2) ------
# complex model (with a long (!) runtime)

# load simulated data set 'simul2'
data(simul2)
y <- simul2$y
E <- simul2$E
cID <- simul2$cID
X <- as.matrix(simul2[, -c(1:3,9,10)])
validation <- simul2[, c("v", "m"), drop = FALSE]
 
# function call (with random intercept in both sub-models)
model <- list(riBeta = 1, riAlpha = 1, clBetaID = cID, clAlphaID = cID)
m3 <- pogitBvs(y = y, E = E, X = X, validation = validation, model = model, 
               mcmc = list(M = 6000, burnin = 200, thin = 10), BVS = TRUE)
print(m3)
summary(m3)
plot(m3)

## ------ (use cervical cancer data) ------
# load cervical cancer data
data(cervical)
data(cervical_validation)
y <- cervical$y
E <- cervical$E
X <- as.matrix(cervical[, -c(1:4)])
validation <- cervical_validation[, c(1, 2), drop = FALSE]
W          <- as.matrix(cervical_validation[, -c(1:3)])
subcat     <- factor(as.numeric(cervical$country))

# function call 
m4 <- pogitBvs(y = y, E = E, X = X, W = W, validation = validation, 
               model = list(subcat = subcat), mcmc = list(M = 10000, 
               burnin = 2000, thin = 10), start = list(firth = TRUE), 
               BVS = TRUE)             
print(m4)
# additionally compute estimated risks and reporting probabilities
summary(m4, printRes = TRUE) 
plot(m4, burnin = FALSE, thin = FALSE)
plot(m4, type = "acf", lag.max = 50)

# informative prior instead of validation data (change prior settings)
# e.g. prior information on country-specific reporting probabilities 
p.a0 <- 0.85
p.a  <- c(0.99, 0.70, 0.85)
m0a_inf <- log(p.a0/(1 - p.a0))  # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a))    # prior information for alpha

prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m5 <- pogitBvs(y = y, X = X, W = W, E = E, method = "infprior", 
               model = list(subcat = subcat), prior = prior.set, 
               mcmc = list(M = 10000, burnin = 2000, thin = 10))
print(m5)
summary(m5, printRes = TRUE)
plot(m5, burnin = FALSE, thin = FALSE)
plot(m5, type = "acf", lag.max = 50)

## End(Not run)

[Package pogit version 1.3.0 Index]