pogitBvs {pogit}R Documentation

Bayesian variable selection for the Pogit model

Description

This function performs Bayesian variable selection for a Poisson-Logistic (Pogit) model via spike and slab priors. For posterior inference, a MCMC sampling scheme is used that relies on augmenting the observed data by the unobserved counts and involves only Gibbs sampling steps.

Usage

pogitBvs(
  y,
  E = NULL,
  X,
  W = NULL,
  validation = NULL,
  method = "val",
  model = list(),
  prior = list(),
  mcmc = list(),
  start = list(),
  BVS = TRUE
)

Arguments

y

an integer vector of observed counts for units i = 1,...,I

E

an (optional) vector containing total exposure times (offset); should be NULL or an integer vector of length equal to the number of counts.

X

a design matrix in the Poisson part of the joint model

W

a design matrix in the logit part of the joint model (can be a subset of X) or NULL, if the same design matrix is used in both sub-models, i.e. W = X.

validation

a two-column data frame or list with the number of reported cases (= v) in the validation sample and the number of total cases (= m) subject to the fallible reporting process (i.e. validation sample size) for each unit (or sub-category); required if method = "val", otherwise NULL. The number of rows must conform with the number of rows in W or with the number of units I (if X = W), respectively.

method

the method to be used to obtain parameter identification: The default method "val" requires a small sample of validation data (see validation). If the information on all or some parameters of the reporting process is not provided by validation data, an informative prior distribution for the regression effects in the logit sub-model can be used (method = "infprior"). This prior information is encoded in a normal distribution instead of the spike and slab prior (see the details for prior).

model

a list specifying the structure of the model (see details)

prior

an (optional) list of prior settings and hyper-parameters controlling the priors (see details)

mcmc

an (optional) list of MCMC sampling options (see details)

start

an (optional) list containing starting values for the regression effects in both sub-models (see details)

BVS

if TRUE (default), Bayesian variable selection (in at least one part of the joint model) is performed to identify regressors with a non-zero effect; otherwise, an unrestricted model is estimated (without variable selection).

Details

The method provides Bayesian variable selection for regression models of count data subject to under-reporting using mixture priors with a spike and a slab component. By augmenting the observed count data with the unobserved counts, the resulting model can be factorized into a Poisson and a binomial logit model part. Hence, for this two-part model, sampling algorithms for a Poisson and a binomial logit model can be used which are described in poissonBvs and logitBvs. Bayesian variable selection is incorporated in both parts of the joint model using mixture priors with a Dirac spike and (by default) a Student-t slab. The implementation relies on the representation of the respective model as a Gaussian regression model in auxiliary variables (see again the help for the respective function). Though variable selection is primarily used to identify regressors with a non-zero effect, it can also be useful for identification of the Pogit model.

By default, identification of the Pogit model is achieved by additional information on the reporting process through validation data and incorporation of variable selection. If the information on the parameters of the reporting process is not provided by validation data, the identification of the model parameters has to be guaranteed by specifying an informative prior distribution (see arguments).

To model under-reported clustered data, a cluster-specific random intercept can be included in both model parts of the Pogit model to account for dependence within clusters. Bayesian variance selection is applied to determine whether there is within-cluster dependence in either part of the model. Note that an observation-specific random intercept in the Poisson sub-model yields an overdispersed Pogit model for unobserved heterogeneity.

For details concerning the sampling algorithm see Dvorzak and Wagner (2016).

Details for model specification (see arguments):

model:
deltaBetafix, deltaAlphafix

indicator vectors of length ncol(X)-1 and ncol(W)-1, respectively, for the Poisson and the logit sub-model, to specify which regression effects are subject to selection (i.e., 0 = subject to selection, 1 = fix in the model); defaults to vectors of zeros.

gammaBetafix, gammaAlphafix

indicators for variance selection of the random intercept term in the Poisson and the logit sub-model (i.e., 0 = with variance selection (default), 1 = no variance selection); only used if a random intercept is included in either part of the joint model (see riBeta and riAlpha, respectively).

riBeta, riAlpha

logical. If TRUE, a cluster-specific random intercept is included in the respective part of the joint model; defaults to FALSE.

clBetaID, clAlphaID

numeric vectors of length equal to the number of observations containing the cluster ID c = 1,...,C for each unit (or sub-category) in the respective sub-model (required if riBeta=TRUE or riAlpha=TRUE, respectively).

subcat

a factor variable of length equal to the number of units that specifies for which sub-category validation data are available (is required if W is a subset of X). If NULL (default), it is presumed that validation data are available for each unit (see also examples).

prior:
slabP, slabL

distribution of the slab component in the Poisson and logit sub-model, i.e. "Student" (default) or "Normal".

psi.nuP, psi.nuL

hyper-parameter of the Student-t slab in the respective sub-model (used for a Student-t slab); defaults to 5.

m0b, m0a

prior mean for the intercept parameter in the Poisson and the logit model; defaults to 0. If the argument method = "infprior", the specification of m0a is required.

M0b, M0a

prior variance for the intercept parameter in the Poisson and the logit model; defaults to 100.

bj0, aj0

a vector of prior means for the regression effects in the Poisson and the logit sub-model (which is encoded in a normal distribution, see notes); defaults to a vector of zeros. If the argument method = "infprior", the specification of aj0 is mandatory.

VP, VL

variance of the slab in the respective sub-model; defaults to 5.

wBeta, wAlpha

hyper-parameters of the Beta-prior for the mixture weights \omega_\beta and \omega_\alpha in the respective sub-model; defaults to c(wa0=1, wb0=1), i.e. a uniform distribution.

piBeta, piAlpha

hyper-parameters of the Beta-prior for the mixture weights \pi_\beta and \pi_\alpha in the respective sub-model; defaults to c(pa0=1, pb0=1), i.e. a uniform distribution.

mcmc:
M

number of MCMC iterations after the burn-in phase; defaults to 8000.

burnin

number of MCMC iterations discarded as burn-in; defaults to 2000.

thin

thinning parameter; defaults to 1.

startsel

number of MCMC iterations drawn from the unrestricted model (e.g., burnin/2); defaults to 1000.

verbose

MCMC progress report in each verbose-th iteration step; defaults to 500. If verbose=0, no output is generated.

msave

returns additional output with variable selection details (i.e. posterior samples for \omega_\beta, \omega_\alpha, \delta_\beta, \delta_\alpha, \pi_\beta, \pi_\alpha, \gamma_\beta, \gamma_\alpha); defaults to FALSE.

start:
beta

a vector of length ncol(X) containing starting values for the regression parameters \beta in the Poisson model part. By default, a Poisson glm is fitted to the observed counts.

alpha

a vector of length ncol(W) containing starting values for the regression parameters \alpha in the logit model part. By default, a binomial glm is fitted to the validation data for method = "val". If method = "infprior", starting values for \alpha are sampled from the (informative) prior distribution.

firth

logical. If TRUE, a logistic regression model applying Firth's correction to the likelihood using logistf is fitted to the validation data (only used if method = "val").

Value

The function returns an object of class "pogit" with methods print.pogit, summary.pogit and plot.pogit.

An object of class "pogit" is a list containing the following elements:

samplesL

a named list containing the samples from the posterior distribution of the parameters in the logit part of the joint model (see also msave):

alpha, thetaAlpha

regression coefficients \alpha and \theta_\alpha

pdeltaAlpha

P(\delta_\alpha=1)

psiAlpha

scale parameter \psi_\alpha of the slab component

pgammaAlpha

P(\gamma_\alpha=1)

ai

cluster-specific random intercept

samplesP

a named list containing the samples from the posterior distribution of the parameters in the Poisson part of the joint model (see also msave):

beta, thetaBeta

regression coefficients \beta and \theta_\beta

pdeltaBeta

P(\delta_\beta=1)

psiBeta

scale parameter \psi_\beta of the slab component

pgammaBeta

P(\gamma_\beta=1)

bi

cluster-specific random intercept

data

a list containing the data y, E, X, W, val and subcat

model.logit

a list containing details on the model specification in the logit sub-model, see details for model

model.pois

a list containing details on the model specification in the Poisson sub-model, see details for model

mcmc

see details for mcmc

prior.logit

see details for prior

prior.pois

see details for prior

dur

a list containing the total runtime (total) and the runtime after burn-in (durM), in seconds

BVS

see arguments

method

see arguments

start

a list containing starting values, see arguments

family

"pogit"

call

function call

Note

If method = "infprior", an informative prior for the regression parameters in the logit model is required to guarantee identification of the model parameters. Otherwise, identification of the Pogit model may be weak and inference will be biased.

Author(s)

Michaela Dvorzak <m.dvorzak@gmx.at>, Helga Wagner

References

Dvorzak, M. and Wagner, H. (2016). Sparse Bayesian modelling of underreported count data. Statistical Modelling, 16(1), 24 - 46, doi:10.1177/1471082x15588398.

See Also

logitBvs, poissonBvs

Examples

## Not run: 
## Examples below (except for m3) should take 3-4 minutes. 

## ------ (use simul1) ------
# load simulated data set 'simul1'
data(simul1)
y <- simul1$y
E <- simul1$E
X <- as.matrix(simul1[, -c(1,2,8,9)]) # W = X
validation <- simul1[, c("m", "v"), drop = FALSE]

# function call (with specific MCMC settings)
m1 <- pogitBvs(y = y, E = E, X = X, validation = validation, 
               mcmc = list(M = 4000, thin = 5, verbose = 1000))

# print, summarize and plot results
print(m1)
summary(m1)
plot(m1)

# show traceplots disregarding burn-in and thinning
plot(m1, burnin = FALSE, thin = FALSE)
# show density plot of MCMC draws
plot(m1, type = "density")

# informative prior instead of validation data (change prior settings)
# e.g. available prior information on reporting probabilities 
p.a0 <- 0.9 
p.a  <- c(0.125, 0.5, 0.5, 0.5)
m0a_inf <- log(p.a0/(1 - p.a0))  # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a))    # prior information for alpha

prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m2 <- pogitBvs(y = y, E = E, X = X, method = "infprior", prior = prior.set, 
               mcmc = list(M = 4000, burnin = 2000, thin = 2), BVS = FALSE)
print(m2)
summary(m2)
plot(m2)
plot(m2, type = "acf", lag.max = 30)

## ------ (use simul2) ------
# complex model (with a long (!) runtime)

# load simulated data set 'simul2'
data(simul2)
y <- simul2$y
E <- simul2$E
cID <- simul2$cID
X <- as.matrix(simul2[, -c(1:3,9,10)])
validation <- simul2[, c("v", "m"), drop = FALSE]
 
# function call (with random intercept in both sub-models)
model <- list(riBeta = 1, riAlpha = 1, clBetaID = cID, clAlphaID = cID)
m3 <- pogitBvs(y = y, E = E, X = X, validation = validation, model = model, 
               mcmc = list(M = 6000, burnin = 200, thin = 10), BVS = TRUE)
print(m3)
summary(m3)
plot(m3)

## ------ (use cervical cancer data) ------
# load cervical cancer data
data(cervical)
data(cervical_validation)
y <- cervical$y
E <- cervical$E
X <- as.matrix(cervical[, -c(1:4)])
validation <- cervical_validation[, c(1, 2), drop = FALSE]
W          <- as.matrix(cervical_validation[, -c(1:3)])
subcat     <- factor(as.numeric(cervical$country))

# function call 
m4 <- pogitBvs(y = y, E = E, X = X, W = W, validation = validation, 
               model = list(subcat = subcat), mcmc = list(M = 10000, 
               burnin = 2000, thin = 10), start = list(firth = TRUE), 
               BVS = TRUE)             
print(m4)
# additionally compute estimated risks and reporting probabilities
summary(m4, printRes = TRUE) 
plot(m4, burnin = FALSE, thin = FALSE)
plot(m4, type = "acf", lag.max = 50)

# informative prior instead of validation data (change prior settings)
# e.g. prior information on country-specific reporting probabilities 
p.a0 <- 0.85
p.a  <- c(0.99, 0.70, 0.85)
m0a_inf <- log(p.a0/(1 - p.a0))  # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a))    # prior information for alpha

prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m5 <- pogitBvs(y = y, X = X, W = W, E = E, method = "infprior", 
               model = list(subcat = subcat), prior = prior.set, 
               mcmc = list(M = 10000, burnin = 2000, thin = 10))
print(m5)
summary(m5, printRes = TRUE)
plot(m5, burnin = FALSE, thin = FALSE)
plot(m5, type = "acf", lag.max = 50)

## End(Not run)

[Package pogit version 1.3.0 Index]