pogitBvs {pogit} | R Documentation |
Bayesian variable selection for the Pogit model
Description
This function performs Bayesian variable selection for a Poisson-Logistic (Pogit) model via spike and slab priors. For posterior inference, a MCMC sampling scheme is used that relies on augmenting the observed data by the unobserved counts and involves only Gibbs sampling steps.
Usage
pogitBvs(
y,
E = NULL,
X,
W = NULL,
validation = NULL,
method = "val",
model = list(),
prior = list(),
mcmc = list(),
start = list(),
BVS = TRUE
)
Arguments
y |
an integer vector of observed counts for units i = 1,...,I |
E |
an (optional) vector containing total exposure times (offset);
should be |
X |
a design matrix in the Poisson part of the joint model |
W |
a design matrix in the logit part of the joint model (can be a subset
of |
validation |
a two-column data frame or list with the number of
reported cases (= |
method |
the method to be used to obtain parameter identification:
The default method " |
model |
a list specifying the structure of the model (see details) |
prior |
an (optional) list of prior settings and hyper-parameters controlling the priors (see details) |
mcmc |
an (optional) list of MCMC sampling options (see details) |
start |
an (optional) list containing starting values for the regression effects in both sub-models (see details) |
BVS |
if |
Details
The method provides Bayesian variable selection for regression
models of count data subject to under-reporting using mixture priors with a spike
and a slab component.
By augmenting the observed count data with the unobserved counts, the resulting
model can be factorized into a Poisson and a binomial logit model part. Hence,
for this two-part model, sampling algorithms for a Poisson and a binomial
logit model can be used which are described in poissonBvs
and
logitBvs
.
Bayesian variable selection is incorporated in both parts of the joint model
using mixture priors with a Dirac spike and (by default) a Student-t slab.
The implementation relies on the representation of the respective model as a
Gaussian regression model in auxiliary variables (see again the help for the
respective function). Though variable selection is primarily used to identify
regressors with a non-zero effect, it can also be useful for identification of
the Pogit model.
By default, identification of the Pogit model is achieved by additional information on the reporting process through validation data and incorporation of variable selection. If the information on the parameters of the reporting process is not provided by validation data, the identification of the model parameters has to be guaranteed by specifying an informative prior distribution (see arguments).
To model under-reported clustered data, a cluster-specific random intercept can be included in both model parts of the Pogit model to account for dependence within clusters. Bayesian variance selection is applied to determine whether there is within-cluster dependence in either part of the model. Note that an observation-specific random intercept in the Poisson sub-model yields an overdispersed Pogit model for unobserved heterogeneity.
For details concerning the sampling algorithm see Dvorzak and Wagner (2016).
Details for model specification (see arguments):
model
:-
deltaBetafix, deltaAlphafix
indicator vectors of length
ncol(X)-1
andncol(W)-1
, respectively, for the Poisson and the logit sub-model, to specify which regression effects are subject to selection (i.e., 0 = subject to selection, 1 = fix in the model); defaults to vectors of zeros.gammaBetafix, gammaAlphafix
indicators for variance selection of the random intercept term in the Poisson and the logit sub-model (i.e., 0 = with variance selection (default), 1 = no variance selection); only used if a random intercept is included in either part of the joint model (see
riBeta
andriAlpha
, respectively).riBeta, riAlpha
logical. If
TRUE
, a cluster-specific random intercept is included in the respective part of the joint model; defaults toFALSE
.clBetaID, clAlphaID
numeric vectors of length equal to the number of observations containing the cluster ID c = 1,...,C for each unit (or sub-category) in the respective sub-model (required if
riBeta=TRUE
orriAlpha=TRUE
, respectively).subcat
a factor variable of length equal to the number of units that specifies for which sub-category validation data are available (is required if
W
is a subset ofX
). IfNULL
(default), it is presumed that validation data are available for each unit (see also examples).
prior
:-
slabP, slabL
distribution of the slab component in the Poisson and logit sub-model, i.e. "
Student
" (default) or "Normal
".psi.nuP, psi.nuL
hyper-parameter of the Student-t slab in the respective sub-model (used for a Student-t slab); defaults to 5.
m0b, m0a
prior mean for the intercept parameter in the Poisson and the logit model; defaults to 0. If the argument
method
= "infprior
", the specification ofm0a
is required.M0b, M0a
prior variance for the intercept parameter in the Poisson and the logit model; defaults to 100.
bj0, aj0
a vector of prior means for the regression effects in the Poisson and the logit sub-model (which is encoded in a normal distribution, see notes); defaults to a vector of zeros. If the argument
method
= "infprior
", the specification ofaj0
is mandatory.VP, VL
variance of the slab in the respective sub-model; defaults to 5.
wBeta, wAlpha
hyper-parameters of the Beta-prior for the mixture weights
\omega_\beta
and\omega_\alpha
in the respective sub-model; defaults toc(wa0=1, wb0=1)
, i.e. a uniform distribution.piBeta, piAlpha
hyper-parameters of the Beta-prior for the mixture weights
\pi_\beta
and\pi_\alpha
in the respective sub-model; defaults toc(pa0=1, pb0=1)
, i.e. a uniform distribution.
mcmc
:-
M
number of MCMC iterations after the burn-in phase; defaults to 8000.
burnin
number of MCMC iterations discarded as burn-in; defaults to 2000.
thin
thinning parameter; defaults to 1.
startsel
number of MCMC iterations drawn from the unrestricted model (e.g.,
burnin/2
); defaults to 1000.verbose
MCMC progress report in each
verbose
-th iteration step; defaults to 500. Ifverbose=0
, no output is generated.msave
returns additional output with variable selection details (i.e. posterior samples for
\omega_\beta
,\omega_\alpha
,\delta_\beta
,\delta_\alpha
,\pi_\beta
,\pi_\alpha
,\gamma_\beta
,\gamma_\alpha
); defaults toFALSE
.
start
:-
beta
a vector of length
ncol(X)
containing starting values for the regression parameters\beta
in the Poisson model part. By default, a Poisson glm is fitted to the observed counts.alpha
a vector of length
ncol(W)
containing starting values for the regression parameters\alpha
in the logit model part. By default, a binomial glm is fitted to the validation data formethod
= "val
". Ifmethod
= "infprior
", starting values for\alpha
are sampled from the (informative) prior distribution.firth
logical. If
TRUE
, a logistic regression model applying Firth's correction to the likelihood usinglogistf
is fitted to the validation data (only used ifmethod
= "val
").
Value
The function returns an object of class "pogit
" with methods
print.pogit
, summary.pogit
and
plot.pogit
.
An object of class "pogit
" is a list containing the following elements:
samplesL |
a named list containing the samples from the posterior
distribution of the parameters in the logit part of the joint model
(see also
|
samplesP |
a named list containing the samples from the posterior
distribution of the parameters in the Poisson part of the joint model
(see also
|
data |
a list containing the data |
model.logit |
a list containing details on the model specification in
the logit sub-model, see details for |
model.pois |
a list containing details on the model specification in
the Poisson sub-model, see details for |
mcmc |
see details for |
prior.logit |
see details for |
prior.pois |
see details for |
dur |
a list containing the total runtime ( |
BVS |
see arguments |
method |
see arguments |
start |
a list containing starting values, see arguments |
family |
"pogit" |
call |
function call |
Note
If method
= "infprior
", an
informative prior for the regression parameters in the logit model is
required to guarantee identification of the model parameters.
Otherwise, identification of the Pogit model may be weak and inference
will be biased.
Author(s)
Michaela Dvorzak <m.dvorzak@gmx.at>, Helga Wagner
References
Dvorzak, M. and Wagner, H. (2016). Sparse Bayesian modelling of underreported count data. Statistical Modelling, 16(1), 24 - 46, doi:10.1177/1471082x15588398.
See Also
Examples
## Not run:
## Examples below (except for m3) should take 3-4 minutes.
## ------ (use simul1) ------
# load simulated data set 'simul1'
data(simul1)
y <- simul1$y
E <- simul1$E
X <- as.matrix(simul1[, -c(1,2,8,9)]) # W = X
validation <- simul1[, c("m", "v"), drop = FALSE]
# function call (with specific MCMC settings)
m1 <- pogitBvs(y = y, E = E, X = X, validation = validation,
mcmc = list(M = 4000, thin = 5, verbose = 1000))
# print, summarize and plot results
print(m1)
summary(m1)
plot(m1)
# show traceplots disregarding burn-in and thinning
plot(m1, burnin = FALSE, thin = FALSE)
# show density plot of MCMC draws
plot(m1, type = "density")
# informative prior instead of validation data (change prior settings)
# e.g. available prior information on reporting probabilities
p.a0 <- 0.9
p.a <- c(0.125, 0.5, 0.5, 0.5)
m0a_inf <- log(p.a0/(1 - p.a0)) # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a)) # prior information for alpha
prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m2 <- pogitBvs(y = y, E = E, X = X, method = "infprior", prior = prior.set,
mcmc = list(M = 4000, burnin = 2000, thin = 2), BVS = FALSE)
print(m2)
summary(m2)
plot(m2)
plot(m2, type = "acf", lag.max = 30)
## ------ (use simul2) ------
# complex model (with a long (!) runtime)
# load simulated data set 'simul2'
data(simul2)
y <- simul2$y
E <- simul2$E
cID <- simul2$cID
X <- as.matrix(simul2[, -c(1:3,9,10)])
validation <- simul2[, c("v", "m"), drop = FALSE]
# function call (with random intercept in both sub-models)
model <- list(riBeta = 1, riAlpha = 1, clBetaID = cID, clAlphaID = cID)
m3 <- pogitBvs(y = y, E = E, X = X, validation = validation, model = model,
mcmc = list(M = 6000, burnin = 200, thin = 10), BVS = TRUE)
print(m3)
summary(m3)
plot(m3)
## ------ (use cervical cancer data) ------
# load cervical cancer data
data(cervical)
data(cervical_validation)
y <- cervical$y
E <- cervical$E
X <- as.matrix(cervical[, -c(1:4)])
validation <- cervical_validation[, c(1, 2), drop = FALSE]
W <- as.matrix(cervical_validation[, -c(1:3)])
subcat <- factor(as.numeric(cervical$country))
# function call
m4 <- pogitBvs(y = y, E = E, X = X, W = W, validation = validation,
model = list(subcat = subcat), mcmc = list(M = 10000,
burnin = 2000, thin = 10), start = list(firth = TRUE),
BVS = TRUE)
print(m4)
# additionally compute estimated risks and reporting probabilities
summary(m4, printRes = TRUE)
plot(m4, burnin = FALSE, thin = FALSE)
plot(m4, type = "acf", lag.max = 50)
# informative prior instead of validation data (change prior settings)
# e.g. prior information on country-specific reporting probabilities
p.a0 <- 0.85
p.a <- c(0.99, 0.70, 0.85)
m0a_inf <- log(p.a0/(1 - p.a0)) # prior information for alpha_0
aj0_inf <- log(p.a/(1 - p.a)) # prior information for alpha
prior.set <- list(m0a = m0a_inf, aj0 = aj0_inf, VL = 0.005, slabL = "Normal")
m5 <- pogitBvs(y = y, X = X, W = W, E = E, method = "infprior",
model = list(subcat = subcat), prior = prior.set,
mcmc = list(M = 10000, burnin = 2000, thin = 10))
print(m5)
summary(m5, printRes = TRUE)
plot(m5, burnin = FALSE, thin = FALSE)
plot(m5, type = "acf", lag.max = 50)
## End(Not run)