dpmj {BNSP} | R Documentation |
Fits Dirichlet process mixtures of joint response-covariate models, where the covariates are of mixed type while the discrete responses are represented utilizing continuous latent variables. See ‘Details’ section for a full model description and Papageorgiou (2018) for all technical details.
dpmj(formula, Fcdf, data, offset, sampler = "truncated", Xpred, offsetPred, StorageDir, ncomp, sweeps, burn, thin = 1, seed, H, Hdf, d, D, Alpha.xi, Beta.xi, Alpha.alpha, Beta.alpha, Trunc.alpha, ...)
formula |
a formula defining the response and the covariates e.g. |
Fcdf |
a description of the kernel of the response variable. Currently five options are supported: 1. "poisson", 2. "negative binomial", 3. "generalized poisson", 4. "binomial" and 5. "beta binomial". The first three kernels are used for count data analysis, where the third kernel allows for both over- and under-dispersion relative to the Poisson distribution. The last two kernels are used for binomial data analysis. See ‘Details’ section for some of the kernel details. |
data |
an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’. |
offset |
this can be used to specify an a priori known component to be included in the model. This should be ‘NULL’ or a numeric vector of length equal to the sample size. One ‘offset’ term can be included in the formula, and if more are required, their sum should be used. |
sampler |
the MCMC algorithm to be utilized. The two options are |
Xpred |
an optional design matrix the rows of which include the values of the covariates x for which the conditional distribution of Y|x,D (where D denotes the data) is calculated. These are treated as ‘new’ covariates i.e. they do not contribute to the likelihood. The matrix shouldn't include a column of 1's. NA's can be included to obtain averaged effects. |
offsetPred |
the offset term associated with the new covariates |
StorageDir |
a directory to store files with the posterior samples of models parameters and other quantities of interest. If a directory is not provided, files are created in the current directory and removed when the sampler completes. |
ncomp |
number of mixture components. It defines where the countable mixture of densities [in (1) below] is truncated.
Even if |
sweeps |
total number of posterior samples, including those discarded in burn-in period (see argument |
burn |
length of burn-in period. |
thin |
thinning parameter. |
seed |
optional seed for the random generator. |
H |
optional scale matrix of the Wishart-like prior assigned to the restricted covariance matrices Σ_h^*. See ‘Details’ section. |
Hdf |
optional degrees of freedom of the prior Wishart-like prior assigned to the restricted covariance matrices Σ_h^*. See ‘Details’ section. |
d |
optional prior mean of the mean vector μ_h. See ‘Details’ section. |
D |
optional prior covariance matrix of the mean vector μ_h. See ‘Details’ section. |
Alpha.xi |
an optional parameter that depends on the specified
See ‘Details’ section. |
Beta.xi |
an optional parameter that depends on the specified family.
See ‘Details’ section. |
Alpha.alpha |
optional shape parameter α_{α} of the Gamma prior assigned to the concentration parameter α. See ‘Details’ section. |
Beta.alpha |
optional rate parameter β_{α} of the Gamma prior assigned to concentration parameter α. See ‘Details’ section. |
Trunc.alpha |
optional truncation point c_{α} of the Gamma prior assigned to concentration parameter α. See ‘Details’ section. |
... |
Other options that will be ignored. |
Function dpmj
returns samples from the posterior distributions of the parameters of the model:
f(y_i,x_i) = ∑_{h=1}^{∞} π_h f(y_i,x_i|θ_h), \hspace{200pt} (1)
where y_i is a univariate discrete response, x_i is a p-dimensional vector of mixed type covariates, and π_h, h ≥q 1, are obtained according to Sethuraman's (1994) stick-breaking construction: π_1 = v_1, and for l ≥q 2, π_l = v_l ∏_{j=1}^{l-1} (1-v_j), where v_k are iid samples v_k \simBeta (1,α), k ≥q 1.
Let Z denote a discrete variable (response or covariate). It is represented as discretized version of a continuous latent variable Z^*. Observed discrete Z and continuous latent variable Z^* are connected by:
z = q \iff c_{q-1} < z^* < c_{q}, q=0,1,2,…,
where the cut-points are obtained as: c_{-1} = -∞, while for q ≥q 0, c_{q} = c_{q}(λ) = Φ^{-1}\{F(q;λ)\}. Here Φ(.) is the cumulative distribution function (cdf) of a standard normal variable and F() denotes an appropriate cdf. Further, latent variables are assumed to independently follow a N(0,1) distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Choices for F() are described next.
For counts, three options are supported. First, F(.;λ_i) can be specified as the cdf of a Poisson(H_i ξ_h) variable. Here λ_i=(ξ_h,H_i)^T, ξ_h denotes the Poisson rate associated with cluster h, and H_i the offset term associated with sampling unit i. Second, F(.;λ_i) can be specified as the negative binomial cdf, where λ_i= (ξ_{1h},ξ_{2h},H_i)^T. This option allows for overdispersion within each cluster relative to the Poisson distribution. Third, F(.;λ_i) can be specified as the Generalized Poisson cdf, where, again, λ_i=(ξ_{1h},ξ_{2h},H_i)^T. This option allows for both over- and under-dispersion within each cluster.
For Binomial data, two options are supported. First, F(.;λ_i) may be taken to be the cdf of a Binomial(H_i,ξ_h) variable, where ξ_h denotes the success probability of cluster h and H_i the number of trials associated with sampling unit i. Second, F(.;λ_i) may be specified to be the beta-binomial cdf, where λ=(ξ_{1h},ξ_{2h},H_i)^T.
The special case of Binomial data is treated as
Z = 0 \iff z^* < 0, z^* \sim N(μ_z^{*},1).
Details on all kernels are provided in the two tables below. The first table provides the probability mass functions and the mean in the presence of an offset term (which may be taken to be one). The column ‘Sample’ indicates for which parameters the routine provides posterior samples. The second table provides information on the assumed priors along with the default values of the parameters of the prior distributions and it also indicates the function arguments that allow the user to alter these.
Kernel | PMF | Offset | Mean | Sample |
Poisson | \exp(-Hξ) (Hξ)^y /y! | H | H ξ | ξ |
Negative Binomial | \frac{Γ(y+ξ_1)}{Γ(ξ_1)Γ(y+1)}(\frac{ξ_2}{H+ξ_2})^{ξ_1}(\frac{H}{H+ξ_2})^{y} | H | H ξ_1/ξ_2 | ξ_1, ξ_2 |
Generalized Poisson | ξ_1 \{ξ_1+(ξ_2-1)y\}^{y-1} ξ_2^{-y} \times | H | Hξ_1 | ξ_1,ξ_2 |
~~ \exp\{-[ξ_1+(ξ_2-1)y]/ξ_2\}/y! | ||||
Binomial | {N \choose y} ξ^y (1-ξ)^{N-y} | N | N ξ | ξ |
Beta Binomial | {N \choose y} \frac{{Beta}{(y+ξ_1,N-y+ξ_2)}}{{Beta}{(ξ_1,ξ_2)}} | N | N ξ_1/(ξ_1+ξ_2) | ξ_1,ξ_2 |
Kernel | Priors | Default Values |
Poisson | ξ \sim Gamma(α_{ξ},β_{ξ}) | Alpha.xi = 1.0, Beta.xi = 0.1 |
Negative Binomial | ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2 | Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1) |
Generalized Poisson | ξ_1 \sim Gamma(α_{ξ_1},β_{ξ_1}) | |
ξ_2 \sim N(α_{ξ_2},β_{ξ_2})I[ξ_2 > 0.05] | Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,1.0) | |
where β_{ξ_2} denotes st.dev. | ||
Binomial | ξ \sim Beta(α_{ξ},β_{ξ}) | Alpha.xi = 1.0, Beta.xi = 1.0 |
Beta Binomial | ξ_i \sim Gamma(α_{ξ_i},β_{ξ_i}), i=1,2 | Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1) |
Let z_i = (y_i,x_{i}^T)^T denote the joint vector of observed continuous and discrete variables and z_i^* the corresponding vector of continuous observed and latent variables. With θ_h denoting model parameters associated with the hth cluster, the joint density f(z_{i}|θ_h) takes the form
f(z_i|θ_h) = \int_{R(y)} \int_{R(x_{d})} N_{q}(z_i^*;μ^*_h,Σ^*_h) dx_{d}^{*} dy^{*},
where
\begin{array}{ll} μ^*_h = ≤ft( \begin{array}{l} 0 \\ μ_h \\ \end{array} \right), & Σ^*_h=≤ft[ \begin{array}{ll} C_h & ν_h^T \\ ν_h & Σ_h \\ \end{array} \right] \end{array},
where C_h is the covariance matrix of the latent continuous variables and it has diagonal elements equal to one i.e. it is a correlation matrix.
In addition to the priors defined in the table above, we specify the following:
The restricted covariance matrix Σ^*_h is assigned a prior distribution that is based on the Wishart
distribution with degrees of freedom set by default to dimension of matrix plus two and diagonal scale matrix,
with the sub-matrix that corresponds to discrete variables taken to be the identity matrix and with sub-matrix
that corresponds to continuous variables having entries equal to 1/8 of the square of
the observed data range. Default values can be changed using arguments H
and Hdf
.
The prior on μ_h, the non-zero part of μ_h^*, is taken to be multivariate normal μ_h \sim N(d,D).
The mean d is taken to be equal to the center of the dataset. The covariance matrix D is taken to be diagonal.
Its elements that correspond to continuous variables are set equal to 1/8 of the square of the observed data range while the
elements that correspond to binary variables are set equal to 5.
Arguments Mu.mu
and Sigma.mu
allow the user to change the default values.
The concentration parameter α is assigned a Gamma(α_{α},β_{α})
prior over the range (c_{α},∞), that is,
f(α) \propto α^{α_{α}-1} \exp\{-α β_{α}\} I[α > c_{α}],
where I[.] is the indicator function. The default values are α_{α}=2.0, β_{α}=5.0,
and c_{α}=0.25. Users can alter the default using using arguments Alpha.alpha
, Beta.alpha
and
Turnc.alpha
.
Function dpmj
returns the following:
call |
the matched call. |
seed |
the seed that was used (in case replication of the results is needed). |
meanReg |
if |
medianReg |
if |
q1Reg |
if |
q3Reg |
if |
modeReg |
if |
denReg |
if |
denVar |
if |
Further, function dpmj
creates files where the posterior samples are written. These files are (with all file names
preceded by ‘BNSP.’):
alpha.txt |
this file contains samples from the posterior of the concentration parameters α.
The file is arranged in |
compAlloc.txt |
this file contains the allocations to clusters obtained during posterior sampling.
It consists of |
MeanReg.txt |
this file contains the conditional means of the response y given covariates x obtained during posterior sampling.
The rows represent the |
MedianReg.txt |
this file contains the 50% conditional quantile of the response y given covariates x obtained
during posterior sampling. The rows represent the |
muh.txt |
this file contains samples from the posteriors of the p-dimensional mean vectors μ_h, h=1,2,…, |
nmembers.txt |
this file contains |
Q05Reg.txt |
this file contains the 5% conditional quantile of the response y given covariates x obtained during
posterior sampling. The rows represent the |
Q10Reg.txt |
as above, for the 10% conditional quantile. |
Q15Reg.txt |
as above, for the 15% conditional quantile. |
Q20Reg.txt |
as above, for the 20% conditional quantile. |
Q25Reg.txt |
as above, for the 25% conditional quantile. |
Q75Reg.txt |
as above, for the 75% conditional quantile. |
Q80Reg.txt |
as above, for the 80% conditional quantile. |
Q85Reg.txt |
as above, for the 85% conditional quantile. |
Q90Reg.txt |
as above, for the 90% conditional quantile. |
Q95Reg.txt |
as above, for the 95% conditional quantile. |
Sigmah.txt |
this file contains samples from the posteriors of the q \times q restricted covariance
matrices Σ_h^*, h=1,2,…, |
xih.txt |
this file contains samples from the posteriors of parameters ξ_h, h=1,2,…, |
Updated.txt |
this file contains |
Georgios Papageorgiou gpapageo@gmail.com
Consul, P. C. & Famoye, G. C. (1992). Generalized Poisson regression model. Communications in Statistics - Theory and Methods, 1992, 89-109.
Papageorgiou, G. (2018). Bayesian density regression for discrete outcomes. arXiv:1603.09706v3 [stat.ME].
Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.
Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.
#Bayesian nonparametric joint model with binomial response Y and one predictor X data(simD) pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30) npred<-length(pred) # fit1 and fit2 define the same model but with different numbers of # components and posterior samples fit1 <- dpmj(cbind(Y,(E-Y))~X, Fcdf="binomial", data=simD, ncomp=10, sweeps=20, burn=10, sampler="truncated", Xpred=pred, offsetPred=30) fit2 <- dpmj(cbind(Y,(E-Y))~X, Fcdf="binomial", data=simD, ncomp=50, sweeps=5000, burn=1000, sampler="truncated", Xpred=pred, offsetPred=30) plot(with(simD,X),with(simD,Y)/with(simD,E)) lines(pred,fit2$medianReg/30,col=3,lwd=2) # with discrete covariate simD<-data.frame(simD,Xd=sample(c(0,1),300,replace=TRUE)) pred<-c(0,1) fit3 <- dpmj(cbind(Y,(E-Y))~Xd, Fcdf="binomial", data=simD, ncomp=10, sweeps=20, burn=10, sampler="truncated", Xpred=pred, offsetPred=30)