dpmj {BNSP}R Documentation

Dirichlet process mixtures of joint models

Description

Fits Dirichlet process mixtures of joint response-covariate models, where the covariates are of mixed type while the discrete responses are represented utilizing continuous latent variables. See ‘Details’ section for a full model description and Papageorgiou (2018) for all technical details.

Usage

dpmj(formula, Fcdf, data, offset, sampler = "truncated", Xpred, offsetPred,
     StorageDir, ncomp, sweeps, burn, thin = 1, seed, H, Hdf, d, D,
     Alpha.xi, Beta.xi, Alpha.alpha, Beta.alpha, Trunc.alpha, ...)

Arguments

formula

a formula defining the response and the covariates e.g. y ~ x.

Fcdf

a description of the kernel of the response variable. Currently five options are supported: 1. "poisson", 2. "negative binomial", 3. "generalized poisson", 4. "binomial" and 5. "beta binomial". The first three kernels are used for count data analysis, where the third kernel allows for both over- and under-dispersion relative to the Poisson distribution. The last two kernels are used for binomial data analysis. See ‘Details’ section for some of the kernel details.

data

an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.

offset

this can be used to specify an a priori known component to be included in the model. This should be ‘NULL’ or a numeric vector of length equal to the sample size. One ‘offset’ term can be included in the formula, and if more are required, their sum should be used.

sampler

the MCMC algorithm to be utilized. The two options are sampler = "slice" which implements a slice sampler (Walker, 2007; Papaspiliopoulos, 2008) and sampler = "truncated" which proceeds by truncating the countable mixture at ncomp components (see argument ncomp).

Xpred

an optional design matrix the rows of which include the values of the covariates xx for which the conditional distribution of Yx,DY|x,D (where DD denotes the data) is calculated. These are treated as ‘new’ covariates i.e. they do not contribute to the likelihood. The matrix shouldn't include a column of 1's. NA's can be included to obtain averaged effects.

offsetPred

the offset term associated with the new covariates Xpred. It is of dimension one i.e. the same offset term is used for all rows of Xpred. If Fcdf is one of "poisson" or "negative binomial" or "generalized poisson", then offsetPred is the Poisson offset term. If Fcdf is one of "binomial" or "beta binomial", then offsetPred is the number of Binomial trials. If offsetPred is missing, it is taken to be the mean of offset, rounded to the nearest integer.

StorageDir

a directory to store files with the posterior samples of models parameters and other quantities of interest. If a directory is not provided, files are created in the current directory and removed when the sampler completes.

ncomp

number of mixture components. It defines where the countable mixture of densities [in (1) below] is truncated. Even if sampler="slice" is chosen, ncomp needs to be specified as it is used in the initialization process.

sweeps

total number of posterior samples, including those discarded in burn-in period (see argument burn) and those discarded by the thinning process (see argument thin).

burn

length of burn-in period.

thin

thinning parameter.

seed

optional seed for the random generator.

H

optional scale matrix of the Wishart-like prior assigned to the restricted covariance matrices Σh\Sigma_h^*. See ‘Details’ section.

Hdf

optional degrees of freedom of the prior Wishart-like prior assigned to the restricted covariance matrices Σh\Sigma_h^*. See ‘Details’ section.

d

optional prior mean of the mean vector μh\mu_h. See ‘Details’ section.

D

optional prior covariance matrix of the mean vector μh\mu_h. See ‘Details’ section.

Alpha.xi

an optional parameter that depends on the specified Fcdf argument.

  1. If Fcdf = "poisson", this argument is parameter αξ\alpha_{\xi} of the prior of the Poisson rate: ξ\xi \sim Gamma(αξ,βξ\alpha_{\xi},\beta_{\xi}).

  2. If Fcdf = "negative binomial", this argument is a two-dimensional vector that includes parameters α1ξ\alpha_{1\xi} and α2ξ\alpha_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim Gamma(α2ξ,β2ξ\alpha_{2\xi},\beta_{2\xi}), where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Negative Binomial pmf.

  3. If Fcdf = "generalized poisson", this argument is a two-dimensional vector that includes parameters α1ξ\alpha_{1\xi} and α2ξ\alpha_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim N(α2ξ,β2ξ)I[ξ2Rξ2]\alpha_{2\xi},\beta_{2\xi})I[\xi_2 \in R_{\xi_2}], where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Generalized Poisson pmf. Parameter ξ2\xi_2 is restricted in the range Rξ2=(0.05,)R_{\xi_2} = (0.05,\infty) as it is a dispersion parameter.

  4. If Fcdf = "binomial", this argument is parameter αξ\alpha_{\xi} of the prior of the Binomial probability: ξ\xi \sim Beta(αξ,βξ\alpha_{\xi},\beta_{\xi}).

  5. If Fcdf = "beta binomial", this argument is a two-dimensional vector that includes parameters α1ξ\alpha_{1\xi} and α2ξ\alpha_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim Gamma(α2ξ,β2ξ\alpha_{2\xi},\beta_{2\xi}), where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Beta Binomial pmf.

See ‘Details’ section.

Beta.xi

an optional parameter that depends on the specified family.

  1. If Fcdf = "poisson", this argument is parameter βξ\beta_{\xi} of the prior of the Poisson rate: ξ\xi \sim Gamma(αξ,βξ\alpha_{\xi},\beta_{\xi}).

  2. If Fcdf = "negative binomial", this argument is a two-dimensional vector that includes parameters β1ξ\beta_{1\xi} and β2ξ\beta_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim Gamma(α2ξ,β2ξ\alpha_{2\xi},\beta_{2\xi}), where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Negative Binomial pmf.

  3. If Fcdf = "generalized poisson", this argument is a two-dimensional vector that includes parameters β1ξ\beta_{1\xi} and β2ξ\beta_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim Normal(α2ξ,β2ξ)I[ξ2Rξ2]\alpha_{2\xi},\beta_{2\xi})I[\xi_2 \in R_{\xi_2}], where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Generalized Poisson pmf. Parameter ξ2\xi_2 is restricted in the range Rξ2=(0.05,)R_{\xi_2} = (0.05,\infty) as it is a dispersion parameter. Note that β2ξ\beta_{2\xi} is a standard deviation.

  4. If Fcdf = "binomial", this argument is parameter βξ\beta_{\xi} of the prior of the Binomial probability: ξ\xi \sim Beta(αξ,βξ\alpha_{\xi},\beta_{\xi}).

  5. If Fcdf = "beta binomial", this argument is a two-dimensional vector that includes parameters β1ξ\beta_{1\xi} and β2ξ\beta_{2\xi} of the priors: ξ1\xi_1 \sim Gamma(α1ξ,β1ξ\alpha_{1\xi},\beta_{1\xi}) and ξ2\xi_2 \sim Gamma(α2ξ,β2ξ\alpha_{2\xi},\beta_{2\xi}), where ξ1\xi_1 and ξ2\xi_2 are the two parameters of the Beta Binomial pmf.

See ‘Details’ section.

Alpha.alpha

optional shape parameter αα\alpha_{\alpha} of the Gamma prior assigned to the concentration parameter α\alpha. See ‘Details’ section.

Beta.alpha

optional rate parameter βα\beta_{\alpha} of the Gamma prior assigned to concentration parameter α\alpha. See ‘Details’ section.

Trunc.alpha

optional truncation point cαc_{\alpha} of the Gamma prior assigned to concentration parameter α\alpha. See ‘Details’ section.

...

Other options that will be ignored.

Details

Function dpmj returns samples from the posterior distributions of the parameters of the model:

f(yi,xi)=h=1πhf(yi,xiθh),(1) f(y_i,x_i) = \sum_{h=1}^{\infty} \pi_h f(y_i,x_i|\theta_h), \hspace{200pt} (1)

where yiy_i is a univariate discrete response, xix_i is a pp-dimensional vector of mixed type covariates, and πh,h1,\pi_h, h \geq 1, are obtained according to Sethuraman's (1994) stick-breaking construction: π1=v1\pi_1 = v_1, and for l2,πl=vlj=1l1(1vj)l \geq 2, \pi_l = v_l \prod_{j=1}^{l-1} (1-v_j), where vkv_k are iid samples vkv_k \simBeta (1,α),k1.(1,\alpha), k \geq 1.

Let ZZ denote a discrete variable (response or covariate). It is represented as discretized version of a continuous latent variable ZZ^*. Observed discrete ZZ and continuous latent variable ZZ^* are connected by:

z=q    cq1<z<cq,q=0,1,2,, z = q \iff c_{q-1} < z^* < c_{q}, q=0,1,2,\dots,

where the cut-points are obtained as: c1=c_{-1} = -\infty, while for q0q \geq 0, cq=cq(λ)=Φ1{F(q;λ)}.c_{q} = c_{q}(\lambda) = \Phi^{-1}\{F(q;\lambda)\}. Here Φ(.)\Phi(.) is the cumulative distribution function (cdf) of a standard normal variable and F()F() denotes an appropriate cdf. Further, latent variables are assumed to independently follow a N(0,1)N(0,1) distribution, where the mean and variance are restricted to be zero and one as they are non-identifiable by the data. Choices for F()F() are described next.

For counts, three options are supported. First, F(.;λi)F(.;\lambda_i) can be specified as the cdf of a Poisson(Hiξh)(H_i \xi_h) variable. Here λi=(ξh,Hi)T,ξh\lambda_i=(\xi_h,H_i)^T, \xi_h denotes the Poisson rate associated with cluster hh, and HiH_i the offset term associated with sampling unit ii. Second, F(.;λi)F(.;\lambda_i) can be specified as the negative binomial cdf, where λi=(ξ1h,ξ2h,Hi)T\lambda_i= (\xi_{1h},\xi_{2h},H_i)^T. This option allows for overdispersion within each cluster relative to the Poisson distribution. Third, F(.;λi)F(.;\lambda_i) can be specified as the Generalized Poisson cdf, where, again, λi=(ξ1h,ξ2h,Hi)T\lambda_i=(\xi_{1h},\xi_{2h},H_i)^T. This option allows for both over- and under-dispersion within each cluster.

For Binomial data, two options are supported. First, F(.;λi)F(.;\lambda_i) may be taken to be the cdf of a Binomial(Hi,ξh)(H_i,\xi_h) variable, where ξh\xi_h denotes the success probability of cluster hh and HiH_i the number of trials associated with sampling unit ii. Second, F(.;λi)F(.;\lambda_i) may be specified to be the beta-binomial cdf, where λ=(ξ1h,ξ2h,Hi)T\lambda=(\xi_{1h},\xi_{2h},H_i)^T.

The special case of Binomial data is treated as

Z=0    z<0,zN(μz,1). Z = 0 \iff z^* < 0, z^* \sim N(\mu_z^{*},1).

Details on all kernels are provided in the two tables below. The first table provides the probability mass functions and the mean in the presence of an offset term (which may be taken to be one). The column ‘Sample’ indicates for which parameters the routine provides posterior samples. The second table provides information on the assumed priors along with the default values of the parameters of the prior distributions and it also indicates the function arguments that allow the user to alter these.

Kernel PMF Offset Mean Sample
Poisson exp(Hξ)(Hξ)y/y!\exp(-H\xi) (H\xi)^y /y! HH HξH \xi ξ\xi
Negative Binomial Γ(y+ξ1)Γ(ξ1)Γ(y+1)(ξ2H+ξ2)ξ1(HH+ξ2)y\frac{\Gamma(y+\xi_1)}{\Gamma(\xi_1)\Gamma(y+1)}(\frac{\xi_2}{H+\xi_2})^{\xi_1}(\frac{H}{H+\xi_2})^{y} HH Hξ1/ξ2H \xi_1/\xi_2 ξ1,ξ2\xi_1, \xi_2
Generalized Poisson ξ1{ξ1+(ξ21)y}y1ξ2y×\xi_1 \{\xi_1+(\xi_2-1)y\}^{y-1} \xi_2^{-y} \times HH Hξ1H\xi_1 ξ1,ξ2\xi_1,\xi_2
  exp{[ξ1+(ξ21)y]/ξ2}/y! ~~ \exp\{-[\xi_1+(\xi_2-1)y]/\xi_2\}/y!
Binomial (Ny)ξy(1ξ)Ny{N \choose y} \xi^y (1-\xi)^{N-y} NN NξN \xi ξ\xi
Beta Binomial (Ny)Beta(y+ξ1,Ny+ξ2)Beta(ξ1,ξ2){N \choose y} \frac{{Beta}{(y+\xi_1,N-y+\xi_2)}}{{Beta}{(\xi_1,\xi_2)}} NN Nξ1/(ξ1+ξ2)N \xi_1/(\xi_1+\xi_2) ξ1,ξ2\xi_1,\xi_2
Kernel Priors Default Values
Poisson ξ\xi \sim Gamma(αξ,βξ)(\alpha_{\xi},\beta_{\xi}) Alpha.xi = 1.0, Beta.xi = 0.1
Negative Binomial ξi\xi_i \sim Gamma(αξi,βξi),i=1,2(\alpha_{\xi_i},\beta_{\xi_i}), i=1,2 Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1)
Generalized Poisson ξ1\xi_1 \sim Gamma(αξ1,βξ1)(\alpha_{\xi_1},\beta_{\xi_1})
ξ2\xi_2 \sim N(αξ2,βξ2)I[ξ2>0.05](\alpha_{\xi_2},\beta_{\xi_2})I[\xi_2 > 0.05] Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,1.0)
where βξ2\beta_{\xi_2} denotes st.dev.
Binomial ξ\xi \sim Beta(αξ,βξ)(\alpha_{\xi},\beta_{\xi}) Alpha.xi = 1.0, Beta.xi = 1.0
Beta Binomial ξi\xi_i \sim Gamma(αξi,βξi),i=1,2(\alpha_{\xi_i},\beta_{\xi_i}), i=1,2 Alpha.xi = c(1.0,1.0), Beta.xi = c(0.1,0.1)

Let zi=(yi,xiT)Tz_i = (y_i,x_{i}^T)^T denote the joint vector of observed continuous and discrete variables and ziz_i^* the corresponding vector of continuous observed and latent variables. With θh\theta_h denoting model parameters associated with the hhth cluster, the joint density f(ziθh)f(z_{i}|\theta_h) takes the form

f(ziθh)=R(y)R(xd)Nq(zi;μh,Σh)dxddy, f(z_i|\theta_h) = \int_{R(y)} \int_{R(x_{d})} N_{q}(z_i^*;\mu^*_h,\Sigma^*_h) dx_{d}^{*} dy^{*},

where

μh=(0μh),Σh=[ChνhTνhΣh], \begin{array}{ll} \mu^*_h = \left( \begin{array}{l} 0 \\ \mu_h \\ \end{array} \right), & \Sigma^*_h=\left[ \begin{array}{ll} C_h & \nu_h^T \\ \nu_h & \Sigma_h \\ \end{array} \right] \end{array},

where ChC_h is the covariance matrix of the latent continuous variables and it has diagonal elements equal to one i.e. it is a correlation matrix.

In addition to the priors defined in the table above, we specify the following:

  1. The restricted covariance matrix Σh\Sigma^*_h is assigned a prior distribution that is based on the Wishart distribution with degrees of freedom set by default to dimension of matrix plus two and diagonal scale matrix, with the sub-matrix that corresponds to discrete variables taken to be the identity matrix and with sub-matrix that corresponds to continuous variables having entries equal to 1/8 of the square of the observed data range. Default values can be changed using arguments H and Hdf.

  2. The prior on μh\mu_h, the non-zero part of μh\mu_h^*, is taken to be multivariate normal μhN(d,D)\mu_h \sim N(d,D). The mean dd is taken to be equal to the center of the dataset. The covariance matrix DD is taken to be diagonal. Its elements that correspond to continuous variables are set equal to 1/8 of the square of the observed data range while the elements that correspond to binary variables are set equal to 5. Arguments Mu.mu and Sigma.mu allow the user to change the default values.

  3. The concentration parameter α\alpha is assigned a Gamma(αα,βα)(\alpha_{\alpha},\beta_{\alpha}) prior over the range (cα,)(c_{\alpha},\infty), that is, f(α)ααα1exp{αβα}I[α>cα]f(\alpha) \propto \alpha^{\alpha_{\alpha}-1} \exp\{-\alpha \beta_{\alpha}\} I[\alpha > c_{\alpha}], where I[.]I[.] is the indicator function. The default values are αα=2.0,βα=5.0\alpha_{\alpha}=2.0, \beta_{\alpha}=5.0, and cα=0.25c_{\alpha}=0.25. Users can alter the default using using arguments Alpha.alpha, Beta.alpha and Turnc.alpha.

Value

Function dpmj returns the following:

call

the matched call.

seed

the seed that was used (in case replication of the results is needed).

meanReg

if Xpred is specified, the function returns the posterior mean of the conditional expectation of the response yy given each new covariate xx.

medianReg

if Xpred is specified, the function returns the posterior mean of the conditional 50% quantile of the response yy given each new covariate xx.

q1Reg

if Xpred is specified, the function returns the posterior mean of the conditional 25% quantile of the response yy given each new covariate xx.

q3Reg

if Xpred is specified, the function returns the posterior mean of the conditional 75% quantile of the response yy given each new covariate xx.

modeReg

if Xpred is specified, the function returns the posterior mean of the conditional mode of the response yy given each new covariate xx.

denReg

if Xpred is specified, the function returns the posterior mean conditional density of the response yy given each new covariate xx. Results are presented in a matrix the rows of which correspond to the different xxs.

denVar

if Xpred is specified, the function returns the posterior variance of the conditional density of the response yy given each new covariate xx. Results are presented in a matrix the rows of which correspond to the different xxs.

Further, function dpmj creates files where the posterior samples are written. These files are (with all file names preceded by ‘BNSP.’):

alpha.txt

this file contains samples from the posterior of the concentration parameters α\alpha. The file is arranged in (sweeps-burn)/thin lines and one column, each line including one posterior sample.

compAlloc.txt

this file contains the allocations to clusters obtained during posterior sampling. It consists of (sweeps-burn)/thin lines, that represent the posterior samples, and nn columns, that represent the sampling units. Clusters are represented by integers ranging from 0 to ncomp-1.

MeanReg.txt

this file contains the conditional means of the response yy given covariates xx obtained during posterior sampling. The rows represent the (sweeps-burn)/thin posterior samples. The columns represent the various covariate values xx for which the means are obtained.

MedianReg.txt

this file contains the 50% conditional quantile of the response yy given covariates xx obtained during posterior sampling. The rows represent the (sweeps-burn)/thin posterior samples. The columns represent the various covariate values xx for which the medians are obtained.

muh.txt

this file contains samples from the posteriors of the pp-dimensional mean vectors μh,h=1,2,\mu_h, h=1,2,\dots,ncomp. The file is arranged in ((sweeps-burn)/thin)*ncomp lines and pp columns. In more detail, sweeps create ncomp lines representing samples μh(sw),h=1,,\mu_h^{(sw)}, h=1,\dots,ncomp, where superscript swsw represents a particular sweep. The elements of μh(sw)\mu_h^{(sw)} are written in the columns of the file.

nmembers.txt

this file contains (sweeps-burn)/thin lines and ncomp columns, where the lines represent posterior samples while the columns represent the components or clusters. The entries represent the number of sampling units allocated to each component.

Q05Reg.txt

this file contains the 5% conditional quantile of the response yy given covariates xx obtained during posterior sampling. The rows represent the (sweeps-burn)/thin posterior samples. The columns represent the various covariate values xx for which the quantiles are obtained.

Q10Reg.txt

as above, for the 10% conditional quantile.

Q15Reg.txt

as above, for the 15% conditional quantile.

Q20Reg.txt

as above, for the 20% conditional quantile.

Q25Reg.txt

as above, for the 25% conditional quantile.

Q75Reg.txt

as above, for the 75% conditional quantile.

Q80Reg.txt

as above, for the 80% conditional quantile.

Q85Reg.txt

as above, for the 85% conditional quantile.

Q90Reg.txt

as above, for the 90% conditional quantile.

Q95Reg.txt

as above, for the 95% conditional quantile.

Sigmah.txt

this file contains samples from the posteriors of the q×qq \times q restricted covariance matrices Σh,h=1,2,,\Sigma_h^*, h=1,2,\dots,ncomp. The file is arranged in ((sweeps-burn)/thin)*ncomp lines and q2q^2 columns. In more detail, sweeps create ncomp lines representing samples Σh(sw),h=1,,\Sigma_h^{(sw)}, h=1,\dots,ncomp, where superscript swsw represents a particular sweep. The elements of Σh(sw)\Sigma_h^{(sw)} are written in the columns of the file.

xih.txt

this file contains samples from the posteriors of parameters ξh\xi_h, h=1,2,,h=1,2,\dots,ncomp. The file is arranged in ((sweeps-burn)/thin)*ncomp lines and one or two columns, depending on the number of parameters in the selected Fcdf. Sweeps write in the file ncomp lines representing samples ξh(sw),h=1,,\xi_h^{(sw)}, h=1,\dots,ncomp, where superscript swsw represents a particular sweep.

Updated.txt

this file contains (sweeps-burn)/thin lines with the number of components updated at each iteration of the sampler (relevant for slice sampling).

Author(s)

Georgios Papageorgiou gpapageo@gmail.com

References

Consul, P. C. & Famoye, G. C. (1992). Generalized Poisson regression model. Communications in Statistics - Theory and Methods, 1992, 89-109.

Papageorgiou, G. (2018). Bayesian density regression for discrete outcomes. arXiv:1603.09706v3 [stat.ME].

Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Technical report, University of Warwick.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639-650.

Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Communications in Statistics Simulation and Computation, 36(1), 45-54.

Examples

#Bayesian nonparametric joint model with binomial response Y and one predictor X
data(simD)
pred<-seq(with(simD,min(X))+0.1,with(simD,max(X))-0.1,length.out=30)
npred<-length(pred)
# fit1 and fit2 define the same model but with different numbers of
# components and posterior samples
fit1 <- dpmj(cbind(Y,(E-Y))~X, Fcdf="binomial", data=simD, ncomp=10, sweeps=20,
             burn=10, sampler="truncated", Xpred=pred, offsetPred=30)
fit2 <- dpmj(cbind(Y,(E-Y))~X, Fcdf="binomial", data=simD, ncomp=50, sweeps=5000,
               burn=1000, sampler="truncated", Xpred=pred, offsetPred=30)
plot(with(simD,X),with(simD,Y)/with(simD,E))
lines(pred,fit2$medianReg/30,col=3,lwd=2)
# with discrete covariate
simD<-data.frame(simD,Xd=sample(c(0,1),300,replace=TRUE))
pred<-c(0,1)
fit3 <- dpmj(cbind(Y,(E-Y))~Xd, Fcdf="binomial", data=simD, ncomp=10, sweeps=20,
             burn=10, sampler="truncated", Xpred=pred, offsetPred=30)

[Package BNSP version 2.2.3 Index]