R: Metropolis-coupled Markov chain Monte Carlo sampler

coupledMetropolis {BayesBinMix}

R Documentation

Metropolis-coupled Markov chain Monte Carlo sampler

Description

Main function of the package. The algorithm consists of the allocation sampler combined with a MC3 scheme.

Usage

coupledMetropolis(Kmax, nChains, heats, binaryData, outPrefix, 
	ClusterPrior, m, alpha, beta, gamma, z.true, ejectionAlpha, burn)

Arguments

`Kmax`	Maximum number of clusters (integer, at least equal to two).
`nChains`	Number of parallel (heated) chains. Ideally, it should be equal to the number of available threads.
`heats`	`nChains`-dimensional vector specifying the temperature of each chain: the 1st entry should always be equal to 1 and the rest of them lie on the set: `(0,1]`.
`binaryData`	The observed binary data (array). Missing values are allowed as long as the corresponding entries are denoted as `NA`.
`outPrefix`	The name of the produced output folder. An error is thrown if the directory exists.
`ClusterPrior`	Character string specifying the prior distribution of the number of clusters on the set `\{1,\ldots,K_{max}\}`. Available options: `poisson` or `uniform`. It defaults to the truncated Poisson distribution.
`m`	The number of MCMC cycles. At the end of each cycle a swap between a pair of heated chains is attempted. Each cycle consists of 10 iterations.
`alpha`	First shape parameter of the Beta prior distribution (strictly positive). Defaults to 1.
`beta`	Second shape parameter of the Beta prior distribution (strictly positive). Defaults to 1.
`gamma`	`Kmax`-dimensional vector (positive) corresponding to the parameters of the Dirichlet prior of the mixture weights. Default value: `rep(1,Kmax)`.
`z.true`	An optional vector of cluster assignments considered as the ground-truth clustering of the observations. Useful for simulations.
`ejectionAlpha`	Probability of ejecting an empty component. Defaults to 0.2.
`burn`	Optional integer denoting the number of MCMC cycles that will be discarded as burn-in period.

Details

In the case that the most probable number of clusters is larger than 1, the output is post-processed using the label.switching package. In addition to the objects returned to the user (see value below), the complete output of the sampler is written to the directory outPrefix. It consists of the following files:

K.allChains.txt m\timesnChains matrix containing the simulated values of the number of clusters (K) per chain.
K.txt the m simulated values of the number of clusters (K) of the cold chain (posterior distribution).
p.varK.txt the simulated values of the mixture weights (not identifiable).
rawMCMC.mapK.KVALUE.txt the raw MCMC output which corresponds to the most probable model (not identifiable).
reorderedMCMC-ECR-ITERATIVE1.mapK.KVALUE.txt the reordered MCMC output which corresponds to the most probable model, reordered according to the ECR-ITERATIVE-1 algorithm.
reorderedMCMC-ECR.mapK.KVALUE.txt the reordered MCMC output which corresponds to the most probable model, reordered according to the ECR algorithm.
reorderedMCMC-STEPHENS.mapK.KVALUE.txt the reordered MCMC output which corresponds to the most probable model, reordered according to the STEPHENS algorithm.
reorderedSingleBestClusterings.mapK.KVALUE.txt the most probable allocation of each observation after reordering the MCMC sample which corresponds to the most probable number of clusters.
theta.varK.txt the simulated values of Bernoulli parameters (not identifiable).
z-ECR-ITERATIVE1.mapK.KVALUE.txt the reordered simulated latent allocations which corresponds to the most probable model, reordered according to the ECR-ITERATIVE-1 algorithm.
z-ECR.mapK.KVALUE.txt the reordered simulated latent allocations which corresponds to the most probable model, reordered according to the ECR algorithm.
z-KL.mapK.KVALUE.txt the reordered simulated latent allocations which corresponds to the most probable model, reordered according to the STEPHENS algorithm.
z.varK.txt the simulated latent allocations (not identifiable).
classificationProbabilities.mapK.KVALUE.csv the reordered classification probabilities per observation after reordering the most probable number of clusters with the ECR algorithm.
xEstimated.txt Observed data with missing values estimated by their posterior mean estimate. This file is produced only in the case that the observed data contains missing values.

KVALUE will be equal to the inferred number of clusters. Note that the label switching part is omitted in case that the most probable number of clusters is equal to 1.

Value

The basic output of the sampler is returned to the following R objects:

`K.mcmc`	object of class `mcmc` (see `coda` package) containing the simulated values (after burn-in) of the number of clusters for the cold chain.
`parameters.ecr.mcmc`	object of class `mcmc` (see `coda` package) containing the simulated values (after burn-in) of `\theta_{kj}` (probability of success per cluster `k` and feature `j`) and `\pi_k` (weight of cluster `k`) for `k = 1,\ldots,K_{\mbox{map}}`; `j = 1,\ldots,d`, where `K_{\mbox{map}}` denotes the most probable number of clusters. The output is reordered according to `ECR` algorithm.
`allocations.ecr.mcmc`	object of class `mcmc` (see `coda` package) containing the simulated values (after burn-in) of `z_{kj}` (allocation variables) for `k = 1,\ldots,K_{\mbox{map}}`, `j = 1,\ldots,d`. The output is reordered according to `ECR` algorithm.
`classificationProbabilities.ecr`	data frame of the reordered classification probabilities per observation after reordering the most probable number of clusters with the `ECR` algorithm.
`clusterMembershipPerMethod`	data frame of the most probable allocation of each observation after reordering the MCMC sample which corresponds to the most probable number of clusters according to `ECR`, `STEPHENS` and `ECR-ITERATIVE-1` methods.
`K.allChains`	`m\timesnChains` matrix containing the simulated values of the number of clusters (`K`) per chain.
`chainInfo`	Number of cycles, burn-in period and acceptance rate of swap moves.

Author(s)

Panagiotis Papastamoulis

References

Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F. (2004): Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20(3): 407-415.

Nobile A and Fearnside A (2007): Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17(2): 147-162.

Papastamoulis P. and Iliopoulos G. (2010). An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19: 313-331.

Papastamoulis P. and Iliopoulos G. (2013). On the convergence rate of Random Permutation Sampler and ECR algorithm in missing data models. Methodology and Computing in Applied Probability, 15(2): 293-304.

Papastamoulis P. (2014). Handling the label switching problem in latent class models via the ECR algorithm. Communications in Statistics, Simulation and Computation, 43(4): 913-927.

Papastamoulis P (2016): label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69(1): 1-24.

Examples

#generate dataset from a mixture of 2 ten-dimensional Bernoulli distributions.
set.seed(1)
d <- 10 # number of columns
n <- 50 # number of rows (sample size)
K <- 2 	 # true number of clusters
p.true <- myDirichlet(rep(10,K)) # true weight of each cluster
z.true <- numeric(n) # true cluster membership
z.true <- sample(K,n,replace=TRUE,prob = p.true)
#true probability of positive responses per cluster:
theta.true <- array(data = NA, dim = c(K,d)) 
for(j in 1:d){
    theta.true[,j] <- rbeta(K, shape1 = 1, shape2 = 1)
}
x <- array(data=NA,dim = c(n,d)) # data: n X d array
for(k in 1:K){
        myIndex <- which(z.true == k)
        for (j in 1:d){
                x[myIndex,j] <- rbinom(n = length(myIndex), 
			size = 1, prob = theta.true[k,j])   
        }
}
#	number of heated paralled chains
nChains <- 2
heats <- seq(1,0.8,length = nChains)
## Not run: 
cm <- coupledMetropolis(Kmax = 10,nChains = nChains,heats =  heats,
	binaryData = x, outPrefix = 'BayesBinMixExample',
	ClusterPrior = 'poisson', m = 1100, burn = 100)
#	print summary using:
print(cm)

## End(Not run)
# it is also advised to use z.true = z.true in order to directly compare with 
# the true values. In general it is advised to use at least 4 chains with 
#	heats <- seq(1,0.3,length = nChains)

[Package BayesBinMix version 1.4.1 Index]