DPMCMC {DIRECT}R Documentation

Dirichlet Process-Based Markov Chain Monte Carlo (MCMC) Sampler for Mixture Model-Based Clustering

Description

The MCMC sampler for DIRECT. In each MCMC iteration, the function updates cluster memberships for all items, allowing for changes in the number of clusters (mixture components). This update implements a Metropolis-Hastings (MH) sampler developed in Fu et al. (2013), and an MH sampler developed in Neal (2000). It also updates parameters specific to each mixture components via MH sampling. Parameters of interest include the mean vector and standard deviations of the three types of variability. Additionally, it updates \alpha, the concentration parameter in the Dirichlet-process prior, allowing for Gibbs (Escobar and West, 1995) and MH sampling.

Usage

DPMCMC(file.mcmc.cs, file.mcmc.pars, file.mcmc.probs, file.size, 
       data, SKIP, nTime, times, c.curr, par.prior, 
       PAR.INIT = FALSE, 
       sdWICluster.curr = 0.5, sdTSampling.curr = 0.5, 
       sdResidual.curr = 0.5, alpha.curr, 
       alpha.prior.shape, alpha.prior.rate, sd.prop, 
       nIter, burn.in, step.size, nRepeat = 1, nResample, seed.value, 
       RNORM.METHOD = c("chol", "eigen", "svd"), 
       SAMPLE.C = c("FRBT", "Neal"), 
       PRIOR.MODEL = c("none", "OU", "BM", "BMdrift"), 
       ALPHA.METHOD = c("Gibbs", "MH"), 
       OUTPUT.CLUST.SIZE = FALSE, PRINT = FALSE)

Arguments

file.mcmc.cs

A character string in quotation marks indicating the output filename for cluster memberships and \alpha.

file.mcmc.pars

A character string in quotation marks indicating the output filename for MCMC samples of parameters specific to mixture components.

file.mcmc.probs

A character string in quotation marks indicating the output filename for posterior allocation probability matrices from the resampling step.

file.size

A character string in quotation marks indicating the output filename for cluster sizes.

data

An N \times JR matrix of continuous values, or a data frame containing such a matrix. N is the number if items, J the number of time points (or experimental conditions) and R the number of replicates. Each row contains values for Replicates 1 through R under Condition 1, values for Replicates 1 through R under Condition 2, and so on.

SKIP

Number of columns in data to be skipped when processing the data.

nTime

Number of time points (or experimental conditions).

times

An integer vector of length nTime, indicating times (or experimental conditions).

c.curr

An integer vector of length N, indicating initial cluster assignments for items 1 through N.

par.prior

A list that contains parameters of the prior distributions. It has the following format: par.prior = list (uWICluster=???, uTSampling=???, uResidual=???, mean=???, meanMT1=???, sdMT1=???, meanMTProc=???, sdMTProc=???, uSDT1=???, uSDProc=???, shapeBetaProc=???, rateBetaProc=???). See DIRECT for possible values of the list components.

PAR.INIT

Logical value. Generate initial values for the standard deviations of the three types of variability if TRUE. Use the input values otherwise.

sdWICluster.curr

A positive scalar. Initial value of the standard deviation of the within-cluster variability. Ignored if PAR.INIT=TRUE.

sdTSampling.curr

A positive scalar. Initial value of the standard deviation of the variability across time points. Ignored if PAR.INIT=TRUE.

sdResidual.curr

A positive scalar. Initial value of the standard deviation of the residual variability (i.e., variability across replicates). Ignored if PAR.INIT=TRUE.

alpha.curr

A positive scalar. Initial value of \alpha, the concentration parameter of the Dirichlet-process prior.

alpha.prior.shape

A positive scalar. The shape parameter in the beta prior for \alpha, the concentration parameter of the Dirichlet-process prior.

alpha.prior.rate

A positive scalar. The rate parameter in the beta prior for \alpha, the concentration parameter of the Dirichlet-process prior.

sd.prop

A list that contains standard deviations in the proposal distributions for some key parameters. It has the following format: sd.prop=list (WICluster=???, TSampling=???, Residual=???, alpha=???). ??? needs to be filled in with positive values. See DIRECT.

nIter

The number of MCMC iterations.

burn.in

A value in (0,1) indicating the percentage of the MCMC iterations to be used as burn-in and be discarded in posterior inference.

step.size

An integer indicating the number of MCMC iterations to be skipped between two recorded MCMC samples.

nRepeat

An integer \geq 1 indicating the number of times to update the cluster memberships for all items. Useful only when SAMPLE.C="Neal".

nResample

An integer \geq 1 indicating the number of resamples to draw to estimate the posterior mixing proportions.

seed.value

A positive value used in random number generation.

RNORM.METHOD

Method to compute the determinant of the covariance matrix in the calculation of the multivariate normal density. Required. Method choices are: "chol" for Choleski decomposition, "eigen" for eigenvalue decomposition, and "svd" for singular value decomposition.

SAMPLE.C

Method to update cluster memberships. Required. Method choices are: "FRBT" for the Metropolis-Hastings sampler based on a discrete uniform proposal distribution developed in Fu, Russell, Bray and Tavare, and "Neal" for the Metropolis-Hastings sampler developed in Neal (2000).

PRIOR.MODEL

Model to generate realizations of the mean vector of a mixture component. Required. Choices are: "none" for not assuming a stochastic process and using a zero vector, "OU" for an Ornstein-Uhlenbeck process (a.k.a. the mean-reverting process), "BM" for a Brown motion (without drift), and "BMdrift" for a Brownian motion with drift.

ALPHA.METHOD

Method to update \alpha, the concentration parameter of the Dirichlet-process prior. Required. Choices are: "Gibbs" for a Gibbs sampler developed in Escobar and West (1995), and "MH" for a Metropolis-Hastings sampler.

OUTPUT.CLUST.SIZE

If TRUE, output cluster sizes in MCMC iterations into an external file *_mcmc_size.out.

PRINT

If TRUE, print intermediate values during an MCMC run onto the screen. Used for debugging for small data sets.

Details

The MCMC sampling step in DIRECT is accomplished with DPMCMC. DPMCMC generates MCMC samples of assignments to mixture components (number of components implicitly generated; written into external file *_mcmc_cs.out) and component-specific parameters (written into external file *_mcmc_pars.out), which include mean vectors and standard deviations of three types of variability.

Value

At least two files are generated by DPMCMC and placed under the current working directory:

  1. *_mcmc_cs.out: Generated from MCMC sampling. Each row contains an MCMC sample of assignments of items to mixture components, or cluster memberships if a component is defined as a cluster, as well as \alpha, the concentration parameter in the Dirichlet-process prior.

  2. *_mcmc_pars.out: Generated from MCMC sampling. Each row contains an MCMC sample of parameters specific to a mixture component. Multiple rows may come from the same MCMC iteration.

If argument OUTPUT.CLUST.SIZE=TRUE, an additional file *_mcmc_size.out is also generated, which contains the cluster sizes of each recorded MCMC sample.

Author(s)

Audrey Q. Fu

References

Escobar, M. D. and West, M. (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90: 577-588.

Fu, A. Q., Russell, S., Bray, S. and Tavare, S. (2013) Bayesian clustering of replicated time-course gene expression data with weak signals. The Annals of Applied Statistics. 7(3) 1334-1361.

Neal, R. M. (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9: 249-265.

See Also

DIRECT, which calls DPMCMC.

Examples

## See example in DIRECT.

[Package DIRECT version 1.1.0 Index]