R: Bayesian Clustering with the Dirichlet-Process Prior

DIRECT {DIRECT}

R Documentation

Bayesian Clustering with the Dirichlet-Process Prior

Description

A Bayesian clustering method for multivariate data, e.g., replicated time-course microarray gene expression data. This method uses a mixture random-effects model that decomposes the total variability in the data into within-cluster variability, variability across experimental conditions (e.g., time points), and variability in replicates (i.e., residual variability). It also uses a Dirichlet-process prior to induce a distribution on the number of clusters as well as clustering. Metropolis-Hastings Markov chain Monte Carlo procedures are used for parameter estimation.

Usage

DIRECT (data, data.name = "Output", 
        SKIP = 0, nTime, times = 1:nTime, 
        c.curr, 
        uWICluster = 1, uTSampling = 1, uResidual = 1, 
        meanVec = rep(0, nTime), meanMT1 = 0, sdMT1 = 0.2, 
        meanMTProc = 0, sdMTProc = 0.5, uSDT1 = 0.2, uSDProc = 1, 
        shapeBetaProc = 0.5, rateBetaProc = 0.5, 
        PAR.INIT = TRUE, 
        sdWICluster.curr = 0.5, sdTSampling.curr = 0.5, 
        sdResidual.curr = 0.5, alpha.curr = 0.01, 
        alpha.prior.shape = 0.01, alpha.prior.rate = 1, 
        WICluster.prop.sd = 0.2, TSampling.prop.sd = 0.2, 
        Residual.prop.sd = 0.2, alpha.prop.sd = 0.2, 
        nIter, burn.in, step.size, nRepeat = 1, nResample, 
        seed.value, 
        RNORM.METHOD = c("chol", "eigen", "svd"), 
        SAMPLE.C = c("FRBT", "Neal"), 
        PRIOR.MODEL = c("none", "OU", "BM", "BMdrift"), 
        ALPHA.METHOD = c("Gibbs", "MH"), 
        RELABEL.THRESHOLD = 0.01, 
        OUTPUT.CLUST.SIZE = FALSE, PRINT = FALSE)

Arguments

`data`	An `N \times JR` matrix of continuous values, or a data frame containing such a matrix. `N` is the number if items, `J` the number of time points (or experimental conditions) and `R` the number of replicates. Each row contains values for Replicates 1 through `R` under Condition 1, values for Replicates 1 through `R` under Condition 2, and so on.
`data.name`	A character string used as the prefix of output files.
`SKIP`	Number of columns in `data` to be skipped when processing the data.
`nTime`	Number of time points (or experimental conditions).
`times`	An integer vector of length `nTime`, indicating times (or experimental conditions).
`c.curr`	An integer vector of length `N`, indicating initial cluster assignments for items 1 through `N`.
`uWICluster`	Upper bound of the uniform prior assigned to the standard deviation of within-cluster variability. The lower bound of the uniform prior is 0.
`uTSampling`	Upper bound of the uniform prior assigned to the standard deviation of variability due to sampling across time points (or experimental conditions). The lower bound of the uniform prior is 0.
`uResidual`	Upper bound of the uniform prior assigned to the standard deviation of residual variability (i.e., variability across replicates). The lower bound of the uniform prior is 0.
`meanVec`	Prior mean vector of length `nTime`. Required if `PRIOR.MODEL="none"`.
`meanMT1`	Prior mean (scalar) of the mean at the first time point. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift").
`sdMT1`	A positive scalar. Prior standard deviation (scalar) of the mean at the first time point. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift").
`meanMTProc`	Prior mean (scalar) of the mean across time points. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift"). Set to 0 if `PRIOR.MODEL="BM"`.
`sdMTProc`	A positive scalar. Prior standard deviation (scalar) of the mean across time points. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift").
`uSDT1`	The upper bound of the uniform prior assigned to the standard deviation at the first time point. The lower bound of the uniform prior is 0. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift").
`uSDProc`	The upper bound of the uniform prior assigned to the standard deviation across time points. The lower bound of the uniform prior is 0. Required if `PRIOR.MODEL` is one of the stochastic processes ("OU", "BM" and "BMdrift").
`shapeBetaProc`	A positive scalar. The shape parameter in the beta prior for the mean-reverting rate in an Ornstein-Uhlenbeck process. Required if `PRIOR.MODEL="OU"`.
`rateBetaProc`	A positive scalar. The rate parameter in the beta prior for the mean-reverting rate in an Ornstein-Uhlenbeck process. Required if `PRIOR.MODEL="OU"`.
`PAR.INIT`	Logical value. Generate initial values for the standard deviations of the three types of variability if TRUE. Use the input values otherwise.
`sdWICluster.curr`	A positive scalar. Initial value of the standard deviation of the within-cluster variability. Ignored if `PAR.INIT=TRUE`.
`sdTSampling.curr`	A positive scalar. Initial value of the standard deviation of the variability across time points. Ignored if `PAR.INIT=TRUE`.
`sdResidual.curr`	A positive scalar. Initial value of the standard deviation of the residual variability (i.e., variability across replicates). Ignored if `PAR.INIT=TRUE`.
`alpha.curr`	A positive scalar. Initial value of `\alpha`, the concentration parameter of the Dirichlet-process prior.
`alpha.prior.shape`	A positive scalar. The shape parameter in the beta prior for `\alpha`, the concentration parameter of the Dirichlet-process prior.
`alpha.prior.rate`	A positive scalar. The rate parameter in the beta prior for `\alpha`, the concentration parameter of the Dirichlet-process prior.
`WICluster.prop.sd`	A positive scalar. The standard deviation in the proposal distribution (normal) for the standard deviation of the within-cluster variability.
`TSampling.prop.sd`	A positive scalar. The standard deviation in the proposal distribution (normal) for the standard deviation of the variability across time points.
`Residual.prop.sd`	A positive scalar. The standard deviation in the proposal distribution (normal) for the standard deviation of the residual variability (i.e., variability across replicates).
`alpha.prop.sd`	A positive scalar. The standard deviation in the proposal distribution (normal) for `\alpha`, the concentration parameter of the Dirichlet-process prior. Ignored if `ALPHA.METHOD="Gibbs"`.
`nIter`	The number of MCMC iterations.
`burn.in`	A value in (0,1) indicating the percentage of the MCMC iterations to be used as burn-in and be discarded in posterior inference.
`step.size`	An integer indicating the number of MCMC iterations to be skipped between two recorded MCMC samples.
`nRepeat`	An integer `\geq 1` indicating the number of times to update the cluster memberships for all items. Useful only when `SAMPLE.C="Neal"`.
`nResample`	An integer `\geq 1` indicating the number of resamples to draw to estimate the posterior mixing proportions.
`seed.value`	A positive value used in random number generation.
`RNORM.METHOD`	Method to compute the determinant of the covariance matrix in the calculation of the multivariate normal density. Required. Method choices are: "chol" for Choleski decomposition, "eigen" for eigenvalue decomposition, and "svd" for singular value decomposition.
`SAMPLE.C`	Method to update cluster memberships. Required. Method choices are: "FRBT" for the Metropolis-Hastings sampler based on a discrete uniform proposal distribution developed in Fu, Russell, Bray and Tavare, and "Neal" for the Metropolis-Hastings sampler developed in Neal (2000).
`PRIOR.MODEL`	Model to generate realizations of the mean vector of a mixture component. Required. Choices are: "none" for not assuming a stochastic process and using a zero vector, "OU" for an Ornstein-Uhlenbeck process (a.k.a. the mean-reverting process), "BM" for a Brown motion (without drift), and "BMdrift" for a Brownian motion with drift.
`ALPHA.METHOD`	Method to update `\alpha`, the concentration parameter of the Dirichlet-process prior. Required. Choices are: "Gibbs" for a Gibbs sampler developed in Escobar and West (1995), and "MH" for a Metropolis-Hastings sampler.
`RELABEL.THRESHOLD`	A positive scalar. Used to determine whether the optimization in the relabeling algorithm has converged.
`OUTPUT.CLUST.SIZE`	If TRUE, output cluster sizes in MCMC iterations into an external file *_mcmc_size.out.
`PRINT`	If TRUE, print intermediate values during an MCMC run onto the screen. Used for debugging with small data sets.

Details

DIRECT is a mixture model-based clustering method. It consists of two major steps:

MCMC sampling. DIRECT generates MCMC samples of assignments to mixture components (number of components implicitly generated; written into external file *_mcmc_cs.out) and component-specific parameters (written into external file *_mcmc_pars.out), which include mean vectors and standard deviations of three types of variability.
Posterior inference, which further consists of two steps:
1. Resampling: DIRECT estimates posterior allocation probability matrix (written into external file *_mcmc_probs.out).
2. Relabeling: DIRECT deals with label-switching by estimating optimal labels of mixture components (written into external file *_mcmc_perms.out), implementing Algorithm 2 in Stephens (2000).

The arguments required to set up a DIRECT run can be divided into five categories:

Data-related, such as data, times and so on.
Initial values of parameters, including c.curr, sdWICluster.curr, sdTSampling.curr, sdResidual.curr and alpha.curr.
Values used to specify prior distributions, such as uWICluster, meanMT1, rateBetaProc, alpha.prior.shape and so on.
Standard deviation used in the proposal distributions for parameters of interest. A normal distribution whose mean is the current value and whose standard deviation is user-specified is used as the proposal. Reflection is used if the proposal is outside the range (e.g., (0,1)) for the parameter.
Miscellaneous arguments for MCMC configuration, for model choices and for output choices.

The user may set up multiple runs with different initial values or values in the prior distributions, and compare the clustering results to check whether the MCMC run has mixed well and whether the inference is sensitive to initial values or priors. If the data are informative enough, initial values and priors should lead to consistent clustering results.

Value

At least four files are generated during a DIRECT run and placed under the current working directory:

*_mcmc_cs.out: Generated from MCMC sampling. Each row contains an MCMC sample of assignments of items to mixture components, or cluster memberships if a component is defined as a cluster, as well as \alpha, the concentration parameter in the Dirichlet-process prior.
*_mcmc_pars.out: Generated from MCMC sampling. Each row contains an MCMC sample of parameters specific to a mixture component. Multiple rows may come from the same MCMC iteration.
*_mcmc_probs.out: Generated from resampling in posterior inference. File contains a matrix of HN \times K, which is H posterior allocation probability matrices stacked up, each matrix of N \times K, where H is the number of recorded MCMC samples, N the number of items and K the inferred number of mixture components.
*_mcmc_perms.out: Generated from relabeling in posterior inference. Each row contains an inferred permutation (relabel) of labels of mixture components.

If argument OUTPUT.CLUST.SIZE=TRUE, the fifth file *_mcmc_size.out is also generated, which contains the cluster sizes of each recorded MCMC sample.

Note

DIRECT calls the following functions adapted or directly taken from other R packages: dMVNorm, rMVNorm and rDirichlet. See documentation of each function for more information.

Author(s)

Audrey Q. Fu

References

Escobar, M. D. and West, M. (1995) Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90: 577-588.

Fu, A. Q., Russell, S., Bray, S. and Tavare, S. (2013) Bayesian clustering of replicated time-course gene expression data with weak signals. The Annals of Applied Statistics. 7(3) 1334-1361.

Stephens, M. (2000) Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, 62: 795-809.

Neal, R. M. (2000) Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9: 249-265.

Examples

## Not run: 
# Load replicated time-course gene expression data
# Use only first 50 genes for test run
data (tc.data)
data = tc.data[1:50,]
times = c(0,5,10,15,20,25,30,35,40,50,60,70,80,90,100,110,120,150)
nGene = nrow (data)
nTime=length (times)
SKIP = 2

# Initial values and MCMC specs
c.curr = rep (1, nGene)    # start with a single cluster
alpha.curr = 0.01

alpha.prior.shape = 1/nGene
alpha.prior.rate = 1

SAMPLE.C.METHOD="FRBT"    # method for sampling cluster memberships
PRIOR.MODEL = "OU"       # prior model for generating mean vector
ALPHA.METHOD = "MH"      # method for sampling concentration parameter
RELABEL.THRESHOLD=0.01   # stopping criterion used in relabeling algorithm

nIter=10
burn.in=0
step.size=1
nResample=2
seed.value = 12

data.name="tmp"          # prefix of output files

# Run DIRECT
# This is a short run that takes less than a minute
# All output files will be under current working directory
DIRECT (data=data, data.name=data.name, SKIP=SKIP, nTime=nTime, times=times, 
    c.curr=c.curr, PAR.INIT=TRUE, alpha.curr=alpha.curr, 
    alpha.prior.shape=alpha.prior.shape, 
    alpha.prior.rate=alpha.prior.rate, 
    nIter=nIter, burn.in=burn.in, step.size=step.size, 
    nResample=nResample, seed.value=seed.value, 
    RNORM.METHOD="svd", SAMPLE.C=SAMPLE.C.METHOD, 
    PRIOR.MODEL=PRIOR.MODEL, ALPHA.METHOD=ALPHA.METHOD, 
    RELABEL.THRESHOLD=RELABEL.THRESHOLD)

# Process MCMC samples from DIRECT
data.name="tmp"          # prefix of output files
tmp.summary = summaryDIRECT (data.name)

# Plot clustering results
#
# If the plots do not display well
# use pdf() to generate the plots in an external pdf
#
# Clustered mean profiles
plotClustersMean (data, tmp.summary, SKIP=2, times=times)
#
# To use pdf(), run the following lines in R
# > pdf ("plot_means.pdf")
# > plotClustersMean (data, tmp.summary, SKIP=2, times=times)
# > dev.off()
#
par (mfrow=c(1,1))
# Posterior estimates of standard deviations 
# of three types of variability in each cluster
plotClustersSD (tmp.summary, nTime=18)
# PCA plot of the posterior allocation probability matrix
plotClustersPCA (data$GeneName, tmp.summary)

## End(Not run)

[Package DIRECT version 1.1.0 Index]