R: Clusternomics: Context-dependent clustering

contextCluster {clusternomics}

R Documentation

Clusternomics: Context-dependent clustering

Description

This function fits the context-dependent clustering model to the data using Gibbs sampling. The user specifies the number of clusters separately at the global level and at the local (context-specific) level.

Usage

contextCluster(datasets, clusterCounts, dataDistributions = "diagNormal",
  prior = NULL, maxIter = 1000, burnin = NULL, lag = 3,
  verbose = FALSE)

Arguments

`datasets`	List of data matrices where each matrix represents a context-specific dataset. Each data matrix has the size N times M, where N is the number of data points and M is the dimensionality of the data. The full list of matrices has length C. The number of data points N must be the same for all data matrices.
`clusterCounts`	Number of clusters on the global level and in each context. List with the following structure: `clusterCounts = list(global=global, context=context)` where `global` is the number of global clusters, and `context` is the list of numbers of clusters in the individual contexts (datasets) of length C where `context[c]` is the number of clusters in dataset c.
`dataDistributions`	Distribution of data in each dataset. Can be either a list of length C where `dataDistributions[c]` is the distribution of dataset c, or a single string when all datasets have the same distribution. The currently implemented distribution is `'diagNormal'`, a multivariate Normal distribution with a diagonal covariance matrix.
`prior`	Prior distribution. If `NULL`, the prior is estimated from the datasets. The `'diagNormal'` distribution uses the Normal-Gamma distribution as a prior for each dimension.
`maxIter`	Number of iterations of the Gibbs sampling algorithm.
`burnin`	Number of burn-in iterations that will be discarded. If not specified, the algorithm discards the first half of the `maxIter` samples.
`lag`	Thinning interval for the samples. In the usual MCMC sense, only every `lag`-th sample after burn-in is retained.
`verbose`	Whether to print progress. Defaults to `FALSE`.
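
The argument structure described above can be sketched in base R as follows. This is a minimal illustration with randomly generated matrices; the helper `checkContextInputs` and the toy dimensions are assumptions made for this example and are not part of the clusternomics package. The last few lines show which iterations would be retained under the conventional thinning rule (drop the first `burnin` iterations, then keep every `lag`-th one), which this page does not spell out and is assumed here.

```r
# Toy input: C = 2 contexts, N = 20 data points, with M = 2 and M = 3 dimensions
set.seed(1)
N <- 20
datasets <- list(
  matrix(rnorm(N * 2), nrow = N, ncol = 2),
  matrix(rnorm(N * 3), nrow = N, ncol = 3)
)

# Global cluster count plus one local count per context
clusterCounts <- list(global = 10, context = c(3, 3))

# Hypothetical helper (not part of clusternomics): checks the constraints
# stated in the argument descriptions
checkContextInputs <- function(datasets, clusterCounts) {
  nRows <- vapply(datasets, nrow, integer(1))
  stopifnot(
    length(unique(nRows)) == 1,                        # same N in every context
    length(clusterCounts$context) == length(datasets)  # one count per context
  )
  invisible(TRUE)
}

checkContextInputs(datasets, clusterCounts)

# Assumed thinning rule (conventional, not confirmed by this page):
# discard the first `burnin` iterations, then keep every `lag`-th one
maxIter <- 1000
burnin <- maxIter / 2  # documented default when burnin is not specified
lag <- 3
kept <- seq(burnin + 1, maxIter, by = lag)
length(kept)  # 167 retained iterations under this assumption
```

Under this assumption, the default settings would leave 167 of the 1000 iterations for posterior summaries, so `maxIter` and `lag` are worth choosing together.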

Value

Returns a list containing the sequence of MCMC states, the log likelihoods of the individual states, and the DIC.

`samples`	List of assignments sampled from the posterior. Each state `samples[[i]]` is a data frame with local and global assignments for each data point.
`logliks`	Log likelihoods during MCMC iterations.
`DIC`	Deviance information criterion to help select the number of clusters. Lower values of DIC correspond to better-fitting models.
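
As a rough illustration of working with the returned `samples`, the sketch below builds a small mock list with the same column layout used in the Examples section (`Global`, `Context 1`) and estimates how often each pair of data points shares a global cluster across samples. The mock assignments and the helper `coclusteringFrequency` are hypothetical and exist only for this example; real input would come from `results$samples`, and the exact format of the assignment labels there is not assumed beyond being comparable with `==`.

```r
# Mock MCMC output: 3 sampled states for 4 data points (hypothetical values)
mockSamples <- list(
  data.frame(Global = c(1, 1, 2, 2), `Context 1` = c(1, 1, 2, 2),
             check.names = FALSE),
  data.frame(Global = c(1, 1, 1, 2), `Context 1` = c(1, 1, 1, 2),
             check.names = FALSE),
  data.frame(Global = c(1, 2, 2, 2), `Context 1` = c(1, 2, 2, 2),
             check.names = FALSE)
)

# Hypothetical helper (not part of clusternomics): fraction of samples in
# which points i and j are assigned to the same global cluster
coclusteringFrequency <- function(samples) {
  n <- nrow(samples[[1]])
  counts <- matrix(0, n, n)
  for (s in samples) {
    # outer() with "==" yields an n-by-n logical matrix of shared assignments
    counts <- counts + outer(s$Global, s$Global, "==")
  }
  counts / length(samples)
}

coFreq <- coclusteringFrequency(mockSamples)
coFreq
```

Averaging over all retained samples, rather than reading off only the final state, gives a summary that reflects posterior uncertainty in the assignments.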

Examples

# Example with simulated data (see vignette for details)
# Number of elements in each cluster
groupCounts <- c(50, 10, 40, 60)
# Centers of clusters
means <- c(-1.5,1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. specify number of clusters
clusterCounts <- list(global=10, context=c(3,3))
# 2. Run inference
# Number of iterations is just for demonstration purposes, use
# a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
     maxIter = 10, burnin = 5, lag = 1,
     dataDistributions = 'diagNormal',
     verbose = TRUE)

# Extract results from the samples
# Final state:
state <- results$samples[[length(results$samples)]]
# 1) assignment to global clusters
globalAssgn <- state$Global
# 2) context-specific assignments: assignment in a specific dataset (context)
contextAssgn <- state[,"Context 1"]
# Assess the fit of the model with DIC
results$DIC

[Package clusternomics version 0.1.1 Index]