contextCluster {clusternomics}    R Documentation

Clusternomics: Context-dependent clustering

Description

This function fits the context-dependent clustering model to the data using Gibbs sampling. It allows the user to specify a different number of clusters on the global level, as well as on the local level.

Usage

contextCluster(datasets, clusterCounts, dataDistributions = "diagNormal",
  prior = NULL, maxIter = 1000, burnin = NULL, lag = 3,
  verbose = FALSE)

Arguments

datasets

List of data matrices where each matrix represents a context-specific dataset. Each data matrix has the size N times M, where N is the number of data points and M is the dimensionality of the data. The full list of matrices has length C. The number of data points N must be the same for all data matrices.
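As a minimal sketch of the expected input, the following builds two context-specific matrices that share the same N rows. The variable names and the random data are illustrative assumptions, not part of the package. The two matrices differ in M on the assumption that only N is constrained, as the description above suggests.

```r
set.seed(42)
N <- 100  # number of data points, shared by all contexts

# Two hypothetical contexts (M = 2 and M = 3); differing M is an assumption
context1 <- matrix(rnorm(N * 2), nrow = N, ncol = 2)
context2 <- matrix(rnorm(N * 3), nrow = N, ncol = 3)

datasets <- list(context1, context2)  # list of length C = 2

# Sanity check: every matrix must have the same number of rows
rowCounts <- vapply(datasets, nrow, integer(1))
stopifnot(length(unique(rowCounts)) == 1)
```

Running a check like this before fitting catches mismatched datasets early.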

clusterCounts

Number of clusters on the global level and in each context. List with the following structure: clusterCounts = list(global=global, context=context) where global is the number of global clusters, and context is the list of numbers of clusters in the individual contexts (datasets) of length C where context[c] is the number of clusters in dataset c.
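The structure described above might be built as follows. The values mirror those used in the Examples section; C = 2 is assumed here purely for illustration.

```r
C <- 2  # number of contexts (datasets), assumed for this sketch

clusterCounts <- list(
  global  = 10,       # number of clusters on the global level
  context = c(3, 3)   # number of clusters in context 1 and in context 2
)

# One context-level count is needed per dataset
stopifnot(length(clusterCounts$context) == C)
```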

dataDistributions

Distribution of data in each dataset. Can be either a list of length C where dataDistributions[c] is the distribution of dataset c, or a single string when all datasets have the same distribution. The currently implemented distribution is 'diagNormal', a multivariate Normal distribution with a diagonal covariance matrix.

prior

Prior distribution. If NULL then the prior is estimated using the datasets. The 'diagNormal' distribution uses the Normal-Gamma distribution as a prior for each dimension.
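For reference, one common parameterization of a Normal-Gamma prior on the mean and precision of a single dimension is sketched below. The hyperparameter symbols (mu_0, kappa_0, alpha_0, beta_0) and the rate convention for the Gamma distribution are assumptions made for illustration; the package may parameterize the prior differently.

```latex
\lambda \sim \mathrm{Gamma}(\alpha_0, \beta_0), \qquad
\mu \mid \lambda \sim \mathcal{N}\!\left(\mu_0,\ (\kappa_0 \lambda)^{-1}\right), \qquad
x_i \mid \mu, \lambda \sim \mathcal{N}\!\left(\mu,\ \lambda^{-1}\right)
```

Here lambda is the precision (inverse variance) of the dimension and x_i is the value of data point i in that dimension.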

maxIter

Number of iterations of the Gibbs sampling algorithm.

burnin

Number of burn-in iterations that will be discarded. If not specified, the algorithm discards the first half of the maxIter samples.

lag

Lag used for thinning the samples, by default 3.
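To illustrate how burn-in and thinning interact, the sketch below computes which iterations would be retained if every lag-th iteration after burn-in were kept. This indexing scheme is an assumption about typical MCMC thinning, not a confirmed description of the package internals.

```r
maxIter <- 1000
burnin  <- maxIter / 2  # documented default when burnin is not specified
lag     <- 3

# Assumed thinning scheme: keep every lag-th iteration after burn-in
kept <- seq(from = burnin + 1, to = maxIter, by = lag)

length(kept)  # number of retained samples under this assumption
head(kept)    # first few retained iteration indices
```

Under this assumption, a larger lag leaves fewer retained samples, so maxIter may need to grow accordingly.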

verbose

If TRUE, print progress during sampling; by default FALSE.

Value

Returns a list containing the sequence of MCMC states and the log likelihoods of the individual states.

samples

List of assignments sampled from the posterior. Each state samples[[i]] is a data frame with local and global assignments for each data point.

logliks

Log likelihoods during MCMC iterations.

DIC

Deviance information criterion to help select the number of clusters. Lower values of DIC correspond to better-fitting models.
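One way DIC might be used for model selection is to fit several candidate configurations and compare the resulting values. The sketch below reuses only the calls shown in the Examples section; the candidate values, the fixed global = 10, and the very short chains are illustrative assumptions, and the code runs only if the clusternomics package is installed.

```r
if (requireNamespace("clusternomics", quietly = TRUE)) {
  library(clusternomics)
  set.seed(1)
  testData <- generateTestData_2D(c(50, 10, 40, 60), c(-1.5, 1.5))

  # Candidate numbers of context-level clusters (illustrative values)
  candidates <- 2:4
  dics <- sapply(candidates, function(k) {
    fit <- contextCluster(testData$data,
                          list(global = 10, context = c(k, k)),
                          dataDistributions = "diagNormal",
                          maxIter = 10, burnin = 5, lag = 1,  # far too short for real use
                          verbose = FALSE)
    fit$DIC
  })
  names(dics) <- candidates
  print(dics)

  # Candidate with the lowest DIC
  best <- candidates[which.min(dics)]
}
```

With chains this short the comparison is not meaningful; in practice each fit would use many more iterations.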

Examples

# Example with simulated data (see vignette for details)
# Number of elements in each cluster
groupCounts <- c(50, 10, 40, 60)
# Centers of clusters
means <- c(-1.5,1.5)
testData <- generateTestData_2D(groupCounts, means)
datasets <- testData$data

# Fit the model
# 1. specify number of clusters
clusterCounts <- list(global=10, context=c(3,3))
# 2. Run inference
# Number of iterations is just for demonstration purposes, use
# a larger number of iterations in practice!
results <- contextCluster(datasets, clusterCounts,
     maxIter = 10, burnin = 5, lag = 1,
     dataDistributions = 'diagNormal',
     verbose = TRUE)

# Extract results from the samples
# Final state:
state <- results$samples[[length(results$samples)]]
# 1) assignment to global clusters
globalAssgn <- state$Global
# 2) context-specific assignments: assignment in a specific dataset (context)
contextAssgn <- state[,"Context 1"]
# Assess the fit of the model with DIC
results$DIC
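Cluster labels in MCMC output can switch between samples, so a single state is not always representative. A label-invariant summary is the fraction of samples in which two data points share a cluster. The following base-R sketch computes it on a hypothetical stand-in for results$samples; the mock data and variable names are assumptions for illustration only.

```r
set.seed(7)
nPoints <- 6

# Hypothetical stand-in for results$samples: a list of data frames
# with one row per data point and a 'Global' assignment column
mockSamples <- lapply(1:20, function(i) {
  data.frame(Global = sample(1:3, nPoints, replace = TRUE))
})

# Fraction of samples in which each pair of points shares a global cluster
coclust <- Reduce(`+`, lapply(mockSamples, function(s) {
  outer(s$Global, s$Global, "==") * 1
})) / length(mockSamples)

round(coclust, 2)
```

Replacing mockSamples with results$samples would give the same summary for a fitted model, assuming each state has a Global column as in the example above.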


[Package clusternomics version 0.1.1 Index]