Clustering {sharp} R Documentation

Consensus clustering

Description

Performs consensus (weighted) clustering. The underlying algorithm (e.g. hierarchical clustering) is run with different numbers of clusters nc. In consensus weighted clustering, weighted distances are calculated using the cosa2 algorithm with different penalty parameters Lambda. The hyper-parameters are calibrated by maximisation of the consensus score.

Usage

Clustering(
  xdata,
  nc = NULL,
  eps = NULL,
  Lambda = NULL,
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = 3,
  implementation = HierarchicalClustering,
  scale = TRUE,
  linkage = "complete",
  row = TRUE,
  optimisation = c("grid_search", "nloptr"),
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)

Arguments

xdata

data matrix with observations as rows and variables as columns.

nc

matrix of parameters controlling the number of clusters in the underlying algorithm specified in implementation. If nc is not provided, it is set to seq(1, tau*nrow(xdata)).

eps

radius in density-based clustering, see dbscan. Only used if implementation=DBSCANClustering.

Lambda

vector of penalty parameters for weighted distance calculation. Only used for distance-based clustering, including for example implementation=HierarchicalClustering, implementation=PAMClustering, or implementation=DBSCANClustering.

K

number of resampling iterations.

tau

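subsample size, expressed as a proportion of the observations (the default tau=0.5 draws half of them in each iteration).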

seed

value of the seed to initialise the random number generator and ensure reproducibility of the results (see set.seed).

n_cat

computation options for the stability score. Use NULL for the score based on a z test, or 2 or 3 for the score based on the negative log-likelihood. The default shown in Usage is n_cat = 3.

implementation

function to use for clustering. Possible functions include HierarchicalClustering (hierarchical clustering), PAMClustering (Partitioning Around Medoids), KMeansClustering (k-means) and GMMClustering (Gaussian Mixture Models). Alternatively, a user-defined function taking xdata and Lambda as arguments and returning a binary and symmetric matrix for which diagonal elements are equal to zero can be used.

scale

logical indicating if the data should be scaled to ensure that all variables contribute equally to the clustering of the observations.

linkage

character string indicating the type of linkage used in hierarchical clustering to define the stable clusters. Possible values include "complete", "single" and "average" (see argument "method" in hclust for a full list). Only used if implementation=HierarchicalClustering.

row

logical indicating if rows (if row=TRUE) or columns (if row=FALSE) contain the items to cluster.

optimisation

character string indicating the type of optimisation method to calibrate the regularisation parameter (only used if Lambda is not NULL). With optimisation="grid_search" (the default), all values in Lambda are visited. Alternatively, optimisation algorithms implemented in nloptr can be used with optimisation="nloptr". By default, we use "algorithm"="NLOPT_GN_DIRECT_L", "xtol_abs"=0.1, "ftol_abs"=0.1 and "maxeval" defined as length(Lambda). These values can be changed by providing the argument opts (see nloptr).

n_cores

number of cores to use for parallel computing (see argument workers in multisession). Using n_cores>1 is only supported with optimisation="grid_search".

output_data

logical indicating if the input dataset xdata should be included in the output.

verbose

logical indicating if a loading bar and messages should be printed.

beep

sound indicating the end of the run. Possible values are: NULL (no sound) or an integer between 1 and 11 (see argument sound in beep).

...

additional parameters passed to the functions provided in implementation or resampling.
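As noted for implementation, a user-defined function is expected to return a binary and symmetric matrix with a zero diagonal. The base-R sketch below shows one way such a matrix could be built from cluster labels. The helper name CoMembershipFromLabels is hypothetical and not part of sharp, and how the function is wired into Clustering is not covered here.

```r
# Hypothetical helper (not part of sharp): turn a vector of cluster labels
# into a binary co-membership matrix.
CoMembershipFromLabels <- function(labels) {
  # 1 where two items share a label, 0 otherwise
  comembership <- outer(labels, labels, FUN = "==") * 1
  # An item is not counted as co-clustered with itself
  diag(comembership) <- 0
  comembership
}

# Toy data (assumption): two well-separated groups of five observations
set.seed(1)
xdata <- rbind(
  matrix(rnorm(10, mean = 0), ncol = 2),
  matrix(rnorm(10, mean = 10), ncol = 2)
)

# Cluster labels from base-R hierarchical clustering with two clusters
hc <- stats::hclust(stats::dist(xdata), method = "complete")
labels <- stats::cutree(hc, k = 2)

adjacency <- CoMembershipFromLabels(labels)
```

The resulting adjacency matrix has one row and one column per item, which matches the shape described for the consensus matrix in Details.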

Details

In consensus clustering, a clustering algorithm is applied on K subsamples of the observations with different numbers of clusters provided in nc. If row=TRUE (the default), the observations (rows) are the items to cluster. If row=FALSE, the variables (columns) are the items to cluster. For a given number of clusters, the consensus matrix coprop stores the proportion of iterations where two items were in the same estimated cluster, out of all iterations where both items were drawn in the subsample.
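The co-membership proportions described above can be sketched in base R as follows. This is an illustration under simplifying assumptions (one fixed number of clusters, hierarchical clustering via stats::hclust, simulated toy data), not the implementation used in sharp. All object names are chosen for the example.

```r
# Illustrative base-R sketch; this is not the sharp implementation.
set.seed(1)
n <- 20
xdata <- rbind(
  matrix(rnorm(n, mean = 0), ncol = 2),   # 10 observations around 0
  matrix(rnorm(n, mean = 10), ncol = 2)   # 10 observations around 10
)
K <- 50     # number of subsampling iterations
tau <- 0.5  # proportion of items drawn in each subsample
nc <- 2     # number of clusters requested from the underlying algorithm

same_cluster <- matrix(0, nrow = n, ncol = n) # times a pair was co-clustered
both_drawn <- matrix(0, nrow = n, ncol = n)   # times a pair was drawn together

for (k in seq_len(K)) {
  ids <- sample(n, size = floor(tau * n))
  hc <- stats::hclust(stats::dist(xdata[ids, , drop = FALSE]),
    method = "complete"
  )
  labels <- stats::cutree(hc, k = nc)
  both_drawn[ids, ids] <- both_drawn[ids, ids] + 1
  same_cluster[ids, ids] <- same_cluster[ids, ids] +
    outer(labels, labels, FUN = "==")
}

# Proportion of co-clustering among iterations where both items were drawn
# (NaN for a pair that was never drawn together)
coprop <- same_cluster / both_drawn
diag(coprop) <- 0
```

Dividing by both_drawn rather than K is what restricts the proportion to iterations where both items were in the subsample.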

Stable cluster membership is obtained by applying a distance-based clustering method using (1-coprop) as distance (see Clusters).
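As a minimal illustration of this step, the sketch below applies base-R hierarchical clustering to (1-coprop) for a small consensus matrix. The matrix values are invented for the example and the choice of two clusters is an assumption. The Clusters function in sharp may proceed differently.

```r
# Toy consensus matrix for four items (values invented for illustration)
coprop <- matrix(
  c(
    0.0, 0.9, 0.1, 0.2,
    0.9, 0.0, 0.2, 0.1,
    0.1, 0.2, 0.0, 0.8,
    0.2, 0.1, 0.8, 0.0
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("item", 1:4), paste0("item", 1:4))
)

# Use (1 - coprop) as the distance between items
consensus_dist <- stats::as.dist(1 - coprop)

# Distance-based clustering of the items, cut into two stable clusters
hc <- stats::hclust(consensus_dist, method = "complete")
membership <- stats::cutree(hc, k = 2)
print(membership)
```

Items that were frequently co-clustered across subsamples have a small distance and therefore end up in the same stable cluster.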

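The hyper-parameters (the number of clusters nc and, in consensus weighted clustering, the penalty parameter Lambda) can be calibrated by maximisation of a stability score (see ConsensusScore) calculated under the null hypothesis of equiprobability of co-membership.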

It is strongly recommended to examine the calibration plot (see CalibrationPlot) to check that there is a clear maximum. The absence of a clear maximum suggests that the clustering is not stable, in which case the consensus clustering outputs should not be trusted.

To ensure reproducibility of the results, the starting number of the random number generator is set to seed.
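The effect of fixing the seed can be illustrated with a small base-R sketch. The helper DrawSubsample is hypothetical and only mimics the drawing of a subsample of proportion tau. It is not a function from sharp.

```r
# Hypothetical helper (not part of sharp): draw a subsample of item indices
# after initialising the random number generator with a given seed.
DrawSubsample <- function(n, tau, seed) {
  set.seed(seed)
  sample(n, size = floor(tau * n))
}

first_run <- DrawSubsample(n = 100, tau = 0.5, seed = 1)
second_run <- DrawSubsample(n = 100, tau = 0.5, seed = 1)
other_seed <- DrawSubsample(n = 100, tau = 0.5, seed = 2)

# Same seed, same subsample; a different seed gives a different draw
print(identical(first_run, second_run))
print(identical(first_run, other_seed))
```

Because the subsamples drive every downstream quantity, two runs with the same seed and the same arguments are expected to give the same consensus matrices.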

For parallelisation, consensus clustering with different sets of parameters can be run on n_cores cores. Using n_cores > 1 creates a multisession.

Value

An object of class clustering. A list with:

Sc

a matrix of the best stability scores for different (sets of) parameters controlling the number of clusters and penalisation of attribute weights.

nc

a matrix of numbers of clusters.

Lambda

a matrix of regularisation parameters for attribute weights.

Q

a matrix of the average number of selected attributes by the underlying algorithm with different regularisation parameters.

coprop

an array of consensus matrices. Rows and columns correspond to items. Indices along the third dimension correspond to different parameters controlling the number of clusters and penalisation of attribute weights.

selprop

an array of selection proportions. Columns correspond to attributes. Rows correspond to different parameters controlling the number of clusters and penalisation of attribute weights.

method

a list with type="clustering" and values used for arguments implementation, linkage, and resampling.

params

a list with values used for arguments K, tau, pk, n (number of observations in xdata), and seed.

The rows of Sc, nc, Lambda, Q, selprop and indices along the third dimension of coprop are ordered in the same way and correspond to parameter values stored in nc and Lambda.
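Because the rows are aligned, the row with the highest score identifies the calibrated parameters. The sketch below uses invented stand-ins for Sc, nc and Lambda to show the indexing. It does not call sharp, and the values are assumptions made for the example.

```r
# Toy stand-ins for the Sc, nc and Lambda elements (values invented)
Sc <- matrix(c(100, 250, 140, 90, 300, 160), ncol = 1)
nc <- matrix(c(2, 3, 4, 2, 3, 4), ncol = 1)
Lambda <- matrix(c(0.1, 0.1, 0.1, 1, 1, 1), ncol = 1)

# Row with the highest score (which.max skips missing values)
best_id <- which.max(Sc)

# The same row index gives the corresponding parameter values
best <- list(
  nc = nc[best_id, 1],
  Lambda = Lambda[best_id, 1],
  score = Sc[best_id, 1]
)
print(best)
```

The same index would select the matching slice along the third dimension of coprop, for example coprop[, , best_id].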

References

Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration of consensus weighted distance-based clustering approaches using sharp.” Bioinformatics, btad635. ISSN 1367-4811, doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.

Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.

Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815–849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.

Monti S, Tamayo P, Mesirov J, Golub T (2003). “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.” Machine Learning, 52(1), 91–118. doi:10.1023/A:1023949509487.

See Also

Resample, ConsensusScore, HierarchicalClustering, PAMClustering, KMeansClustering, GMMClustering

Other stability functions: BiSelection(), GraphicalModel(), StructuralModel(), VariableSelection()

Examples


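# Consensus clustering
set.seed(1)
# Simulate three groups of 30 observations
simul <- SimulateClustering(
  n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5
)
# Run with the default settings shown in Usage
stab <- Clustering(xdata = simul$data)
print(stab)
# Check that the score has a clear maximum
CalibrationPlot(stab)
summary(stab)
# Stable cluster membership
Clusters(stab)
plot(stab)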

# Consensus weighted clustering (only run if rCOSA is installed)
if (requireNamespace("rCOSA", quietly = TRUE)) {
  set.seed(1)
  # Simulate 20 variables; theta_xc flags the first 10 as contributing
  simul <- SimulateClustering(
    n = c(30, 30, 30), pk = 20,
    theta_xc = c(rep(1, 10), rep(0, 10)),
    ev_xc = 0.9
  )
  # noit and niter are not arguments of Clustering: they are passed via ...
  stab <- Clustering(
    xdata = simul$data,
    Lambda = LambdaSequence(lmin = 0.1, lmax = 10, cardinal = 10),
    noit = 20, niter = 10
  )
  print(stab)
  CalibrationPlot(stab)
  summary(stab)
  Clusters(stab)
  plot(stab)
  WeightBoxplot(stab)
}


[Package sharp version 1.4.6 Index]