Clustering {sharp} | R Documentation |
Consensus clustering
Description
Performs consensus (weighted) clustering. The underlying algorithm (e.g.
hierarchical clustering) is run with different numbers of clusters nc.
In consensus weighted clustering, weighted distances are calculated using the
cosa2 algorithm with different penalty parameters Lambda. The
hyper-parameters are calibrated by maximisation of the consensus score.
Usage
Clustering(
  xdata,
  nc = NULL,
  eps = NULL,
  Lambda = NULL,
  K = 100,
  tau = 0.5,
  seed = 1,
  n_cat = 3,
  implementation = HierarchicalClustering,
  scale = TRUE,
  linkage = "complete",
  row = TRUE,
  optimisation = c("grid_search", "nloptr"),
  n_cores = 1,
  output_data = FALSE,
  verbose = TRUE,
  beep = NULL,
  ...
)
Arguments
xdata |
data matrix with observations as rows and variables as columns. |
nc |
matrix of parameters controlling the number of clusters in the
underlying algorithm specified in |
eps |
radius in density-based clustering, see
|
Lambda |
vector of penalty parameters for weighted distance calculation.
Only used for distance-based clustering, including for example
|
K |
number of resampling iterations. |
tau |
subsample size. |
seed |
value of the seed to initialise the random number generator and
ensure reproducibility of the results (see |
n_cat |
computation options for the stability score. Default is
|
implementation |
function to use for clustering. Possible functions
include |
scale |
logical indicating if the data should be scaled to ensure that all variables contribute equally to the clustering of the observations. |
linkage |
character string indicating the type of linkage used in
hierarchical clustering to define the stable clusters. Possible values
include |
row |
logical indicating if rows (if |
optimisation |
character string indicating the type of optimisation
method to calibrate the regularisation parameter (only used if
|
n_cores |
number of cores to use for parallel computing (see argument
|
output_data |
logical indicating if the input datasets |
verbose |
logical indicating if a loading bar and messages should be printed. |
beep |
sound indicating the end of the run. Possible values are:
|
... |
additional parameters passed to the functions provided in
|
Details
In consensus clustering, a clustering algorithm is applied on K subsamples
of the observations with different numbers of clusters provided in nc. If
row=TRUE (the default), the observations (rows) are the items to cluster. If
row=FALSE, the variables (columns) are the items to cluster. For a given
number of clusters, the consensus matrix coprop stores the proportion of
iterations where two items were in the same estimated cluster, out of all
iterations where both items were drawn in the subsample.
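The definition of the consensus matrix can be illustrated with a minimal base-R sketch. This is not the code used by the package: the toy data, the use of hclust() and cutree() as the underlying algorithm, and the variable names together and sampled are assumptions made for illustration only.

```r
# Illustrative sketch only (not the sharp implementation)
set.seed(1)

# Toy data: two well-separated groups of 10 observations, 2 variables
xdata <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 10), ncol = 2)
)
n <- nrow(xdata)
K <- 50    # number of subsampling iterations
tau <- 0.5 # proportion of items drawn in each subsample
nc <- 2    # number of clusters

together <- matrix(0, n, n) # iterations where two items share a cluster
sampled <- matrix(0, n, n)  # iterations where both items were drawn

for (k in seq_len(K)) {
  # Draw a subsample of the items (rows)
  ids <- sort(sample(n, size = floor(tau * n)))

  # Cluster the subsample with complete-linkage hierarchical clustering
  groups <- cutree(hclust(dist(xdata[ids, ]), method = "complete"), k = nc)

  # Record co-membership and co-sampling for the drawn pairs
  same <- outer(groups, groups, "==") * 1
  together[ids, ids] <- together[ids, ids] + same
  sampled[ids, ids] <- sampled[ids, ids] + 1
}

# Proportion of co-membership among iterations where both items were drawn
# (NaN for any pair never drawn together)
coprop <- together / sampled
```

With well-separated groups, as in this toy example, entries of coprop are expected to be close to 1 for pairs from the same group and close to 0 otherwise.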
Stable cluster membership is obtained by applying a distance-based
clustering method using (1-coprop) as distance (see Clusters).
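As a rough illustration of this step, the sketch below applies complete-linkage hierarchical clustering to (1-coprop) in base R. The consensus matrix values and item names are made up, and this is only an assumption about how such a step could look, not the code behind Clusters.

```r
# Hypothetical consensus matrix for 4 items (illustrative values only)
coprop <- matrix(
  c(
    1.0, 0.9, 0.1, 0.2,
    0.9, 1.0, 0.2, 0.1,
    0.1, 0.2, 1.0, 0.8,
    0.2, 0.1, 0.8, 1.0
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("item", 1:4), paste0("item", 1:4))
)

# Use (1 - coprop) as a distance between items
d <- as.dist(1 - coprop)

# Hierarchical clustering on that distance, cut into 2 clusters
tree <- hclust(d, method = "complete")
membership <- cutree(tree, k = 2)
print(membership)
```

Items that are frequently co-clustered across subsamples (here item1 with item2, and item3 with item4) end up in the same stable cluster.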
The parameters (the number of clusters and, in weighted clustering, the
penalty parameter) can be calibrated by maximisation of a stability score
(see ConsensusScore) calculated under the null hypothesis of equiprobability
of co-membership.
It is strongly recommended to examine the calibration plot (see
CalibrationPlot) to check that there is a clear maximum. The absence of a
clear maximum suggests that the clustering is not stable; consensus
clustering outputs should not be trusted in that case.
To ensure reproducibility of the results, the starting number of the random
number generator is set to seed.
For parallelisation, stability selection with different sets of parameters
can be run on n_cores cores. Using n_cores > 1 creates a multisession.
Value
An object of class clustering. A list with:
Sc |
a matrix of the best stability scores for different (sets of) parameters controlling the number of clusters and penalisation of attribute weights. |
nc |
a matrix of numbers of clusters. |
Lambda |
a matrix of regularisation parameters for attribute weights. |
Q |
a matrix of the average number of selected attributes by the underlying algorithm with different regularisation parameters. |
coprop |
an array of consensus matrices. Rows and columns correspond to items. Indices along the third dimension correspond to different parameters controlling the number of clusters and penalisation of attribute weights. |
selprop |
an array of selection proportions. Columns correspond to attributes. Rows correspond to different parameters controlling the number of clusters and penalisation of attribute weights. |
method |
a list with |
params |
a list with values used for arguments
|
The rows of Sc, nc, Lambda, Q, selprop and indices along the third
dimension of coprop are ordered in the same way and correspond to
parameter values stored in nc and Lambda.
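Because of this shared ordering, a single row index identifies one parameter setting across all outputs. The sketch below uses a mock list with made-up values in place of a real clustering object (the name out and the single-column shapes are assumptions) to show how the row with the highest score maps to the matching entries.

```r
# Mock object mimicking the documented structure (all values are made up)
set.seed(1)
n_items <- 4
out <- list(
  Sc = matrix(c(0.2, 0.7, 0.5), ncol = 1),
  nc = matrix(c(2, 3, 4), ncol = 1),
  Lambda = matrix(c(0.1, 0.1, 0.1), ncol = 1),
  coprop = array(runif(n_items * n_items * 3), dim = c(n_items, n_items, 3))
)

# Row index of the highest stability score
best <- which.max(out$Sc[, 1])

# The same index retrieves the matching parameters and consensus matrix
best_nc <- out$nc[best, 1]
best_lambda <- out$Lambda[best, 1]
best_coprop <- out$coprop[, , best]
```

In practice the functions listed in this page (for example Clusters and CalibrationPlot) are the intended way to work with calibrated outputs; the sketch is only meant to make the indexing convention concrete.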
References
Bodinier B, Vuckovic D, Rodrigues S, Filippi S, Chiquet J, Chadeau-Hyam M (2023). “Automated calibration of consensus weighted distance-based clustering approaches using sharp.” Bioinformatics, btad635. ISSN 1367-4811, doi:10.1093/bioinformatics/btad635, https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad635/52191190/btad635.pdf.
Kampert MM, Meulman JJ, Friedman JH (2017). “rCOSA: A Software Package for Clustering Objects on Subsets of Attributes.” Journal of Classification, 34(3), 514–547. doi:10.1007/s00357-017-9240-z.
Friedman JH, Meulman JJ (2004). “Clustering objects on subsets of attributes (with discussion).” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4), 815-849. doi:10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9868.2004.02059.x, https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9868.2004.02059.x.
Monti S, Tamayo P, Mesirov J, Golub T (2003). “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.” Machine Learning, 52(1), 91–118. doi:10.1023/A:1023949509487.
See Also
Resample, ConsensusScore, HierarchicalClustering, PAMClustering,
KMeansClustering, GMMClustering
Other stability functions: BiSelection(), GraphicalModel(),
StructuralModel(), VariableSelection()
Examples
# Consensus clustering
set.seed(1)
simul <- SimulateClustering(
  n = c(30, 30, 30), nu_xc = 1, ev_xc = 0.5
)
stab <- Clustering(xdata = simul$data)
print(stab)
CalibrationPlot(stab)
summary(stab)
Clusters(stab)
plot(stab)

# Consensus weighted clustering
if (requireNamespace("rCOSA", quietly = TRUE)) {
  set.seed(1)
  simul <- SimulateClustering(
    n = c(30, 30, 30), pk = 20,
    theta_xc = c(rep(1, 10), rep(0, 10)),
    ev_xc = 0.9
  )
  stab <- Clustering(
    xdata = simul$data,
    Lambda = LambdaSequence(lmin = 0.1, lmax = 10, cardinal = 10),
    noit = 20, niter = 10
  )
  print(stab)
  CalibrationPlot(stab)
  summary(stab)
  Clusters(stab)
  plot(stab)
  WeightBoxplot(stab)
}