klic {klic} | R Documentation |
Kernel learning integrative clustering
Description
This function performs Kernel Learning Integrative Clustering (KLIC) on M data sets that describe the same observations. The similarities between the observations in each data set are summarised in M kernels, which are then fed into a kernel k-means clustering algorithm. The output is a clustering of the observations that takes all the available data types into account, together with a set of weights that sum to one and indicate how much each data set contributed to the kernel k-means clustering.
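The idea described above can be illustrated with a small base R sketch. It is not the implementation used by klic: the helper `consensus_matrix`, the 80% subsampling rate, the equal weights and the final hierarchical clustering step (standing in for kernel k-means with learned weights) are all assumptions made for illustration.

```r
# Illustrative sketch only; not the klic implementation.
set.seed(1)

# Hypothetical helper: consensus matrix from B runs of k-means on random
# subsamples. The 80% subsampling proportion is an assumption.
consensus_matrix <- function(x, K, B = 50, prop = 0.8) {
  N <- nrow(x)
  together <- matrix(0, N, N)  # times i and j fell in the same cluster
  sampled  <- matrix(0, N, N)  # times i and j were subsampled together
  for (b in seq_len(B)) {
    idx <- sample(N, size = floor(prop * N))
    cl  <- kmeans(x[idx, , drop = FALSE], centers = K)$cluster
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)  # co-clustering proportions in [0, 1]
  diag(cm) <- 1
  cm
}

# Two toy datasets measured on the same N observations (two groups)
N <- 20
truth <- rep(1:2, each = N / 2)
data1 <- matrix(rnorm(N * 3, mean = truth * 3), N, 3)
data2 <- matrix(rnorm(N * 2, mean = truth * 3), N, 2)

# One kernel (consensus matrix) per dataset
kernels <- list(consensus_matrix(data1, K = 2),
                consensus_matrix(data2, K = 2))

# Equal weights that sum to one (klic learns these instead)
weights <- c(0.5, 0.5)
weightedKM <- Reduce(`+`, Map(`*`, weights, kernels))

# Stand-in for kernel k-means: hierarchical clustering on 1 - similarity
hc <- hclust(as.dist(1 - weightedKM), method = "average")
labels <- cutree(hc, k = 2)
```

Each entry of `weightedKM` is a weighted average of co-clustering proportions, so a dataset with a larger weight has more influence on which observations end up together.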
Usage
klic(
data,
M,
individualK = NULL,
individualMaxK = 6,
individualClAlgorithm = "kkmeans",
globalK = NULL,
globalMaxK = 6,
B = 1000,
C = 100,
scale = FALSE,
savePNG = FALSE,
fileName = "klic",
verbose = TRUE,
annotations = NULL,
ccClMethods = "kmeans",
ccDistHCs = "euclidean",
widestGap = FALSE,
dunns = FALSE,
dunn2s = FALSE
)
Arguments
data |
List of M datasets, each of size N X P_m, m = 1, ..., M. |
M |
Number of datasets. |
individualK |
Vector containing the number of clusters in each dataset. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and individualMaxK are considered and the best value is chosen for each dataset by maximising the silhouette. |
individualMaxK |
Maximum number of clusters considered for the individual data. Default is 6. |
individualClAlgorithm |
Clustering algorithm used to cluster each dataset individually when the best number of clusters has to be found. Default is "kkmeans". |
globalK |
Number of global clusters. Default is NULL. If the number of clusters is not provided, then all the possible values between 2 and globalMaxK are considered and the best value is chosen by maximising the silhouette. |
globalMaxK |
Maximum number of clusters considered for the final clustering. Default is 6. |
B |
Number of iterations for consensus clustering. Default is 1000. |
C |
Maximum number of iterations for localised kernel k-means. Default is 100. |
scale |
Boolean. If TRUE, each dataset is scaled so that each column has zero mean and unit variance. Default is FALSE. |
savePNG |
Boolean. If TRUE, a plot of the silhouette is saved in the working folder. Default is FALSE. |
fileName |
If |
verbose |
Boolean. Default is TRUE. |
annotations |
Data frame containing annotations for final plot. |
ccClMethods |
The i-th element of this vector goes into the
|
ccDistHCs |
The i-th element of this vector goes into the |
widestGap |
Boolean. If TRUE, the widest gap index is also computed to choose the best number of clusters. Default is FALSE. |
dunns |
Boolean. If TRUE, Dunn's index is also computed to choose the best number of clusters. Default is FALSE. |
dunn2s |
Boolean. If TRUE, the alternative Dunn's index is also computed to choose the best number of clusters. Default is FALSE. |
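When `individualK` or `globalK` is NULL, the number of clusters is chosen by maximising the silhouette over the values from 2 to the corresponding maximum. The sketch below shows that selection rule on toy data. It is an assumption-laden illustration rather than the package's code: it uses plain k-means on Euclidean distances instead of kernel k-means, relies on the `cluster` package being installed, and the names `maxK`, `avgSil` and `bestK` are hypothetical.

```r
# Illustrative sketch only; klic works on kernels, not on raw distances.
set.seed(42)

# Toy data: three well-separated groups of 15 observations in 2 dimensions
centres <- rep(c(0, 10, 20), each = 15)
x <- matrix(rnorm(45 * 2, mean = centres), ncol = 2)

maxK <- 6            # plays the role of individualMaxK / globalMaxK
candidateK <- 2:maxK
d <- dist(x)

# Average silhouette width for each candidate number of clusters
avgSil <- sapply(candidateK, function(k) {
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  mean(cluster::silhouette(cl, d)[, "sil_width"])
})

# Keep the number of clusters with the largest average silhouette
bestK <- candidateK[which.max(avgSil)]
```

Supplying `individualK` and `globalK` directly, as in the Examples section, skips this search.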
Value
The function returns a list containing:
consensusMatrices |
an array containing one consensus matrix per data set. |
weights |
a vector containing the weights assigned by the kernel k-means algorithm to each consensus matrix. |
weightedKM |
the weighted kernel matrix obtained by taking a weighted
sum of all kernels, where the weights are those specified in the
|
globalClusterLabels |
a vector containing the cluster labels of the observations, according to kernel k-means clustering done on the kernel matrices. |
bestK |
a vector containing the best number of clusters between 2 and
|
globalK |
the
best number of clusters for the final (global) clustering. This is chosen so
as to maximise the silhouette and only returned if the final number of
clusters |
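As a rough guide to working with the returned list, the sketch below inspects a hand-made stand-in that reuses three of the element names listed above. Every value in it is made up for illustration, and the object is not the result of a real call to `klic`.

```r
# Hypothetical stand-in for the list returned by klic(); all values are made up.
klicOutput <- list(
  weights = c(0.6, 0.3, 0.1),
  weightedKM = matrix(c(1.0, 0.2, 0.9, 0.1,
                        0.2, 1.0, 0.3, 0.8,
                        0.9, 0.3, 1.0, 0.2,
                        0.1, 0.8, 0.2, 1.0), nrow = 4, byrow = TRUE),
  globalClusterLabels = c(1, 2, 1, 2)
)

# The weights are documented as summing to one
stopifnot(isTRUE(all.equal(sum(klicOutput$weights), 1)))

# Index of the dataset that contributed most to the final clustering
mostInfluential <- which.max(klicOutput$weights)

# Number of observations in each global cluster
clusterSizes <- table(klicOutput$globalClusterLabels)

# Reorder the weighted kernel matrix so that observations in the same
# cluster are adjacent, which makes any block structure easier to see
ord <- order(klicOutput$globalClusterLabels)
orderedKM <- klicOutput$weightedKM[ord, ord]
```

A matrix reordered in this way can be passed to a plotting function such as `heatmap(orderedKM, Rowv = NA, Colv = NA, scale = "none")` to check visually whether the clusters form distinct blocks.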
Author(s)
Alessandra Cabassi alessandra.cabassi@mrc-bsu.cam.ac.uk
References
Cabassi, A. and Kirk, P. D. W. (2019). Multiple kernel learning for integrative consensus clustering of genomic datasets. arXiv preprint. arXiv:1904.07701.
Examples
if (requireNamespace("Rmosek", quietly = TRUE) &&
    (!is.null(utils::packageDescription("Rmosek")$Configured.MSK_VERSION))) {

  # Load synthetic data
  data1 <- as.matrix(read.csv(system.file('extdata',
    'dataset1.csv', package = 'klic'), row.names = 1))
  data2 <- as.matrix(read.csv(system.file('extdata',
    'dataset2.csv', package = 'klic'), row.names = 1))
  data3 <- as.matrix(read.csv(system.file('extdata',
    'dataset3.csv', package = 'klic'), row.names = 1))
  data <- list(data1, data2, data3)

  # Perform clustering with KLIC, assuming that the number of
  # clusters in each individual dataset and in the final
  # clustering is known
  klicOutput <- klic(data, 3, individualK = c(4, 4, 4),
    globalK = 4, B = 30, C = 5)

  # Extract cluster labels
  klic_labels <- klicOutput$globalClusterLabels

  # Load the true cluster labels
  cluster_labels <- as.matrix(read.csv(system.file('extdata',
    'cluster_labels.csv', package = 'klic'), row.names = 1))

  # Compute the adjusted Rand index (ARI) between the two labellings
  ari <- mclust::adjustedRandIndex(klic_labels, cluster_labels)
}
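For readers unfamiliar with the adjusted Rand index used at the end of the example, the following base R sketch computes it from the contingency table of two labellings using the standard Hubert and Arabie formula. The function name `adjusted_rand_index` is hypothetical, and the sketch is meant as an illustration of the quantity, not as a replacement for `mclust::adjustedRandIndex`.

```r
# Illustrative base R version of the adjusted Rand index (ARI).
adjusted_rand_index <- function(a, b) {
  a <- as.vector(a)
  b <- as.vector(b)
  tab <- table(a, b)                        # contingency table of the labellings
  sum_pairs <- function(v) sum(choose(v, 2))
  index    <- sum_pairs(tab)                # pairs grouped together in both
  rows     <- sum_pairs(rowSums(tab))       # pairs grouped together in a
  cols     <- sum_pairs(colSums(tab))       # pairs grouped together in b
  expected <- rows * cols / choose(length(a), 2)
  maximum  <- (rows + cols) / 2
  # Undefined (NaN) when both labellings put everything in one cluster
  (index - expected) / (maximum - expected)
}

# Identical partitions with different label names
same <- adjusted_rand_index(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))

# Partially agreeing partitions
partial <- adjusted_rand_index(c(1, 1, 1, 2, 2, 2), c(1, 1, 2, 2, 3, 3))
```

The index equals 1 when the two partitions coincide up to a relabelling of the clusters, and is close to 0 when the agreement is no better than expected by chance.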