dice {diceR} | R Documentation |
Diverse Clustering Ensemble
Description
Runs consensus clustering across subsamples, algorithms, and number of clusters (k).
Usage
dice(
data,
nk,
p.item = 0.8,
reps = 10,
algorithms = NULL,
k.method = NULL,
nmf.method = c("brunet", "lee"),
hc.method = "average",
distance = "euclidean",
cons.funs = c("kmodes", "majority", "CSPA", "LCE", "LCA"),
sim.mat = c("cts", "srs", "asrs"),
prep.data = c("none", "full", "sampled"),
min.var = 1,
seed = 1,
seed.data = 1,
trim = FALSE,
reweigh = FALSE,
n = 5,
evaluate = TRUE,
plot = FALSE,
ref.cl = NULL,
progress = TRUE
)
Arguments
data |
data matrix with rows as samples and columns as variables |
nk |
number of clusters (k) requested; can specify a single integer or a range of integers to compute multiple k |
p.item |
proportion of items to be used in subsampling within an algorithm |
reps |
number of subsamples |
algorithms |
vector of clustering algorithms for performing consensus clustering. Must be any number of the following: "nmf", "hc", "diana", "km", "pam", "ap", "sc", "gmm", "block", "som", "cmeans", "hdbscan". A custom clustering algorithm can be used. |
k.method |
determines the method to choose k when no reference class is
given. When ref.cl is not NULL, k is the number of distinct classes of ref.cl. Otherwise the input from nk is used. |
nmf.method |
specify NMF-based algorithms to run. By default the
"brunet" and "lee" algorithms are called. See NMF::nmf() for details. |
hc.method |
agglomeration method for hierarchical clustering. The
"average" method is used by default. See stats::hclust() for details. |
distance |
a vector of distance functions. Defaults to "euclidean".
Other options are given in stats::dist(). A custom distance function can be used. |
cons.funs |
consensus functions to use. Current options are "kmodes" (k-modes), "majority" (majority voting), "CSPA" (Cluster-based Similarity Partitioning Algorithm), "LCE" (linkage clustering ensemble), "LCA" (latent class analysis) |
sim.mat |
similarity matrix; choices are "cts", "srs", "asrs". |
prep.data |
Prepare the data on the "full" dataset, the "sampled" dataset, or "none" (default). |
min.var |
minimum variability measure threshold used to filter the
feature space for only highly variable features. Only features with a
minimum variability measure across all samples greater than min.var will be used. |
seed |
random seed for knn imputation reproducibility |
seed.data |
seed to use to ensure each algorithm operates on the same set of subsamples |
trim |
logical; if TRUE, algorithms that score low on internal indices will be trimmed out |
reweigh |
logical; if TRUE, after trimming out poor performing algorithms, each remaining algorithm is reweighed depending on its internal indices |
n |
an integer specifying the top n algorithms to keep after trimming off the poor performing ones using Rank Aggregation. If the total number of algorithms is less than n, no trimming is done. |
evaluate |
logical; if TRUE (default), validity indices are returned. Internal validity indices are always computed. If ref.cl is not NULL, external validity indices are also computed. |
plot |
logical; if TRUE, graph_all() is called to generate plots |
ref.cl |
reference class |
progress |
logical; should a progress bar be displayed? |
Details
There are three ways to handle the input data before clustering, controlled
by the argument prep.data
. The default, "none", uses the raw data as-is. Alternatively,
prepare_data() can be applied to the full dataset ("full") or to each
bootstrap-sampled dataset ("sampled").
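A minimal sketch of the three prep.data options, assuming the hgsc dataset shipped with diceR (the small subset and parameter values below are illustrative only):

```r
library(diceR)
data(hgsc)
dat <- hgsc[1:40, 1:30]  # small illustrative subset

# "none" (default): cluster the raw data as-is
d1 <- dice(dat, nk = 3, reps = 2, algorithms = "hc",
           cons.funs = "kmodes", prep.data = "none", progress = FALSE)

# "full": prepare the complete matrix once before subsampling
d2 <- dice(dat, nk = 3, reps = 2, algorithms = "hc",
           cons.funs = "kmodes", prep.data = "full", progress = FALSE)

# "sampled": prepare each bootstrap subsample separately
d3 <- dice(dat, nk = 3, reps = 2, algorithms = "hc",
           cons.funs = "kmodes", prep.data = "sampled", progress = FALSE)
```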
Value
A list with the following elements
E |
raw clustering ensemble object |
Eknn |
clustering ensemble object with knn imputation used on E |
Ecomp |
flattened ensemble object with remaining missing entries imputed by majority voting |
clusters |
final clustering assignment from the diverse clustering ensemble method |
indices |
if evaluate = TRUE, shows cluster evaluation indices; otherwise NULL |
Author(s)
Aline Talhouk, Derek Chiu
Examples
library(dplyr)
data(hgsc)
dat <- hgsc[1:100, 1:50]
ref.cl <- strsplit(rownames(dat), "_") %>%
purrr::map_chr(2) %>%
factor() %>%
as.integer()
dice.obj <- dice(dat, nk = 4, reps = 5, algorithms = "hc", cons.funs =
"kmodes", ref.cl = ref.cl, progress = FALSE)
str(dice.obj, max.level = 2)
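The list elements described under Value can be inspected directly. Continuing the example above, a sketch (the exact ensemble dimensions depend on reps, algorithms, and nk):

```r
table(dice.obj$clusters)  # distribution of final cluster assignments
dim(dice.obj$E)           # raw clustering ensemble
str(dice.obj$indices)     # evaluation indices (computed since evaluate = TRUE by default)
```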