iClusterVB {iClusterVB} | R Documentation |
Fast Integrative Clustering for High-Dimensional Multi-View Data Using Variational Bayesian Inference
Description
iClusterVB offers a novel, fast, and integrative approach to
clustering high-dimensional, mixed-type, and multi-view data. By employing
variational Bayesian inference, iClusterVB facilitates effective feature
selection and identification of disease subtypes, enhancing clinical
decision-making.
Usage
iClusterVB(
mydata,
dist,
K = 10,
initial_method = "VarSelLCM",
VS_method = 0,
initial_cluster = NULL,
initial_vs_prob = NULL,
initial_fit = NULL,
initial_omega = NULL,
input_hyper_parameters = NULL,
max_iter = 200,
early_stop = 1,
per = 10,
convergence_threshold = 1e-04
)
Arguments
mydata |
A list of length R, where R is the number of datasets, containing the input data. |
dist |
A vector of length R specifying the type of data or distribution. Options include: 'gaussian' (for continuous data), 'multinomial' (for binary or categorical data), and 'poisson' (for count data). |
K |
The maximum number of clusters, with a default value of 10. The algorithm will converge to a model with dominant clusters, removing redundant clusters and automating the determination of the number of clusters. |
initial_method |
The initialization method for cluster allocation. Options include: "VarSelLCM" (default), "random", "kproto" (k-prototypes), "kmeans" (continuous data only), "mclust" (continuous data only), or "lca" (poLCA, categorical data only). |
VS_method |
The variable/feature selection method. Options are 0 for clustering without variable/feature selection (default) and 1 for clustering with variable/feature selection. |
initial_cluster |
The initial cluster membership. The default is NULL, which uses initial_method for initial cluster allocation. If not NULL, the supplied membership overrides the allocation that initial_method would produce. |
initial_vs_prob |
The initial variable/feature selection probability, a scalar. The default is NULL, which assigns a value of 0.5. |
initial_fit |
Initial values based on a previously fitted iClusterVB model (an iClusterVB object). The default is NULL. |
initial_omega |
Customized initial values for feature inclusion probabilities. The default is NULL. If not NULL, it will override the initial values setting for this parameter. If VS_method = 1, initial_omega is a list of length R, with each element being an array with dimensions dim = c(N, p[[r]]). Here, N is the sample size and p[[r]] is the number of features for dataset r, where r = 1, ..., R. |
input_hyper_parameters |
A list of the initial hyper-parameters of the prior distributions for the model. The default is NULL, which assigns alpha_00 = 0.001, mu_00 = 0, s2_00 = 100, a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1. |
max_iter |
The maximum number of iterations for the VB algorithm. The default is 200. |
early_stop |
Whether to stop the algorithm upon convergence or to continue until max_iter is reached. The default is 1. |
per |
Print information every "per" iterations. The default is 10. |
convergence_threshold |
The convergence threshold for the change in ELBO. The default is 0.0001. |
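As a sketch of the shape described above for initial_omega, the following base-R snippet builds a list with one N x p[[r]] array per data view, filled with a constant starting probability. The toy data, the starting value 0.5, and the helper name make_initial_omega are assumptions made for illustration; they are not part of the package.

```r
# Hypothetical helper (not part of iClusterVB): build an initial_omega list
# with one N x p[[r]] array per data view, filled with a constant value.
make_initial_omega <- function(mydata, value = 0.5) {
  stopifnot(is.list(mydata), length(mydata) >= 1)
  n <- vapply(mydata, nrow, integer(1))
  # All views must describe the same N individuals.
  stopifnot(length(unique(n)) == 1)
  lapply(mydata, function(view) {
    array(value, dim = c(nrow(view), ncol(view)))
  })
}

# Toy data (assumed): R = 2 views, N = 5 individuals, p = c(3, 4) features.
set.seed(1)
toy_data <- list(
  gauss_1 = matrix(rnorm(15), nrow = 5, ncol = 3),
  poisson_1 = matrix(rpois(20, lambda = 2), nrow = 5, ncol = 4)
)
omega0 <- make_initial_omega(toy_data)
lapply(omega0, dim)
```

A list built this way could then be supplied as initial_omega = omega0 together with VS_method = 1.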
Value
The iClusterVB function creates an object (list) of class iClusterVB. Relevant outputs include:
elbo: |
The evidence lower bound for each iteration. |
cluster: |
The cluster assigned to each individual. |
initial_values: |
A list of the initial values. |
hyper_parameters: |
A list of the hyper-parameters. |
model_parameters: |
A list of the model parameters after the algorithm is run. |
Of particular interest is rho, a list of the posterior inclusion probabilities for the features in each of the data views. This is the probability of including a certain predictor in the model, given the observations. This is only available if VS_method = 1.
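To illustrate how the posterior inclusion probabilities might be used, the sketch below flags the features whose probability exceeds a cutoff. It assumes that rho is stored under model_parameters as a list with one numeric vector per data view, which is inferred from the description above and not verified against the package. The mock object, the helper select_features, and the 0.5 cutoff are hypothetical.

```r
# Hypothetical helper (not part of iClusterVB): return, for each data view,
# the indices of features whose inclusion probability exceeds `cutoff`.
select_features <- function(fit, cutoff = 0.5) {
  # Assumed location of rho inside the fitted object.
  rho <- fit$model_parameters$rho
  if (is.null(rho)) stop("rho is not available; was VS_method = 1 used?")
  lapply(rho, function(p) which(as.numeric(p) > cutoff))
}

# Mock object standing in for a fitted model (values are made up).
mock_fit <- list(
  model_parameters = list(
    rho = list(
      view_1 = c(0.95, 0.10, 0.80),
      view_2 = c(0.20, 0.99)
    )
  )
)
selected <- select_features(mock_fit)
selected
```

The cutoff is a judgment call; a stricter value keeps fewer features per view.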
Note
If any of the data views are "gaussian", please include them first, both in the input data mydata and correspondingly in the distribution vector dist. For example, use dist <- c("gaussian", "gaussian", "poisson", "multinomial"), and not dist <- c("poisson", "gaussian", "gaussian", "multinomial") or dist <- c("gaussian", "poisson", "gaussian", "multinomial").
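The ordering requirement in this note can be met programmatically. The following is a minimal base-R sketch, assuming mydata and dist are already aligned element by element; the helper gaussian_first and the toy views are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of iClusterVB): move "gaussian" views to the
# front of both mydata and dist, keeping the original order otherwise.
gaussian_first <- function(mydata, dist) {
  stopifnot(length(mydata) == length(dist))
  # FALSE (gaussian) sorts before TRUE; ties keep their original order.
  ord <- order(dist != "gaussian")
  list(mydata = mydata[ord], dist = dist[ord])
}

# Toy views (assumed) given in an order the note advises against.
views <- list(
  counts = matrix(1:4, 2, 2),
  cont_a = matrix(0, 2, 2),
  cats = matrix(1L, 2, 2),
  cont_b = matrix(0.5, 2, 2)
)
dist <- c("poisson", "gaussian", "multinomial", "gaussian")

reordered <- gaussian_first(views, dist)
reordered$dist
names(reordered$mydata)
```

Reordering both objects with the same index vector keeps each data view paired with its distribution.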
Examples
# sim_data comes with the iClusterVB package.
dat1 <- list(
  gauss_1 = sim_data$continuous1_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  gauss_2 = sim_data$continuous2_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  poisson_1 = sim_data$count_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  multinomial_1 = sim_data$binary_data[c(1:20, 61:80, 121:140, 181:200), 1:75]
)
# We re-code `0`s to `2`s
dat1$multinomial_1[dat1$multinomial_1 == 0] <- 2
dist <- c(
  "gaussian", "gaussian",
  "poisson", "multinomial"
)
# Note: run time grows with `max_iter`, the maximum number of VB iterations.
# For the purpose of testing the code, use a small value (e.g. 10).
# For more accurate results, use a larger value (e.g. 200).
fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 4,
  initial_method = "VarSelLCM",
  VS_method = 1,
  max_iter = 50
)
# We can obtain a summary using the summary() function
summary(fit_iClusterVB)