iClusterVB {iClusterVB}

R Documentation

Fast Integrative Clustering for High-Dimensional Multi-View Data Using Variational Bayesian Inference

Description

iClusterVB offers a novel, fast, and integrative approach to clustering high-dimensional, mixed-type, and multi-view data. By employing variational Bayesian inference, iClusterVB facilitates effective feature selection and identification of disease subtypes, enhancing clinical decision-making.

Usage

iClusterVB(
  mydata,
  dist,
  K = 10,
  initial_method = "VarSelLCM",
  VS_method = 0,
  initial_cluster = NULL,
  initial_vs_prob = NULL,
  initial_fit = NULL,
  initial_omega = NULL,
  input_hyper_parameters = NULL,
  max_iter = 200,
  early_stop = 1,
  per = 10,
  convergence_threshold = 1e-04
)

Arguments

`mydata`	A list of length R, where R is the number of datasets, containing the input data. Note: For categorical data, `0`s must be re-coded to another, non-`0` value.
`dist`	A vector of length R specifying the type of data or distribution. Options include: 'gaussian' (for continuous data), 'multinomial' (for binary or categorical data), and 'poisson' (for count data).
`K`	The maximum number of clusters, with a default value of 10. The algorithm will converge to a model with dominant clusters, removing redundant clusters and automating the determination of the number of clusters.
`initial_method`	The initialization method for cluster allocation. Options include: "VarSelLCM" (default), "random", "kproto" (k-prototypes), "kmeans" (continuous data only), "mclust" (continuous data only), or "lca" (poLCA, categorical data only).
`VS_method`	The variable/feature selection method. Options are 0 for clustering without variable/feature selection (default) and 1 for clustering with variable/feature selection.
`initial_cluster`	The initial cluster membership. The default is NULL, which uses initial_method for the initial cluster allocation. If not NULL, the supplied membership overrides the allocation that initial_method would otherwise produce.
`initial_vs_prob`	The initial variable/feature selection probability, a scalar. The default is NULL, which assigns a value of 0.5.
`initial_fit`	Initial values based on a previously fitted iClusterVB model (an iClusterVB object). The default is NULL.
`initial_omega`	Customized initial values for feature inclusion probabilities. The default is NULL. If not NULL, the supplied values override the default initial values for this parameter. If VS_method = 1, initial_omega is a list of length R, with each element being an array with dimensions dim = c(N, p[[r]]). Here, N is the sample size and p[[r]] is the number of features for dataset r, where r = 1, ..., R.
`input_hyper_parameters`	A list of the initial hyper-parameters of the prior distributions for the model. The default is NULL, which assigns alpha_00 = 0.001, mu_00 = 0, s2_00 = 100, a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1.
`max_iter`	The maximum number of iterations for the VB algorithm. The default is 200.
`early_stop`	Whether to stop the algorithm upon convergence or to continue until `max_iter` is reached. Options are 1 (default) to stop when the algorithm converges, and 0 to stop only when `max_iter` is reached.
`per`	Print information every "per" iterations. The default is 10.
`convergence_threshold`	The convergence threshold for the change in ELBO. The default is 0.0001.
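
The shapes expected by `initial_omega` and `input_hyper_parameters` can be illustrated with a minimal base-R sketch. The toy views, the starting value of 0.5, and the object names below are assumptions made for illustration. The hyper-parameter names and values are copied from the documented defaults; whether a partial list is accepted is not stated here, so the sketch supplies all of them.

```r
# Toy data standing in for real views (hypothetical; not from the package).
set.seed(1)
N <- 6  # sample size
toy_views <- list(
  view_gauss = matrix(rnorm(N * 3), nrow = N),             # 3 continuous features
  view_count = matrix(rpois(N * 2, lambda = 4), nrow = N)  # 2 count features
)

# p[[r]] is the number of features in view r.
p <- lapply(toy_views, ncol)

# One N x p[[r]] array per view, as described for `initial_omega`.
# Starting every entry at 0.5 is an illustrative choice, not a package default.
initial_omega <- lapply(p, function(p_r) array(0.5, dim = c(N, p_r)))

# Hyper-parameters restated from the documented defaults; edit values as needed.
input_hyper_parameters <- list(
  alpha_00 = 0.001, mu_00 = 0, s2_00 = 100,
  a_00 = 1, b_00 = 1, kappa_00 = 1, u_00 = 1, v_00 = 1
)

# Sanity checks on the shapes before passing these to iClusterVB().
stopifnot(length(initial_omega) == length(toy_views))
```

These objects would then be passed as `initial_omega = initial_omega` and `input_hyper_parameters = input_hyper_parameters`, together with `VS_method = 1`.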

Value

The iClusterVB function creates an object (list) of class iClusterVB. Relevant outputs include:

`elbo:`	The evidence lower bound for each iteration.
`cluster:`	The cluster assigned to each individual.
`initial_values:`	A list of the initial values.
`hyper_parameters:`	A list of the hyper-parameters.
`model_parameters:`	A list of the model parameters after the algorithm is run.

Of particular interest is rho, a list of the posterior inclusion probabilities for the features in each of the data views, that is, the probability that a given feature is included in the model given the observations. rho is only available if VS_method = 1.
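
As a rough sketch of how these outputs might be inspected, the code below works on a mock list rather than a real fit, so it runs without the package. Several details are assumptions: that rho is stored under `model_parameters`, that each element of rho behaves like a numeric vector, that convergence is judged on the absolute change in ELBO, and that 0.5 is a sensible inclusion cutoff. All numbers are made up.

```r
# Mock object mimicking the documented fields (hypothetical values).
mock_fit <- list(
  elbo = c(-1250.0, -1100.5, -1090.2, -1090.19995),
  cluster = c(1, 1, 2, 2, 2, 3),
  model_parameters = list(
    rho = list(
      c(0.98, 0.02, 0.75),  # view 1: three features
      c(0.10, 0.91)         # view 2: two features
    )
  )
)

# Change in ELBO between consecutive iterations (absolute change assumed).
elbo_change <- abs(diff(mock_fit$elbo))
last_change <- tail(elbo_change, 1)
converged <- last_change < 1e-04  # compare with `convergence_threshold`

# Number of individuals assigned to each cluster.
cluster_sizes <- table(mock_fit$cluster)

# Indices of features whose inclusion probability exceeds the cutoff, per view.
cutoff <- 0.5  # illustrative choice
selected <- lapply(mock_fit$model_parameters$rho, function(r) which(r > cutoff))
```

With a real fit, `mock_fit` would be replaced by the object returned by `iClusterVB()`, after confirming the actual layout with `str()`.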

Note

If any of the data views are "gaussian", please include them first, both in the input data mydata and correspondingly in the distribution vector dist. For example, use dist <- c("gaussian", "gaussian", "poisson", "multinomial"), and not dist <- c("poisson", "gaussian", "gaussian", "multinomial") or dist <- c("gaussian", "poisson", "gaussian", "multinomial").
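
One way to satisfy this ordering requirement is to reorder the views and the distribution vector together before fitting. The sketch below uses base R only; the view names and the tiny placeholder matrices are hypothetical.

```r
# Views in an order that violates the requirement (placeholder matrices).
views <- list(
  counts = matrix(1:4, nrow = 2),
  expr   = matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2),
  mut    = matrix(c(1, 2, 2, 1), nrow = 2),
  methyl = matrix(c(0.5, 0.6, 0.7, 0.8), nrow = 2)
)
dist <- c("poisson", "gaussian", "multinomial", "gaussian")

# Basic consistency checks: one distribution per view, only documented options.
stopifnot(
  length(views) == length(dist),
  all(dist %in% c("gaussian", "multinomial", "poisson"))
)

# order() on a logical vector puts FALSE (gaussian) before TRUE (everything
# else) and leaves ties in their original order, so relative order is kept.
ord <- order(dist != "gaussian")

mydata_ordered <- views[ord]
dist_ordered <- dist[ord]
```

Applying the same index `ord` to both objects keeps each view aligned with its distribution.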

Examples

# sim_data comes with the iClusterVB package.
dat1 <- list(
  gauss_1 = sim_data$continuous1_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  gauss_2 = sim_data$continuous2_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  poisson_1 = sim_data$count_data[c(1:20, 61:80, 121:140, 181:200), 1:75],
  multinomial_1 = sim_data$binary_data[c(1:20, 61:80, 121:140, 181:200), 1:75]
)

# We re-code `0`s to `2`s

dat1$multinomial_1[dat1$multinomial_1 == 0] <- 2

dist <- c(
  "gaussian", "gaussian",
  "poisson", "multinomial"
)

# Note: Run time grows with `max_iter`.
# For the purpose of testing the code, use a small value (e.g. 10).
# For more accurate results, use a larger value (e.g. 200).

fit_iClusterVB <- iClusterVB(
  mydata = dat1,
  dist = dist,
  K = 4,
  initial_method = "VarSelLCM",
  VS_method = 1,
  max_iter = 50
)

# We can obtain a summary using the summary() function
summary(fit_iClusterVB)

[Package iClusterVB version 0.1.1 Index]