sim_data {iClusterVB}    R Documentation

Simulated Dataset

Description

The dataset consists of N = 240 individuals and R = 4 data views with different data types. Two of the data views are continuous, one is binary, and one is count. The true number of clusters was set to K = 4, with balanced cluster proportions \pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25. Each data view had p_r = 500 features, r = 1, \dots, 4, but only 50 (10%) were relevant features that contributed to the clustering; the rest were noise features that did not. In total, there were p = \sum_{r=1}^4 p_r = 2000 features.

For data view 1 (continuous), relevant features were generated from the following normal distributions: \text{N}(10, 1) for Cluster 1, \text{N}(5, 1) for Cluster 2, \text{N}(-5, 1) for Cluster 3, and \text{N}(-10, 1) for Cluster 4, while noise features were generated from \text{N}(0, 1).

For data view 2 (continuous), relevant features were generated from the following normal distributions: \text{N}(-10, 1) for Cluster 1, \text{N}(-5, 1) for Cluster 2, \text{N}(5, 1) for Cluster 3, and \text{N}(10, 1) for Cluster 4, while noise features were generated from \text{N}(0, 1).

For data view 3 (binary), relevant features were generated from the following Bernoulli distributions: \text{Bernoulli}(0.05) for Cluster 1, \text{Bernoulli}(0.2) for Cluster 2, \text{Bernoulli}(0.4) for Cluster 3, and \text{Bernoulli}(0.6) for Cluster 4, while noise features were generated from \text{Bernoulli}(0.1).

For data view 4 (count), relevant features were generated from the following Poisson distributions: \text{Poisson}(50) for Cluster 1, \text{Poisson}(35) for Cluster 2, \text{Poisson}(20) for Cluster 3, and \text{Poisson}(10) for Cluster 4, while noise features were generated from \text{Poisson}(2).
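The design described above can be sketched in base R. This is an illustrative re-creation under stated assumptions, not the script used to build sim_data: the seed, the helper make_view, the names view1 to view4, the ordering of individuals by cluster, and the placement of the relevant features in the first 50 columns are all hypothetical choices.

```r
# Hypothetical re-creation of the simulation design; not the package's own script.
set.seed(1)  # arbitrary seed (assumption)

N <- 240      # individuals
K <- 4        # clusters
p_r <- 500    # features per data view
p_rel <- 50   # relevant features per data view (10%)

# Balanced clusters (pi_k = 0.25): 60 individuals each, sorted by cluster (assumption).
cluster <- rep(seq_len(K), each = N / K)

# Build one N x p_r view. The first p_rel columns are relevant (assumption),
# the remaining columns are noise. draw(n, theta) returns n random values.
make_view <- function(draw, theta_cluster, theta_noise) {
  # Each relevant column uses the parameter of every individual's cluster.
  theta_rel <- rep(theta_cluster[cluster], times = p_rel)
  relevant <- matrix(draw(N * p_rel, theta_rel), nrow = N)
  noise <- matrix(draw(N * (p_r - p_rel), theta_noise), nrow = N)
  cbind(relevant, noise)
}

draw_normal <- function(n, m) rnorm(n, mean = m, sd = 1)
draw_binary <- function(n, pr) rbinom(n, size = 1, prob = pr)
draw_count <- function(n, l) rpois(n, lambda = l)

views <- list(
  view1 = make_view(draw_normal, c(10, 5, -5, -10), 0),
  view2 = make_view(draw_normal, c(-10, -5, 5, 10), 0),
  view3 = make_view(draw_binary, c(0.05, 0.2, 0.4, 0.6), 0.1),
  view4 = make_view(draw_count, c(50, 35, 20, 10), 2)
)

# Total number of features across the four views.
p <- sum(sapply(views, ncol))
```

Because the draws depend on the seed and on these layout assumptions, the resulting matrices will not reproduce the values stored in sim_data; the sketch only mirrors the distributions and dimensions given in the description.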

Usage

data(sim_data)

Format

A list containing four datasets and other elements of interest.
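A minimal sketch of loading and inspecting the object, assuming the iClusterVB package is installed. The names of the list elements are not documented on this page, so the code only prints the top-level structure rather than accessing any element by name.

```r
# Guarded so the snippet does nothing when iClusterVB is not installed.
has_pkg <- requireNamespace("iClusterVB", quietly = TRUE)

if (has_pkg) {
  # Load the dataset from the package into the calling environment.
  data(sim_data, package = "iClusterVB")

  # Show only the top-level elements of the list.
  str(sim_data, max.level = 1)
}
```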


[Package iClusterVB version 0.1.1 Index]