sim_data {iClusterVB}    R Documentation

Simulated Dataset

Description

The dataset consists of $N = 240$ individuals and $R = 4$ data views of different data types: two views are continuous, one is binary, and one is count. The true number of clusters was set to $K = 4$, with balanced cluster proportions $\pi_1 = \pi_2 = \pi_3 = \pi_4 = 0.25$. Each data view had $p_r = 500$ features, $r = 1, \dots, 4$, of which only 50 (10%) were relevant features that contributed to the clustering; the remaining 450 were noise features that did not. In total, there were $p = \sum_{r=1}^4 p_r = 2000$ features.

For data view 1 (continuous), relevant features were generated from the following normal distributions: $\text{N}(10, 1)$ for Cluster 1, $\text{N}(5, 1)$ for Cluster 2, $\text{N}(-5, 1)$ for Cluster 3, and $\text{N}(-10, 1)$ for Cluster 4, while noise features were generated from $\text{N}(0, 1)$.

For data view 2 (continuous), relevant features were generated from the following normal distributions: $\text{N}(-10, 1)$ for Cluster 1, $\text{N}(-5, 1)$ for Cluster 2, $\text{N}(5, 1)$ for Cluster 3, and $\text{N}(10, 1)$ for Cluster 4, while noise features were generated from $\text{N}(0, 1)$.

For data view 3 (binary), relevant features were generated from the following Bernoulli distributions: $\text{Bernoulli}(0.05)$ for Cluster 1, $\text{Bernoulli}(0.2)$ for Cluster 2, $\text{Bernoulli}(0.4)$ for Cluster 3, and $\text{Bernoulli}(0.6)$ for Cluster 4, while noise features were generated from $\text{Bernoulli}(0.1)$.

For data view 4 (count), relevant features were generated from the following Poisson distributions: $\text{Poisson}(50)$ for Cluster 1, $\text{Poisson}(35)$ for Cluster 2, $\text{Poisson}(20)$ for Cluster 3, and $\text{Poisson}(10)$ for Cluster 4, while noise features were generated from $\text{Poisson}(2)$.
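The generating scheme described above can be sketched in base R as follows. This is an illustrative sketch, not the code used to create sim_data. It makes several assumptions: the seed is arbitrary, individuals are ordered by cluster with exactly 60 per cluster, the relevant features occupy the first 50 columns of each view, and the helper name simulate_view is hypothetical.

```r
set.seed(1)  # arbitrary seed, chosen only so this sketch is reproducible

N     <- 240   # individuals
K     <- 4     # true clusters
p_r   <- 500   # features per data view
n_rel <- 50    # relevant features per data view (10%)

# Assumption: exactly balanced clusters, individuals ordered by cluster.
cluster <- rep(seq_len(K), each = N / K)

# Hypothetical helper: build an N x p_r matrix whose first n_rel columns are
# relevant (parameter depends on cluster) and whose remaining columns are noise.
# `draw(n, par)` must return n draws, vectorised over `par`.
simulate_view <- function(draw, rel_par, noise_par) {
  # Matrices fill column by column, so repeat the per-individual parameter
  # vector once for each relevant column.
  par_vec  <- rep(rel_par[cluster], times = n_rel)
  relevant <- matrix(draw(N * n_rel, par_vec), nrow = N)
  noise    <- matrix(draw(N * (p_r - n_rel), noise_par), nrow = N)
  cbind(relevant, noise)
}

draw_normal  <- function(n, m)  rnorm(n, mean = m, sd = 1)
draw_binary  <- function(n, pr) rbinom(n, size = 1, prob = pr)
draw_poisson <- function(n, l)  rpois(n, lambda = l)

view1 <- simulate_view(draw_normal,  c(10, 5, -5, -10),      0)    # continuous
view2 <- simulate_view(draw_normal,  c(-10, -5, 5, 10),      0)    # continuous
view3 <- simulate_view(draw_binary,  c(0.05, 0.2, 0.4, 0.6), 0.1)  # binary
view4 <- simulate_view(draw_poisson, c(50, 35, 20, 10),      2)    # count

views <- list(view1, view2, view3, view4)
total_features <- sum(vapply(views, ncol, integer(1)))
```

Because the second parameter of a normal distribution here is 1, it makes no difference whether it is read as a variance or a standard deviation.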

Usage

data(sim_data)

Format

A list containing four datasets, and other elements of interest.
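Since the element names of the list are not given here, one way to inspect it without assuming them is to tabulate the name, class, and dimensions of each top-level element. The helper summarise_elements below is a hypothetical convenience function written for this sketch, not part of iClusterVB; the final block runs only if the package is installed.

```r
# Hypothetical helper: one row per top-level list element, giving its name,
# first class, and dimensions (or length when it has no dimensions).
summarise_elements <- function(x) {
  element_names <- if (is.null(names(x))) as.character(seq_along(x)) else names(x)
  data.frame(
    element = element_names,
    class = vapply(x, function(e) class(e)[1], character(1), USE.NAMES = FALSE),
    dims = vapply(x, function(e) {
      d <- dim(e)
      if (is.null(d)) as.character(length(e)) else paste(d, collapse = " x ")
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Load and summarise the packaged dataset, if iClusterVB is available.
if (requireNamespace("iClusterVB", quietly = TRUE)) {
  data("sim_data", package = "iClusterVB")
  print(summarise_elements(sim_data))
}
```

Calling str(sim_data, max.level = 1) gives a similar overview using only base R.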


[Package iClusterVB version 0.1.1 Index]