sim_data {iClusterVB} | R Documentation
Simulated Dataset
Description
The dataset consists of N = 240 individuals and R = 4 data views with
different data types. Two of the data views are continuous, one is
count, and one is binary. The true number of clusters was set to K = 4,
and the cluster proportions were set at \pi_1 = 0.25, \pi_2 = 0.25,
\pi_3 = 0.25, \pi_4 = 0.25, so that the cluster proportions are
balanced. Each of the data views had p_r = 500 features,
r = 1, \dots, 4, but only 50, or 10%, were relevant features that
contributed to the clustering; the rest were noise features that did
not. In total, there were p = \sum_{r=1}^4 p_r = 2000 features.
For data view 1 (continuous), relevant features were generated from the
following normal distributions: \text{N}(10, 1) for Cluster 1,
\text{N}(5, 1) for Cluster 2, \text{N}(-5, 1) for Cluster 3, and
\text{N}(-10, 1) for Cluster 4, while noise features were generated
from \text{N}(0, 1). For data view 2 (continuous), relevant features
were generated from the following normal distributions:
\text{N}(-10, 1) for Cluster 1, \text{N}(-5, 1) for Cluster 2,
\text{N}(5, 1) for Cluster 3, and \text{N}(10, 1) for Cluster 4, while
noise features were generated from \text{N}(0, 1).
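The design of the two continuous views can be illustrated with a short
R sketch. This is not the code used to build sim_data; it only
reproduces the distributions stated above. The helper name
simulate_normal_view, the seed, the choice to place the relevant
features in the first 50 columns, and the assignment of exactly 60
individuals to each cluster are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, not taken from the package

N <- 240          # individuals
K <- 4            # clusters
p_r <- 500        # features per data view
n_relevant <- 50  # relevant features per data view

# Assumption: balanced clusters are built as exactly N / K labels each.
z <- rep(seq_len(K), each = N / K)

# Cluster means of the relevant features, as stated in the description.
mu_view1 <- c(10, 5, -5, -10)
mu_view2 <- c(-10, -5, 5, 10)

# Hypothetical helper: noise columns are N(0, 1); the first n_relevant
# columns (an assumed position) are N(mu[cluster], 1).
simulate_normal_view <- function(z, mu, p, n_relevant) {
  n <- length(z)
  x <- matrix(rnorm(n * p, mean = 0, sd = 1), nrow = n, ncol = p)
  # mu[z] has length n and is recycled column by column, so row i
  # always receives the mean of its own cluster.
  x[, seq_len(n_relevant)] <- rnorm(n * n_relevant, mean = mu[z], sd = 1)
  x
}

view1 <- simulate_normal_view(z, mu_view1, p_r, n_relevant)
view2 <- simulate_normal_view(z, mu_view2, p_r, n_relevant)

dim(view1)
```

Because the second parameter is 1, it makes no difference here whether
\text{N}(\mu, 1) is read as a variance or a standard deviation.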
For data view 3 (binary), relevant features were generated from the
following Bernoulli distributions: \text{Bernoulli}(0.05) for Cluster 1,
\text{Bernoulli}(0.2) for Cluster 2, \text{Bernoulli}(0.4) for
Cluster 3, and \text{Bernoulli}(0.6) for Cluster 4, while noise features
were generated from \text{Bernoulli}(0.1). For data view 4 (count),
relevant features were generated from the following Poisson
distributions: \text{Poisson}(50) for Cluster 1, \text{Poisson}(35) for
Cluster 2, \text{Poisson}(20) for Cluster 3, and \text{Poisson}(10) for
Cluster 4, while noise features were generated from \text{Poisson}(2).
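The binary and count views can be sketched in the same way. As before,
this is an illustration of the stated distributions rather than the
code behind sim_data; the seed, the balanced label vector z, and the
placement of the relevant features in the first 50 columns are
assumptions.

```r
set.seed(2)  # arbitrary seed, not taken from the package

N <- 240
K <- 4
p_r <- 500
n_relevant <- 50

# Assumption: exactly N / K individuals per cluster.
z <- rep(seq_len(K), each = N / K)

# Cluster-specific parameters of the relevant features.
prob_view3   <- c(0.05, 0.2, 0.4, 0.6)  # Bernoulli success probabilities
lambda_view4 <- c(50, 35, 20, 10)       # Poisson rates

# Data view 3 (binary): Bernoulli(0.1) noise, then overwrite the first
# n_relevant columns (assumed position) with cluster-specific draws.
view3 <- matrix(rbinom(N * p_r, size = 1, prob = 0.1), nrow = N, ncol = p_r)
view3[, seq_len(n_relevant)] <-
  rbinom(N * n_relevant, size = 1, prob = prob_view3[z])

# Data view 4 (count): Poisson(2) noise, relevant columns as above.
view4 <- matrix(rpois(N * p_r, lambda = 2), nrow = N, ncol = p_r)
view4[, seq_len(n_relevant)] <-
  rpois(N * n_relevant, lambda = lambda_view4[z])

# Per-cluster averages over the relevant features; these should be
# close to the parameters listed above.
tapply(rowMeans(view3[, seq_len(n_relevant)]), z, mean)
tapply(rowMeans(view4[, seq_len(n_relevant)]), z, mean)
```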
Usage
data(sim_data)
Format
A list containing four datasets, and other elements of interest.
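
Since the element names of the list are not given here, a cautious way
to explore the object is to inspect its top-level structure after
loading it. The sketch below assumes that the iClusterVB package is
installed and does nothing otherwise; it makes no assumption about what
the elements are called.

```r
# Check for the package without attaching it.
has_pkg <- requireNamespace("iClusterVB", quietly = TRUE)

if (has_pkg) {
  data(sim_data, package = "iClusterVB")

  # Show only the top level of the list: one line per element.
  str(sim_data, max.level = 1)

  # Element names, for use in subsequent indexing.
  print(names(sim_data))
}
```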