R: Simulation of data with underlying clusters

SimulateClustering {fake}

R Documentation

Simulation of data with underlying clusters

Description

Simulates data from a mixture of multivariate Normal distributions, with clusters of items (rows) sharing similar profiles along (a subset of) attributes (columns).

Usage

SimulateClustering(
  n = c(10, 10),
  pk = 10,
  sigma = NULL,
  theta_xc = NULL,
  nu_xc = 1,
  ev_xc = 0.5,
  output_matrices = FALSE
)

Arguments

`n`	vector of the number of items per cluster in the simulated data. The total number of items is `sum(n)`.
`pk`	vector of the number of attributes in the simulated data. The total number of attributes is `sum(pk)`.
`sigma`	optional within-cluster correlation matrix.
`theta_xc`	optional binary matrix encoding which attributes (columns) contribute to the clustering structure between which clusters (rows). If `theta_xc=NULL`, variables contributing to the clustering are sampled with probability `nu_xc`.
`nu_xc`	expected proportion of variables contributing to the clustering over the total number of variables. Only used if `theta_xc` is not provided.
`ev_xc`	vector of expected proportions of variance in each of the contributing attributes that can be explained by the clustering.
`output_matrices`	logical indicating if the cluster- and attribute-specific means and the cluster-specific covariance matrix should be included in the output.
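
To illustrate how `theta_xc` and `nu_xc` relate, the base-R sketch below draws one contribution indicator per attribute with probability `nu_xc`. It is an assumption about what the sampling step could look like, not the package's internal code. The names `pk`, `nu_xc` and `theta_xc` only mirror the arguments above, and the values are illustrative.

```r
# Hypothetical illustration, not the package's implementation:
# each of the pk attributes contributes independently with probability nu_xc.
set.seed(1)
pk <- 10
nu_xc <- 0.3
theta_xc <- rbinom(pk, size = 1, prob = nu_xc)

theta_xc        # binary vector of length pk
mean(theta_xc)  # realised proportion of contributors; its expectation is nu_xc
```

With the default `nu_xc = 1`, every attribute would be drawn as a contributor under this sketch.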

Details

The data is simulated from a Gaussian mixture where for all i \in {1, \dots, n}:

Z_i i.i.d. ~ M(1, \kappa)

X_i | Z_i indep. ~ N_p(\mu_{Z_i}, \Sigma)

where M(1, \kappa) is the multinomial distribution with parameters 1 and \kappa, the vector of length G (the number of clusters) containing the probabilities of belonging to each of the clusters, and N_p(\mu_{Z_i}, \Sigma) is the multivariate Normal distribution whose mean vector \mu_{Z_i} depends on the cluster membership encoded in Z_i and whose covariance matrix \Sigma is the same within all G clusters.
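
The two sampling steps above can be sketched in base R as follows. This is a minimal illustration of the model, not the code used by `SimulateClustering`. The values of `kappa`, `mu` and `Sigma` are hypothetical and chosen only for the example.

```r
# Minimal base-R sketch of the Gaussian mixture (illustrative values only).
set.seed(1)
n <- 200                    # number of items
p <- 3                      # number of attributes
kappa <- c(0.5, 0.3, 0.2)   # cluster membership probabilities (length G)
G <- length(kappa)

# Hypothetical cluster-specific mean vectors (one row per cluster)
mu <- rbind(c(0, 0, 0), c(3, 0, 0), c(0, 3, 0))

# Hypothetical covariance matrix shared by all clusters
Sigma <- diag(p)
Sigma[1, 2] <- Sigma[2, 1] <- 0.4

# Z_i ~ M(1, kappa): draw one cluster label per item
z <- sample(seq_len(G), size = n, replace = TRUE, prob = kappa)

# X_i | Z_i ~ N_p(mu_{Z_i}, Sigma): add correlated Gaussian noise to the
# mean of the cluster each item belongs to. chol(Sigma) is upper triangular
# with t(R) %*% R = Sigma, so the rows of the product have covariance Sigma.
noise <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
x <- mu[z, , drop = FALSE] + noise

table(z)  # realised cluster sizes
dim(x)    # n rows (items) by p columns (attributes)
```

Note that `SimulateClustering` takes the cluster sizes through the argument `n`, whereas this sketch draws the cluster labels at random.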

The mean vectors \mu_{g}, g \in {1, \dots, G} are simulated so that the desired proportion of variance in each of the attributes explained by the clustering (argument ev_xc) is reached.

The covariance matrix \Sigma is obtained by re-scaling a correlation matrix (argument sigma) to ensure that the desired proportions of variance explained by the clustering (argument ev_xc) are reached.
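
One way to see how re-scaling can reach a target proportion of explained variance is the law of total variance. For attribute j, the total variance is the between-cluster variance of the means plus the within-cluster variance \Sigma_{jj}, and the explained proportion is the between-cluster part divided by the total. The sketch below solves this for \Sigma_{jj} and re-scales a correlation matrix accordingly. It is an assumption-based illustration with hypothetical values (`kappa`, `mu`, `ev`, `C`), not the package's actual procedure.

```r
# Sketch under stated assumptions; not taken from the package source.
kappa <- c(0.5, 0.3, 0.2)                 # cluster probabilities
mu <- rbind(c(0, 1), c(2, 1), c(-1, 4))   # hypothetical cluster means (G x p)
ev <- c(0.8, 0.5)                         # targets, strictly between 0 and 1
C <- matrix(c(1, 0.3, 0.3, 1), 2, 2)      # within-cluster correlation matrix

# Between-cluster variance of the means, attribute by attribute
mu_bar <- colSums(kappa * mu)
between <- colSums(kappa * sweep(mu, 2, mu_bar)^2)

# Solve ev = between / (between + within) for the within-cluster variance
within <- between * (1 - ev) / ev

# Re-scale the correlation matrix into a covariance matrix
Sigma <- diag(sqrt(within)) %*% C %*% diag(sqrt(within))

# Check: the implied proportions of explained variance match the targets
between / (between + diag(Sigma))
```

The formula divides by `ev`, so it does not apply to attributes with a target of 0. Those attributes have no between-cluster variance and would need to be handled separately.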

Value

A list with:

`data`	simulated data with `sum(n)` observations and `sum(pk)` variables.
`theta`	simulated (true) cluster membership.
`theta_xc`	binary vector encoding variables contributing to the clustering structure.
`ev`	vector of marginal expected proportions of explained variance for each variable.
`mu_mixture`	simulated (true) cluster-specific means. Only returned if `output_matrices=TRUE`.
`sigma`	simulated (true) covariance matrix. Only returned if `output_matrices=TRUE`.

Examples

oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

## Example with 3 clusters

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(10, 30, 15),
  nu_xc = 1,
  ev_xc = 0.5
)
print(simul)
plot(simul)

# Checking the proportion of explained variance
x <- simul$data[, 1]
z <- as.factor(simul$theta)
summary(lm(x ~ z)) # R-squared


## Example with 2 variables contributing to clustering

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = 0.8
)
print(simul)
plot(simul)

# Visualisation of the data
Heatmap(
  mat = simul$data,
  col = c("navy", "white", "red")
)
simul$ev # marginal proportions of explained variance

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)


## Example with different levels of separation

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.99, 0.5, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)


## Example with correlated contributors

# Data simulation
pk <- 10
adjacency <- matrix(0, pk, pk)
adjacency[1, 2] <- adjacency[2, 1] <- 1
set.seed(1)
sigma <- SimulateCorrelation(
  pk = pk,
  theta = adjacency,
  pd_strategy = "min_eigenvalue",
  v_within = 0.6, v_sign = -1
)$sigma
simul <- SimulateClustering(
  n = c(200, 100, 150), pk = pk, sigma = sigma,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.9, 0.8, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)

# Checking marginal proportions of explained variance
mymodel <- lm(simul$data[, 1] ~ as.factor(simul$theta))
summary(mymodel)$r.squared
mymodel <- lm(simul$data[, 2] ~ as.factor(simul$theta))
summary(mymodel)$r.squared

par(oldpar)

[Package fake version 1.4.0 Index]