SimulateClustering {fake}    R Documentation

Simulation of data with underlying clusters

Description

Simulates data from a mixture of multivariate Normal distributions, with clusters of items (rows) sharing similar profiles along (a subset of) attributes (columns).

Usage

SimulateClustering(
  n = c(10, 10),
  pk = 10,
  sigma = NULL,
  theta_xc = NULL,
  nu_xc = 1,
  ev_xc = 0.5,
  output_matrices = FALSE
)

Arguments

n

vector of the number of items per cluster in the simulated data. The total number of items is sum(n).

pk

vector of the number of attributes in the simulated data. The total number of attributes is sum(pk).

sigma

optional within-cluster correlation matrix.

theta_xc

optional binary matrix encoding which attributes (columns) contribute to the clustering structure between which clusters (rows). If theta_xc=NULL, variables contributing to the clustering are sampled with probability nu_xc.

nu_xc

expected proportion of variables contributing to the clustering over the total number of variables. Only used if theta_xc is not provided.

ev_xc

vector of expected proportions of variance in each of the contributing attributes that can be explained by the clustering.

output_matrices

logical indicating if the cluster- and attribute-specific means and the cluster-specific covariance matrix should be included in the output.

Details

The data are simulated from a Gaussian mixture where, for all items i \in {1, \dots, n}, with n denoting here the total number of items (sum(n) in terms of the function argument):

Z_i i.i.d. ~ M(1, \kappa)

X_i | Z_i indep. ~ N_p(\mu_{Z_i}, \Sigma)

where M(1, \kappa) is the multinomial distribution with parameters 1 and \kappa, the vector of length G (the number of clusters) with probabilities of belonging to each of the clusters, and N_p(\mu_{Z_i}, \Sigma) is the multivariate Normal distribution with a mean vector \mu_{Z_i} that depends on the cluster membership encoded in Z_i and the same covariance matrix \Sigma within all G clusters.
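
The generative model above can be sketched in base R as follows. This is an illustration only, not the code used by SimulateClustering(): the values of kappa, mu and Sigma are arbitrary assumptions chosen for the example, and the multivariate Normal draws are obtained from the Cholesky factor of Sigma.

```r
# Minimal base-R sketch of the Gaussian mixture (illustrative values only)
set.seed(1)
n_items <- 500               # total number of items
p <- 3                       # number of attributes
kappa <- c(0.2, 0.5, 0.3)    # assumed cluster probabilities (length G)
G <- length(kappa)

# Assumed cluster-specific mean vectors (one row per cluster)
mu <- rbind(c(0, 0, 0), c(3, 0, 0), c(0, 3, 0))

# Assumed covariance matrix, shared by all G clusters
Sigma <- diag(p)
Sigma[1, 2] <- Sigma[2, 1] <- 0.5

# Z_i ~ M(1, kappa): cluster label of each item
z <- sample(seq_len(G), size = n_items, replace = TRUE, prob = kappa)

# X_i | Z_i ~ N_p(mu_{Z_i}, Sigma), using Sigma = t(R) %*% R
R <- chol(Sigma)
noise <- matrix(rnorm(n_items * p), nrow = n_items, ncol = p) %*% R
x <- mu[z, , drop = FALSE] + noise
```

Here x plays the role of the simulated data matrix and z that of the (true) cluster membership.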

The mean vectors \mu_{g}, g \in {1, \dots, G} are simulated so that the desired proportion of variance in each of the attributes explained by the clustering (argument ev_xc) is reached.

The covariance matrix \Sigma is obtained by re-scaling a correlation matrix (argument sigma) to ensure that the desired proportions of variance explained by the clustering (argument ev_xc) are reached.
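
One way to see how such a re-scaling can work is sketched below in base R. It is an assumption-laden illustration, not the procedure implemented in the package: for attribute j, the proportion of explained variance is taken as v_between / (v_between + v_within), where v_between is the variance of the cluster means weighted by cluster proportions and v_within is the j-th diagonal entry of \Sigma. The names mu, C, v_between and v_within are hypothetical.

```r
# Sketch: re-scale a correlation matrix so that target proportions of
# explained variance are reached (illustrative, not the package's code)
set.seed(1)
n <- c(200, 100, 150)                # cluster sizes
z <- rep(seq_along(n), times = n)    # fixed cluster labels
ev <- c(0.9, 0.5)                    # target explained variance per attribute
p <- length(ev)

# Assumed cluster-specific means (one row per cluster)
mu <- matrix(rnorm(length(n) * p), nrow = length(n), ncol = p)

# Between-cluster variance of the means, weighted by cluster proportions
w <- n / sum(n)
mu_bar <- colSums(w * mu)
v_between <- colSums(w * sweep(mu, 2, mu_bar)^2)

# Within-cluster variances solving v_between / (v_between + v_within) = ev
v_within <- v_between * (1 - ev) / ev

# Re-scale an assumed correlation matrix C into the covariance matrix
C <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
D <- diag(sqrt(v_within))
Sigma <- D %*% C %*% D

# Simulate the data and compute the empirical R-squared per attribute
noise <- matrix(rnorm(sum(n) * p), nrow = sum(n), ncol = p) %*% chol(Sigma)
x <- mu[z, , drop = FALSE] + noise
r2 <- sapply(seq_len(p), function(j) {
  summary(lm(x[, j] ~ as.factor(z)))$r.squared
})
```

The empirical values in r2 should be close to ev, up to sampling variability.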

Value

A list with:

data

simulated data with sum(n) observations and sum(pk) variables.

theta

simulated (true) cluster membership.

theta_xc

binary vector encoding variables contributing to the clustering structure.

ev

vector of marginal expected proportions of explained variance for each variable.

mu_mixture

simulated (true) cluster-specific means. Only returned if output_matrices=TRUE.

sigma

simulated (true) covariance matrix. Only returned if output_matrices=TRUE.

See Also

MakePositiveDefinite

Other simulation functions: SimulateAdjacency(), SimulateComponents(), SimulateCorrelation(), SimulateGraphical(), SimulateRegression(), SimulateStructural()

Examples

oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

## Example with 3 clusters

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(10, 30, 15),
  nu_xc = 1,
  ev_xc = 0.5
)
print(simul)
plot(simul)

# Checking the proportion of explained variance
x <- simul$data[, 1]
z <- as.factor(simul$theta)
summary(lm(x ~ z)) # R-squared


## Example with 2 variables contributing to clustering

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = 0.8
)
print(simul)
plot(simul)

# Visualisation of the data
Heatmap(
  mat = simul$data,
  col = c("navy", "white", "red")
)
simul$ev # marginal proportions of explained variance

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)


## Example with different levels of separation

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.99, 0.5, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)


## Example with correlated contributors

# Data simulation
pk <- 10
adjacency <- matrix(0, pk, pk)
adjacency[1, 2] <- adjacency[2, 1] <- 1
set.seed(1)
sigma <- SimulateCorrelation(
  pk = pk,
  theta = adjacency,
  pd_strategy = "min_eigenvalue",
  v_within = 0.6, v_sign = -1
)$sigma
simul <- SimulateClustering(
  n = c(200, 100, 150), pk = pk, sigma = sigma,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.9, 0.8, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)

# Checking marginal proportions of explained variance
mymodel <- lm(simul$data[, 1] ~ as.factor(simul$theta))
summary(mymodel)$r.squared
mymodel <- lm(simul$data[, 2] ~ as.factor(simul$theta))
summary(mymodel)$r.squared

par(oldpar)


[Package fake version 1.4.0 Index]