SimulateClustering {fake} | R Documentation |
Simulation of data with underlying clusters
Description
Simulates data from a mixture of multivariate Normal distributions, with clusters of items (rows) sharing similar profiles along (a subset of) attributes (columns).
Usage
SimulateClustering(
  n = c(10, 10),
  pk = 10,
  sigma = NULL,
  theta_xc = NULL,
  nu_xc = 1,
  ev_xc = 0.5,
  output_matrices = FALSE
)
Arguments
n |
vector of the number of items per cluster in the simulated data. The
total number of items is |
pk |
vector of the number of attributes in the simulated data. |
sigma |
optional within-cluster correlation matrix. |
theta_xc |
optional binary matrix encoding which attributes (columns)
contribute to the clustering structure between which clusters (rows). If
|
nu_xc |
expected proportion of variables contributing to the clustering
over the total number of variables. Only used if |
ev_xc |
vector of expected proportion of variance in each of the contributing attributes that can be explained by the clustering. |
output_matrices |
logical indicating if the cluster and attribute specific means and cluster specific covariance matrix should be included in the output. |
Details
The data is simulated from a Gaussian mixture where, for all i \in {1, \dots, n}:
Z_i i.i.d. ~ M(1, \kappa)
X_i | Z_i indep. ~ N_p(\mu_{Z_i}, \Sigma)
where M(1, \kappa) is the multinomial distribution with parameters 1 and \kappa, the vector of length G (the number of clusters) with probabilities of belonging to each of the clusters, and N_p(\mu_{Z_i}, \Sigma) is the multivariate Normal distribution with a mean vector \mu_{Z_i} that depends on the cluster membership encoded in Z_i and the same covariance matrix \Sigma within all G clusters.
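The generative model above can be illustrated with a minimal base R sketch. It is not the package's implementation: the values of kappa, mu and Sigma are hypothetical, and the cluster labels are drawn at random here, whereas the n argument of SimulateClustering lists the number of items per cluster.

```r
# Sketch of the Gaussian mixture model using base R only (hypothetical values)
set.seed(1)
n <- 500                    # total number of items
p <- 3                      # number of attributes
kappa <- c(0.2, 0.5, 0.3)   # probabilities of belonging to each cluster
G <- length(kappa)          # number of clusters

# Cluster-specific mean vectors (one row per cluster)
mu <- rbind(
  c(-2, 0, 0),
  c(0, 0, 0),
  c(2, 1, 0)
)

# Covariance matrix shared by all clusters
Sigma <- matrix(0.3, nrow = p, ncol = p)
diag(Sigma) <- 1

# Z_i ~ M(1, kappa): one cluster label per item
z <- sample(seq_len(G), size = n, replace = TRUE, prob = kappa)

# X_i | Z_i ~ N_p(mu_{Z_i}, Sigma): correlated Gaussian noise around the means
# chol() returns the upper triangular factor R with t(R) %*% R equal to Sigma
noise <- matrix(rnorm(n * p), nrow = n, ncol = p) %*% chol(Sigma)
x <- mu[z, , drop = FALSE] + noise
```

Each row of x is one simulated item and z holds its cluster membership.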
The mean vectors \mu_{g}, g \in {1, \dots, G}, are simulated so that the desired proportion of variance in each of the attributes explained by the clustering (argument ev_xc) is reached.
The covariance matrix \Sigma is obtained by re-scaling a correlation matrix (argument sigma) to ensure that the desired proportions of variance explained by the clustering (argument ev_xc) are reached.
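One way to see how means and covariance can jointly reach a target proportion of explained variance is the law of total variance: for attribute j, the total variance is the variance of the cluster means plus the within-cluster variance \Sigma_{jj}. The sketch below assumes a unit total variance per attribute, which is an assumption made for illustration and not necessarily what the package does. The helper scale_means and all input values are hypothetical.

```r
# Hypothetical inputs
ev <- c(0.9, 0.5, 0)          # target proportions of explained variance
kappa <- c(0.4, 0.35, 0.25)   # probabilities of belonging to each cluster
corr <- matrix(0.3, nrow = 3, ncol = 3)  # within-cluster correlation matrix
diag(corr) <- 1

# Re-scale the correlation matrix so that the within-cluster variances are 1 - ev
d <- sqrt(1 - ev)
Sigma <- diag(d) %*% corr %*% diag(d)

# Centre and scale arbitrary cluster means (rows = clusters) so that their
# kappa-weighted variance equals ev for each attribute
scale_means <- function(raw, kappa, ev) {
  mu <- raw
  for (j in seq_len(ncol(raw))) {
    centred <- raw[, j] - sum(kappa * raw[, j])
    v <- sum(kappa * centred^2)
    mu[, j] <- if (v > 0) centred * sqrt(ev[j] / v) else 0
  }
  mu
}
raw <- rbind(
  c(-1, 1, 0),
  c(0, -1, 0),
  c(2, 0.5, 0)
)
mu <- scale_means(raw, kappa, ev)

# Law of total variance: explained proportion = between / (between + within)
between <- colSums(kappa * mu^2)  # weighted means are zero after centring
within <- diag(Sigma)
between / (between + within)
```

Under these assumptions the last line recovers ev, and cov2cor(Sigma) recovers the original correlation matrix, so the re-scaling changes the variances but not the within-cluster correlations.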
Value
A list with:
data |
simulated data with |
theta |
simulated (true) cluster membership. |
theta_xc |
binary vector encoding variables contributing to the clustering structure. |
ev |
vector of marginal expected proportions of explained variance for each variable. |
mu_mixture |
simulated (true) cluster-specific means. Only returned if
|
sigma |
simulated (true) covariance
matrix. Only returned if |
See Also
Other simulation functions: SimulateAdjacency(), SimulateComponents(), SimulateCorrelation(), SimulateGraphical(), SimulateRegression(), SimulateStructural()
Examples
oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))
## Example with 3 clusters
# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(10, 30, 15),
  nu_xc = 1,
  ev_xc = 0.5
)
print(simul)
plot(simul)
# Checking the proportion of explained variance
x <- simul$data[, 1]
z <- as.factor(simul$theta)
summary(lm(x ~ z)) # R-squared
## Example with 2 variables contributing to clustering
# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = 0.8
)
print(simul)
plot(simul)
# Visualisation of the data
Heatmap(
  mat = simul$data,
  col = c("navy", "white", "red")
)
simul$ev # marginal proportions of explained variance
# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)
## Example with different levels of separation
# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.99, 0.5, rep(0, 8))
)
# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)
## Example with correlated contributors
# Data simulation
pk <- 10
adjacency <- matrix(0, pk, pk)
adjacency[1, 2] <- adjacency[2, 1] <- 1
set.seed(1)
sigma <- SimulateCorrelation(
  pk = pk,
  theta = adjacency,
  pd_strategy = "min_eigenvalue",
  v_within = 0.6, v_sign = -1
)$sigma
simul <- SimulateClustering(
  n = c(200, 100, 150), pk = pk, sigma = sigma,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.9, 0.8, rep(0, 8))
)
# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)
# Checking marginal proportions of explained variance
mymodel <- lm(simul$data[, 1] ~ as.factor(simul$theta))
summary(mymodel)$r.squared
mymodel <- lm(simul$data[, 2] ~ as.factor(simul$theta))
summary(mymodel)$r.squared
par(oldpar)
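The per-variable R-squared checks in the examples can be wrapped in a small helper that covers all columns at once. The function explained_variance below is hypothetical (it is not part of the package) and is demonstrated on toy data generated with base R so that it runs on its own.

```r
# Hypothetical helper: R-squared of each column regressed on cluster membership
explained_variance <- function(x, z) {
  z <- as.factor(z)
  apply(x, 2, function(column) summary(lm(column ~ z))$r.squared)
}

# Toy data: the first column depends on the cluster, the second does not
set.seed(1)
z <- rep(1:3, times = c(20, 10, 15))
x <- cbind(
  rnorm(length(z), mean = 2 * z),
  rnorm(length(z))
)
r2 <- explained_variance(x, z)
r2
```

With a simulated object, explained_variance(simul$data, simul$theta) could be compared with simul$ev, keeping in mind that the empirical values vary around the expected proportions.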