SimulateClustering {fake} | R Documentation |
Simulation of data with underlying clusters
Description
Simulates mixture multivariate Normal data with clusters of items (rows) sharing similar profiles along (a subset of) attributes (columns).
Usage
SimulateClustering(
n = c(10, 10),
pk = 10,
sigma = NULL,
theta_xc = NULL,
nu_xc = 1,
ev_xc = 0.5,
output_matrices = FALSE
)
Arguments
n |
vector of the number of items per cluster in the simulated data. The
total number of items is sum(n). |
pk |
vector of the number of attributes in the simulated data. |
sigma |
optional within-cluster correlation matrix. |
theta_xc |
optional binary matrix encoding which attributes (columns)
contribute to the clustering structure between which clusters (rows). If
theta_xc = NULL, the variables contributing to the clustering are sampled with probability nu_xc. |
nu_xc |
expected proportion of variables contributing to the clustering
over the total number of variables. Only used if theta_xc = NULL. |
ev_xc |
vector of expected proportions of variance in each of the contributing attributes that can be explained by the clustering. |
output_matrices |
logical indicating if the cluster and attribute specific means and cluster specific covariance matrix should be included in the output. |
Details
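The data is simulated from a Gaussian mixture where, for each item $i$:
$$Z_i \overset{\text{i.i.d.}}{\sim} \mathcal{M}(1, \kappa)$$
$$X_i \mid Z_i \overset{\text{indep.}}{\sim} \mathcal{N}_p(\mu_{Z_i}, \Sigma)$$
where $\mathcal{M}(1, \kappa)$ is the multinomial distribution with parameters 1 and $\kappa$, the vector of length $G$ (the number of clusters) with probabilities of belonging to each of the clusters, and $\mathcal{N}_p(\mu_{Z_i}, \Sigma)$ is the multivariate Normal distribution with a mean vector $\mu_{Z_i}$ that depends on the cluster membership encoded in $Z_i$ and the same covariance matrix $\Sigma$ within all $G$ clusters.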
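The two-step sampling scheme described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the membership probabilities `kappa`, the cluster means `mu` and the shared covariance `Sigma` are hypothetical values chosen for the example. Note also that the sketch draws memberships at random, whereas the argument n of SimulateClustering() fixes the number of items per cluster.

```r
# Minimal sketch of Gaussian mixture sampling (base R only).
# All numerical values below are hypothetical, chosen for illustration.
set.seed(1)
n_items <- 200                            # total number of items
kappa <- c(0.2, 0.5, 0.3)                 # membership probabilities (sum to 1)
G <- length(kappa)                        # number of clusters
p <- 2                                    # number of attributes
mu <- rbind(c(-2, 0), c(0, 2), c(2, 0))   # G x p matrix of cluster means
Sigma <- matrix(c(1, 0.3, 0.3, 1), nrow = p)  # covariance shared by all clusters

# Step 1: draw the cluster membership of each item (multinomial, one trial)
z <- sample(seq_len(G), size = n_items, replace = TRUE, prob = kappa)

# Step 2: draw each item from a multivariate Normal centred on its cluster mean.
# chol() returns an upper triangular U with t(U) %*% U equal to Sigma, so
# standard Normal rows multiplied by U have covariance Sigma.
U <- chol(Sigma)
noise <- matrix(rnorm(n_items * p), nrow = n_items) %*% U
x <- mu[z, , drop = FALSE] + noise

# Items coloured by their (true) cluster membership
plot(x, col = z, pch = 19)
```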
The mean vectors $\mu_g$ of the clusters $g$ are simulated so that the desired proportion of variance in each of the attributes explained by the clustering (argument ev_xc) is reached.
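One way to make the "proportion of variance explained by the clustering" explicit is the law of total variance. The decomposition below is a sketch under assumed notation, not taken from the package source: $\kappa_g$ denotes the probability of belonging to cluster $g$, $\mu_{gj}$ the mean of attribute $j$ in cluster $g$, $\Sigma_{jj}$ the within-cluster variance of attribute $j$, and $\mathrm{ev}_j$ the proportion of explained variance for attribute $j$.

```latex
\begin{aligned}
\bar{\mu}_j &= \sum_{g=1}^{G} \kappa_g \, \mu_{gj}, \\
\operatorname{Var}(X_j) &= \underbrace{\Sigma_{jj}}_{\text{within clusters}}
  + \underbrace{\sum_{g=1}^{G} \kappa_g \, (\mu_{gj} - \bar{\mu}_j)^2}_{\text{between clusters}}, \\
\mathrm{ev}_j &= \frac{\sum_{g=1}^{G} \kappa_g \, (\mu_{gj} - \bar{\mu}_j)^2}{\operatorname{Var}(X_j)}.
\end{aligned}
```

Under this reading, $\mathrm{ev}_j$ is the population counterpart of the R-squared from regressing attribute $j$ on the cluster membership, which is the quantity estimated by the lm() checks in the Examples.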
The covariance matrix $\Sigma$ is obtained by re-scaling a correlation matrix (argument sigma) to ensure that the desired proportions of variance explained by the clustering (argument ev_xc) are reached.
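The re-scaling step can be illustrated with a short base R sketch. It relies on an assumption that is not confirmed by the text above: each attribute is taken to have a total variance of 1, so that the within-cluster variance of an attribute is one minus its proportion of explained variance. The helper `rescale_correlation()` is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): turn a within-cluster
# correlation matrix into a covariance matrix with diagonal 1 - ev.
# Assumption: every attribute has total variance 1, so the variance left
# unexplained by the clustering is 1 - ev.
rescale_correlation <- function(corr, ev) {
  stopifnot(
    nrow(corr) == ncol(corr),
    length(ev) == nrow(corr),
    all(ev >= 0 & ev < 1)
  )
  sd_within <- sqrt(1 - ev)  # within-cluster standard deviations
  # Element-wise form of D %*% corr %*% D with D = diag(sd_within)
  corr * outer(sd_within, sd_within)
}

# Two negatively correlated attributes with strong cluster separation
corr <- matrix(c(1, -0.6, -0.6, 1), nrow = 2)
sigma_within <- rescale_correlation(corr, ev = c(0.9, 0.8))

diag(sigma_within)     # within-cluster variances, here 1 - ev
cov2cor(sigma_within)  # the correlation structure is unchanged
```

Re-scaling by standard deviations changes the variances only, so the correlations supplied through the correlation matrix are preserved within clusters.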
Value
A list with:
data |
simulated data with sum(n) observations and sum(pk) variables. |
theta |
simulated (true) cluster membership. |
theta_xc |
binary vector encoding variables contributing to the clustering structure. |
ev |
vector of marginal expected proportions of explained variance for each variable. |
mu_mixture |
simulated (true) cluster-specific means. Only returned if
output_matrices = TRUE. |
sigma |
simulated (true) covariance
matrix. Only returned if output_matrices = TRUE. |
See Also
Other simulation functions: SimulateAdjacency(), SimulateComponents(), SimulateCorrelation(), SimulateGraphical(), SimulateRegression(), SimulateStructural()
Examples
oldpar <- par(no.readonly = TRUE)
par(mar = rep(7, 4))

## Example with 3 clusters

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(10, 30, 15),
  nu_xc = 1,
  ev_xc = 0.5
)
print(simul)
plot(simul)

# Checking the proportion of explained variance
x <- simul$data[, 1]
z <- as.factor(simul$theta)
summary(lm(x ~ z)) # R-squared

## Example with 2 variables contributing to clustering

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = 0.8
)
print(simul)
plot(simul)

# Visualisation of the data
Heatmap(
  mat = simul$data,
  col = c("navy", "white", "red")
)
simul$ev # marginal proportions of explained variance

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)

## Example with different levels of separation

# Data simulation
set.seed(1)
simul <- SimulateClustering(
  n = c(20, 10, 15), pk = 10,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.99, 0.5, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)

## Example with correlated contributors

# Data simulation
pk <- 10
adjacency <- matrix(0, pk, pk)
adjacency[1, 2] <- adjacency[2, 1] <- 1
set.seed(1)
sigma <- SimulateCorrelation(
  pk = pk,
  theta = adjacency,
  pd_strategy = "min_eigenvalue",
  v_within = 0.6, v_sign = -1
)$sigma
simul <- SimulateClustering(
  n = c(200, 100, 150), pk = pk, sigma = sigma,
  theta_xc = c(1, 1, rep(0, 8)),
  ev_xc = c(0.9, 0.8, rep(0, 8))
)

# Visualisation along contributing variables
plot(simul$data[, 1:2], col = simul$theta, pch = 19)

# Checking marginal proportions of explained variance
mymodel <- lm(simul$data[, 1] ~ as.factor(simul$theta))
summary(mymodel)$r.squared
mymodel <- lm(simul$data[, 2] ~ as.factor(simul$theta))
summary(mymodel)$r.squared

par(oldpar)