sim_IMIFA {IMIFA}    R Documentation

Simulate Data from a Mixture of Factor Analysers Structure

Description

Functions to simulate data of any size and dimension from an (infinite) mixture of (infinite) factor analysers, given either a parameterisation or a fitted object.

Usage

sim_IMIFA_data(N = 300L,
               G = 3L,
               P = 50L,
               Q = rep(floor(log(P)), G),
               pis = rep(1/G, G),
               mu = NULL,
               psi = NULL,
               loadings = NULL,
               scores = NULL,
               nn = NULL,
               loc.diff = 2,
               non.zero = P,
               forceQg = TRUE,
               method = c("conditional", "marginal"))

sim_IMIFA_model(res,
                method = c("conditional", "marginal"))

Arguments

N, G, P

Desired overall number of observations, number of clusters, and number of variables in the simulated data set. Each must be a single integer.

Q

Desired number of cluster-specific latent factors in the simulated data set. Can be specified either as a single integer if all clusters are to have the same number of factors, or a vector of length G. Defaults to floor(log(P)) in each cluster. Should be less than the associated Ledermann bound and the number of observations in the corresponding cluster. The argument forceQg can be used to enforce this upper limit. It is also advisable that Q <= floor((P - 1)/2), but this restriction is not enforced by forceQg.
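
As a rough illustration of how these limits compare, the sketch below computes the default Q, the classical Ledermann bound, and the advisory limit for P = 50. The helper ledermann_bound is hypothetical and written only for this example; it uses the textbook formula and is not the package's own Ledermann function, which may differ (for instance, in how it handles isotropic uniquenesses).

```r
# Hypothetical helper (not part of IMIFA): the classical Ledermann bound,
# i.e. the largest Q satisfying (P - Q)^2 >= P + Q.
ledermann_bound <- function(P) {
  floor((2 * P + 1 - sqrt(8 * P + 1)) / 2)
}

P <- 50L
Q_default   <- floor(log(P))        # default number of factors per cluster
Q_ledermann <- ledermann_bound(P)   # classical upper bound
Q_advised   <- floor((P - 1) / 2)   # advisory limit mentioned in the text

c(default = Q_default, ledermann = Q_ledermann, advised = Q_advised)
# default = 3, ledermann = 40, advised = 24
```

For P = 50 the default of 3 factors sits well inside both limits, and the advisory limit is the tighter of the two.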

pis

Mixing proportions of the clusters in the data set if G > 1. Must sum to 1. Defaults to rep(1/G, G).

mu

True values of the mean parameters, either as a single value, a vector of length G, a vector of length P, or a G * P matrix. If mu is missing, loc.diff is invoked to simulate distinct means for each cluster by default.

psi

True values of uniqueness parameters, either as a single value, a vector of length G, a vector of length P, or a G * P matrix. As such the user can specify uniquenesses as a diagonal or isotropic matrix, and further constrain uniquenesses across clusters if desired. If psi is missing, uniquenesses are simulated via 1/rgamma(P, 2, 1) within each cluster by default.
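
The default described above can be mimicked in base R. The following is only a sketch of what "1/rgamma(P, 2, 1) within each cluster" might look like when arranged with one row per cluster; the object names psi_sim and psi_iso are illustrative, and the package's internal layout may differ.

```r
set.seed(1)
G <- 3L
P <- 20L

# One row per cluster: reciprocals of Gamma(shape = 2, rate = 1) variates,
# i.e. inverse-gamma draws, so every uniqueness is strictly positive
psi_sim <- t(replicate(G, 1 / rgamma(P, shape = 2, rate = 1)))
dim(psi_sim)

# Isotropic alternative: a single uniqueness per cluster (a vector of length G),
# as in the package's own example with psi = 1:3
psi_iso <- 1:3
```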

loadings

True values of the loadings matrix/matrices. Must be supplied in the form of a list of numeric matrices when G > 1, otherwise a single matrix. Matrices must contain P rows and the number of columns must correspond to the values in Q. If loadings are not supplied, such matrices are populated with standard normal random variates by default (see non.zero).

scores

True values of the latent factor scores, as an N * max(Q) numeric matrix. If scores are not supplied, such a matrix is populated with standard normal random variates by default. Only relevant when method="conditional".

nn

An alternative way to specify the size of each cluster, by giving the exact number of observations in each cluster explicitly. Must sum to N.
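
The difference between specifying pis and specifying nn can be sketched in base R. This assumes that, under pis, cluster labels are sampled with the given probabilities so that the realised sizes are random; that is an assumption made for illustration and not a statement about the package's internals.

```r
set.seed(2)
N   <- 300L
G   <- 3L
pis <- c(0.5, 0.3, 0.2)
stopifnot(isTRUE(all.equal(sum(pis), 1)))   # mixing proportions must sum to 1

# Via pis: cluster sizes are random, matching pis only on average
labels_pis <- sample(seq_len(G), size = N, replace = TRUE, prob = pis)
nn_random  <- tabulate(labels_pis, nbins = G)

# Via nn: cluster sizes are fixed exactly and must sum to N
nn_exact  <- c(150L, 90L, 60L)
stopifnot(sum(nn_exact) == N)
labels_nn <- rep(seq_len(G), times = nn_exact)
```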

loc.diff

A parameter to control the closeness of the clusters in terms of the difference in their location vectors. Only relevant if mu is NOT supplied. Defaults to 2.

More specifically, when loc.diff is used, means are simulated with the vector of cluster-specific hypermeans given by:

scale(1:G, center=TRUE, scale=FALSE) * loc.diff.
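
For concreteness, the expression can be evaluated directly in base R with the default values G = 3 and loc.diff = 2:

```r
G        <- 3L
loc.diff <- 2

# Centre the cluster indices 1, ..., G around zero, then stretch by loc.diff
hypermeans <- scale(1:G, center = TRUE, scale = FALSE) * loc.diff
as.numeric(hypermeans)
# -2  0  2
```

Larger values of loc.diff push the hypermeans further apart, so the simulated clusters are easier to separate.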

non.zero

Controls the number of non-zero entries in each loadings column (per cluster) only when loadings is not explicitly supplied. Values must be integers in the interval [1,P]. Defaults to P. The positions of the zeros are randomised, and non-zero entries are drawn from a standard normal.

Must be given as a list of length G of vectors of length corresponding to Q when G>1. Can be given either as such a list or simply a vector of length Q when G=1. Alternatively, a single integer can be supplied, common across all loadings columns across all clusters. In any case, non.zero will be affected by forceQg=TRUE by default (see below).
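
A minimal base-R sketch of this behaviour for a single cluster follows. It assumes zero positions are chosen uniformly at random within each column, which matches the description above but is not taken from the package's source; the object names are illustrative.

```r
set.seed(3)
P        <- 20L
Q        <- 3L
non.zero <- c(20L, 10L, 5L)   # one entry per loadings column, each in [1, P]

# Build the P x Q loadings matrix column by column
loadings <- vapply(non.zero, function(nz) {
  column       <- numeric(P)          # start with all zeros
  keep         <- sample.int(P, nz)   # randomised positions of the non-zero entries
  column[keep] <- rnorm(nz)           # standard normal non-zero loadings
  column
}, numeric(P))

colSums(loadings != 0)
# 20 10  5
```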

forceQg

A logical indicating whether the upper limit on the number of cluster-specific factors Q is enforced. Defaults to TRUE for sim_IMIFA_data, but is always FALSE for sim_IMIFA_model. Note that when forceQg=TRUE is invoked, non.zero (see above) is also affected. This upper limit is determined by the Ledermann bound and by the requirement that Q be less than the number of observations in the given cluster. It is also advisable that Q <= floor((P - 1)/2), but this restriction is not enforced by forceQg.

method

A switch indicating whether the mixture to be simulated from is the conditional distribution of the data given the latent variables (default), or simply the marginal distribution of the data.
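
The two options can be illustrated for a single cluster using the standard factor-analysis generative model. This is a base-R sketch under that assumption, not the package's internal code, and all object names are illustrative.

```r
set.seed(4)
n <- 200L
P <- 5L
Q <- 2L
mu     <- rep(1, P)                       # mean vector
Lambda <- matrix(rnorm(P * Q), P, Q)      # P x Q loadings matrix
psi    <- 1 / rgamma(P, 2, 1)             # diagonal uniquenesses

# Conditional: draw the factor scores first, then the data given the scores,
# x = mu + Lambda %*% eta + eps, with eps ~ N(0, diag(psi))
eta    <- matrix(rnorm(n * Q), n, Q)
eps    <- matrix(rnorm(n * P), n, P) %*% diag(sqrt(psi))
x_cond <- sweep(eta %*% t(Lambda) + eps, 2, mu, "+")

# Marginal: integrate the scores out and draw directly from
# N(mu, Lambda %*% t(Lambda) + diag(psi))
Sigma  <- tcrossprod(Lambda) + diag(psi)
x_marg <- sweep(matrix(rnorm(n * P), n, P) %*% chol(Sigma), 2, mu, "+")
```

Both approaches target the same distribution for the data; the conditional route additionally yields the factor scores that generated each observation, which is why scores is only relevant in that case.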

res

An object of class "Results_IMIFA" generated by get_IMIFA_results.

Details

sim_IMIFA_model is a simple wrapper to sim_IMIFA_data which uses the estimated parameters of a fitted IMIFA-related model, as generated by get_IMIFA_results. The necessary parameters must have been stored via storeControl when the model underlying res was originally fitted.

Value

Invisibly returns a data.frame with N observations (rows) of P variables (columns). The true values of the parameters which generated these data are also stored as attributes.

Note

N, G, P & Q will NOT be inferred from the supplied parameters pis, mu, psi, loadings, scores & nn - rather, the parameters' length/dimensions must adhere to the supplied values of N, G, P & Q.

Missing values are not allowed in any of pis, mu, psi, loadings, scores & nn.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Murphy, K., Viroli, C., and Gormley, I. C. (2020) Infinite mixtures of infinite factor analysers, Bayesian Analysis, 15(3): 937-963. <doi:10.1214/19-BA1179>.

See Also

mcmc_IMIFA for fitting an IMIFA-related model to the simulated data set.

get_IMIFA_results for generating input for sim_IMIFA_model.

Ledermann for details on the upper bound for Q. Note that this function accounts for isotropic uniquenesses, if psi is supplied in that manner, in computing this bound.

Examples

# Simulate 100 observations from 3 balanced clusters with cluster-specific numbers of latent factors
# Specify isotropic uniquenesses within each cluster
# Supply cluster means directly
sim_data  <- sim_IMIFA_data(N=100, G=3, P=20, Q=c(2, 2, 5), psi=1:3,
                            mu=matrix(rnorm(60, -2 + 1:3, 1), nrow=20, ncol=3, byrow=TRUE))
names(attributes(sim_data))
labels    <- attr(sim_data, "Labels")

# Visualise the data in two dimensions
plot(cmdscale(dist(sim_data), k=2), col=labels)

# Examine the overlap with a pairs plot of 5 randomly chosen variables
pairs(sim_data[,sample(1:20, 5)], col=labels)

# Fit a MIFA model to these data
# tmp     <- mcmc_IMIFA(sim_data, method="MIFA", range.G=3, n.iters=5000)

# Simulate from this model
# res     <- get_IMIFA_results(tmp, zlabels=labels)
# sim_mod <- sim_IMIFA_model(res)

[Package IMIFA version 2.2.0]