R: Simulate Data from a Mixture of Factor Analysers Structure

sim_IMIFA {IMIFA}

R Documentation

Simulate Data from a Mixture of Factor Analysers Structure

Description

Functions to simulate data of any size and dimension from an (infinite) mixture of (infinite) factor analysers, given either a parameterisation or a fitted object.

Usage

sim_IMIFA_data(N = 300L,
               G = 3L,
               P = 50L,
               Q = rep(floor(log(P)), G),
               pis = rep(1/G, G),
               mu = NULL,
               psi = NULL,
               loadings = NULL,
               scores = NULL,
               nn = NULL,
               loc.diff = 2,
               non.zero = P,
               forceQg = TRUE,
               method = c("conditional", "marginal"))

sim_IMIFA_model(res,
                method = c("conditional", "marginal"))

Arguments

`N`, `G`, `P`	Desired overall number of observations, number of clusters, and number of variables in the simulated data set. Each must be a single integer.
`Q`	Desired number of cluster-specific latent factors in the simulated data set. Can be specified either as a single integer if all clusters are to have the same number of factors, or a vector of length `G`. Defaults to `floor(log(P))` in each cluster. Should be less than the associated `Ledermann` bound and the number of observations in the corresponding cluster. The argument `forceQg` can be used to enforce this upper limit. It is also advisable that `Q <= floor((P - 1)/2)`, but this restriction is not enforced by `forceQg`.
`pis`	Mixing proportions of the clusters in the data set if `G` > 1. Must sum to 1. Defaults to `rep(1/G, G)`.
`mu`	True values of the mean parameters, either as a single value, a vector of length `G`, a vector of length `P`, or a `G * P` matrix. If `mu` is missing, `loc.diff` is invoked to simulate distinct means for each cluster by default.
`psi`	True values of uniqueness parameters, either as a single value, a vector of length `G`, a vector of length `P`, or a `G * P` matrix. As such the user can specify uniquenesses as a diagonal or isotropic matrix, and further constrain uniquenesses across clusters if desired. If `psi` is missing, uniquenesses are simulated via `1/rgamma(P, 2, 1)` within each cluster by default.
`loadings`	True values of the loadings matrix/matrices. Must be supplied in the form of a list of numeric matrices when `G > 1`, otherwise a single matrix. Matrices must contain `P` rows and the number of columns must correspond to the values in `Q`. If `loadings` are not supplied, such matrices are populated with standard normal random variates by default (see `non.zero`).
`scores`	True values of the latent factor scores, as a `N * max(Q)` numeric matrix. If `scores` are not supplied, such a matrix is populated with standard normal random variates by default. Only relevant when `method="conditional"`.
`nn`	An alternative way to specify the size of each cluster, by giving the exact number of observations in each cluster explicitly. Must sum to `N`.
`loc.diff`	A parameter to control the closeness of the clusters in terms of the difference in their location vectors. Only relevant if `mu` is NOT supplied. Defaults to `2`. More specifically, when `loc.diff` is invoked, means are simulated with the vector of cluster-specific hypermeans given by `scale(1:G, center=TRUE, scale=FALSE) * loc.diff`.
`non.zero`	Controls the number of non-zero entries in each loadings column (per cluster), only when `loadings` is not explicitly supplied. Values must be integers in the interval `[1,P]`. Defaults to `P`. The positions of the zeros are randomised, and non-zero entries are drawn from a standard normal. When `G>1`, must be given as a list of length `G` whose elements are vectors of lengths corresponding to `Q`. When `G=1`, can be given either as such a list or simply as a vector of length `Q`. Alternatively, a single integer can be supplied, which is then common to all loadings columns across all clusters. In any case, `non.zero` will be affected by `forceQg=TRUE` by default (see below).
`forceQg`	A logical indicating whether the upper limit on the number of cluster-specific factors `Q` is enforced. Defaults to `TRUE` for `sim_IMIFA_data`, but is always `FALSE` for `sim_IMIFA_model`. Note that when `forceQg=TRUE` is invoked, `non.zero` (see above) is also affected. This upper limit is determined by the `Ledermann` bound and by the requirement that `Q` be less than the number of observations in the given cluster. It is also advisable that `Q <= floor((P - 1)/2)`, but this restriction is not enforced by `forceQg`.
`method`	A switch indicating whether the mixture to be simulated from is the conditional distribution of the data given the latent variables (default), or simply the marginal distribution of the data.
`res`	An object of class `"Results_IMIFA"` generated by `get_IMIFA_results`.
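
The default for `Q`, the limits mentioned under `Q` and `forceQg`, the `loc.diff` hypermeans, and the default uniquenesses can all be inspected in base R before simulating. The following is a minimal sketch: `ledermann_bound` is a hypothetical helper implementing the standard Ledermann formula, and it is an assumption that it agrees with the value the package itself uses.

```r
# Hypothetical helper (not part of IMIFA): the standard Ledermann bound,
# floor((2P + 1 - sqrt(8P + 1)) / 2), for P variables.
ledermann_bound <- function(P) floor((2 * P + 1 - sqrt(8 * P + 1)) / 2)

P <- 50L
G <- 3L

# Default number of factors in each cluster, as given in the Usage section
Q_default   <- rep(floor(log(P)), G)

# Limits discussed under `Q` and `forceQg`
Q_ledermann <- ledermann_bound(P)    # Q should be less than this
Q_advised   <- floor((P - 1) / 2)    # advisable, but not enforced by forceQg

# Cluster-specific hypermeans used when `mu` is not supplied
loc.diff    <- 2
hypermeans  <- as.vector(scale(1:G, center = TRUE, scale = FALSE) * loc.diff)

# Default uniquenesses for a single cluster when `psi` is not supplied
set.seed(1)
psi_g       <- 1 / rgamma(P, 2, 1)
```

Increasing `loc.diff` spreads the hypermeans further apart, which should make the simulated clusters easier to separate.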

Details

sim_IMIFA_model is a simple wrapper to sim_IMIFA_data which uses the estimated parameters of a fitted IMIFA-related model, as generated by get_IMIFA_results. The necessary parameters must have been originally stored via storeControl in the creation of res.
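
Both functions accept the `method` argument. The distinction between its two settings can be illustrated for a single cluster in base R. This is only a sketch of the standard factor-analytic model, not the package's internal code: the names `Lambda`, `eta`, `X_cond` and `X_marg` are illustrative assumptions.

```r
set.seed(123)
N <- 500L   # observations
P <- 6L     # variables
Q <- 2L     # latent factors

mu     <- rnorm(P)                                  # cluster mean
Lambda <- matrix(rnorm(P * Q), nrow = P, ncol = Q)  # P x Q loadings
psi    <- 1 / rgamma(P, 2, 1)                       # diagonal uniquenesses

# "conditional": draw the scores first, then the data given the scores
eta    <- matrix(rnorm(N * Q), nrow = N, ncol = Q)  # N x Q factor scores
eps    <- matrix(rnorm(N * P), nrow = N, ncol = P) %*% diag(sqrt(psi))
X_cond <- sweep(eta %*% t(Lambda) + eps, 2, mu, "+")

# "marginal": integrate the scores out and draw directly from
# N(mu, Lambda Lambda' + diag(psi))
Sigma  <- tcrossprod(Lambda) + diag(psi)
R      <- chol(Sigma)                               # t(R) %*% R equals Sigma
X_marg <- sweep(matrix(rnorm(N * P), nrow = N, ncol = P) %*% R, 2, mu, "+")
```

Under this model both draws share the same distribution; the conditional route additionally yields the scores `eta`, which is consistent with the `scores` argument only being relevant when `method="conditional"`.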

Value

Invisibly returns a data.frame with N observations (rows) of P variables (columns). The true values of the parameters which generated these data are also stored as attributes.

Note

N, G, P & Q will NOT be inferred from the supplied parameters pis, mu, psi, loadings, scores & nn - rather, the parameters' length/dimensions must adhere to the supplied values of N, G, P & Q.

Missing values are not allowed in any of pis, mu, psi, loadings, scores & nn.
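
Because the dimensions are never inferred, it can help to build the inputs from `N`, `G`, `P` and `Q` and check them before calling the simulator. The following base R sketch does so under illustrative values; the final call is left commented out, as it requires the IMIFA package and uses only the arguments documented above.

```r
set.seed(42)
N <- 100L
G <- 3L
P <- 20L
Q <- c(2L, 2L, 5L)

nn       <- c(50L, 30L, 20L)   # exact cluster sizes; must sum to N
psi      <- 1:3                # one isotropic uniqueness per cluster (length G)
loadings <- lapply(Q, function(q) matrix(rnorm(P * q), nrow = P, ncol = q))
scores   <- matrix(rnorm(N * max(Q)), nrow = N, ncol = max(Q))

# Dimensions must adhere to N, G, P & Q, and no missing values are allowed
stopifnot(
  length(nn) == G, sum(nn) == N,
  length(psi) %in% c(1L, G, P),
  length(loadings) == G,
  all(vapply(loadings, nrow, integer(1)) == P),
  all(vapply(loadings, ncol, integer(1)) == Q),
  all(dim(scores) == c(N, max(Q))),
  !anyNA(nn), !anyNA(psi), !anyNA(unlist(loadings)), !anyNA(scores)
)

# sim_data <- sim_IMIFA_data(N=N, G=G, P=P, Q=Q, psi=psi,
#                            loadings=loadings, scores=scores, nn=nn)
```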

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Murphy, K., Viroli, C., and Gormley, I. C. (2020) Infinite mixtures of infinite factor analysers, Bayesian Analysis, 15(3): 937-963. <doi:10.1214/19-BA1179>.

Examples

# Simulate 100 observations from 3 balanced clusters with cluster-specific numbers of latent factors
# Specify isotropic uniquenesses within each cluster
# Supply cluster means directly
sim_data  <- sim_IMIFA_data(N=100, G=3, P=20, Q=c(2, 2, 5), psi=1:3,
                            mu=matrix(rnorm(60, -2 + 1:3, 1), nrow=20, ncol=3, byrow=TRUE))
names(attributes(sim_data))
labels    <- attr(sim_data, "Labels")

# Visualise the data in two dimensions
plot(cmdscale(dist(sim_data), k=2), col=labels)

# Examine the overlap with a pairs plot of 5 randomly chosen variables
pairs(sim_data[,sample(1:20, 5)], col=labels)

# Fit a MIFA model to these data
# tmp     <- mcmc_IMIFA(sim_data, method="MIFA", range.G=3, n.iters=5000)

# Simulate from this model
# res     <- get_IMIFA_results(tmp, zlabels=labels)
# sim_mod <- sim_IMIFA_model(res)

[Package IMIFA version 2.2.0 Index]