generateData {missoNet}    R Documentation

Quickly generate synthetic data for simulation studies

Description

The ‘generateData’ function produces synthetic data with randomly or systematically missing values from a conditional Gaussian graphical model. It supports three user-specified missing mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

Usage

generateData(
  X = NULL,
  Beta = NULL,
  E = NULL,
  Theta = NULL,
  Sigma.X = NULL,
  n,
  p,
  q,
  rho,
  missing.type = "MCAR",
  Beta.row.sparsity = 0.2,
  Beta.elm.sparsity = 0.2,
  with.seed = NULL
)

Arguments

X

(Optional) a user-supplied predictor matrix (n\times p). The default is 'NULL' and the program simulates the rows of 'X' independently from \mathcal{MVN}(0_p, \mathbf{\Sigma}_X). A user-supplied matrix overrides this default, and the argument 'Sigma.X' for \mathbf{\Sigma}_X will be ignored.

Beta

(Optional) a user-supplied regression coefficient matrix \mathbf{B} (p\times q). The default is 'NULL' and the program will generate a sparse \mathbf{B} in which the nonzero elements are independently drawn from \mathcal{N}(0, 1); the row sparsity and element sparsity of \mathbf{B} are controlled by the arguments 'Beta.row.sparsity' and 'Beta.elm.sparsity', respectively. A user-supplied matrix overrides this default, and 'Beta.row.sparsity' and 'Beta.elm.sparsity' will be ignored.

E

(Optional) a user-supplied error matrix (n\times q). The default is 'NULL' and the program simulates the rows of 'E' independently from \mathcal{MVN}(0_q, \mathbf{\Theta}^{-1}). A response matrix 'Y' without missing values is given by 'Y = X %*% Beta + E'. A user-supplied matrix overrides this default, and the argument 'Theta' for \mathbf{\Theta} will be ignored.

Theta

(Optional) a user-supplied positive definite precision (inverse covariance) matrix \mathbf{\Theta} (q\times q) for the response variables. The default is 'NULL' and the program will generate a block-structured matrix having four blocks corresponding to four types of network structures: independent, weak graph, strong graph and chain. This is only needed when 'E = NULL'.

Sigma.X

(Optional) a user-supplied positive definite covariance matrix \mathbf{\Sigma}_X (p\times p) for the predictor variables. The rows of 'X' are drawn independently from a multivariate Gaussian distribution \mathcal{MVN}(0_p, \mathbf{\Sigma}_X). If 'Sigma.X = NULL' (default), the program uses an AR(1) covariance with 0.7 autocorrelation (i.e., [\mathbf{\Sigma}_X]_{jk} = 0.7^{|j-k|}). This is only needed when 'X = NULL'.
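
As a minimal sketch of the default described above, an AR(1) covariance with entries 0.7^{|j-k|} can be built in base R as follows. The variable names and the small value of 'p' are illustrative, and this is not necessarily how the package constructs the matrix internally.

```r
# Illustrative only: build an AR(1) covariance with entries 0.7^|j - k|.
p <- 5
idx <- seq_len(p)
Sigma.X <- 0.7^abs(outer(idx, idx, "-"))

# Sanity checks: unit diagonal, symmetric, positive definite.
stopifnot(all(diag(Sigma.X) == 1))
stopifnot(isSymmetric(Sigma.X))
stopifnot(all(eigen(Sigma.X, symmetric = TRUE, only.values = TRUE)$values > 0))
```

A matrix built this way could be passed as 'Sigma.X', with the base 0.7 changed to obtain a stronger or weaker correlation among predictors.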

n

Sample size.

p

The dimensionality of the predictors.

q

The dimensionality of the responses.

rho

A scalar or a numeric vector of length q specifying the approximate proportion of missing values in each column of the response matrix.

missing.type

Character string: can be "MCAR" (default), "MAR" or "MNAR".

Beta.row.sparsity

A Bernoulli parameter between 0 and 1 controlling the approximate proportion of rows where at least one element could be nonzero in \mathbf{B}; the default is 'Beta.row.sparsity = 0.2'. This is only needed when 'Beta = NULL'.

Beta.elm.sparsity

A Bernoulli parameter between 0 and 1 controlling the approximate proportion of nonzero elements among the rows of \mathbf{B} where not all elements are zeros; the default is 'Beta.elm.sparsity = 0.2'. This is only needed when 'Beta = NULL'.
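
The interplay of the two sparsity parameters can be sketched as below. This is one possible reading of the descriptions above (a Bernoulli mask per row, then a Bernoulli mask per element), written in base R with illustrative names such as 'row.keep' and 'elm.keep'. It is an assumption, not the package's confirmed implementation.

```r
# Illustrative only: one way to impose row and element sparsity on Beta.
set.seed(1)
p <- 50; q <- 20
Beta.row.sparsity <- 0.2
Beta.elm.sparsity <- 0.2

Beta <- matrix(rnorm(p * q, 0, 1), p, q)

# Bernoulli draw per row: 1 keeps the row as a candidate for nonzero entries.
row.keep <- rbinom(p, 1, Beta.row.sparsity)
# Bernoulli draw per element: 1 keeps the element within a kept row.
elm.keep <- matrix(rbinom(p * q, 1, Beta.elm.sparsity), p, q)

# Zero out whole rows, then single elements (row.keep recycles down each column).
Beta <- Beta * row.keep * elm.keep

mean(rowSums(Beta != 0) > 0)  # share of rows with at least one nonzero entry
mean(Beta != 0)               # overall share of nonzero elements
```

Under this reading, the overall share of nonzero elements is roughly the product of the two parameters.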

with.seed

A random number seed for the generative process.

Details

The dataset is simulated through the following steps:

  1. If 'X = NULL' (default), the function ‘mvtnorm::rmvnorm(n, mean = rep(0, p), sigma = Sigma.X)’ is used to simulate 'n' samples from a 'p'-variate Gaussian distribution for generating a predictor matrix 'X';

  2. If 'Beta = NULL' (default), the function ‘stats::rnorm(p*q, 0, 1)’ is used to fill a (p \times q) matrix 'Beta', whose row sparsity and element sparsity are then set by the auxiliary arguments 'Beta.row.sparsity' and 'Beta.elm.sparsity', respectively;

  3. If 'E = NULL' (default), the function ‘mvtnorm::rmvnorm(n, mean = rep(0, q), sigma = solve(Theta))’ is used to simulate 'n' samples from a 'q'-variate Gaussian distribution for generating an error matrix 'E';

  4. A complete response matrix 'Y' without missing values is then generated by the command 'Y = X %*% Beta + E';

  5. To get a response matrix 'Z' := f('Y') corrupted by missing data, the values in 'Y' are partially replaced with 'NA's following the strategy specified by the arguments 'missing.type' and 'rho'.
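
Steps 1 to 4 above can be sketched in base R as follows. To keep the example free of package dependencies, a Cholesky-based sampler ('draw.mvn', a hypothetical helper) stands in for a dedicated multivariate normal generator. The identity 'Theta' and the dense 'Beta' are simplifying assumptions.

```r
# Illustrative only: steps 1-4 in base R, with a Cholesky-based Gaussian
# sampler standing in for a dedicated multivariate normal generator.
set.seed(123)
n <- 100; p <- 10; q <- 4

# Hypothetical helper: draw n rows from MVN(0, Sigma) using Sigma = t(R) %*% R.
draw.mvn <- function(n, Sigma) {
  R <- chol(Sigma)
  matrix(rnorm(n * ncol(Sigma)), n, ncol(Sigma)) %*% R
}

Sigma.X <- 0.7^abs(outer(seq_len(p), seq_len(p), "-"))  # AR(1) default
Theta <- diag(q)                                         # simple precision matrix (assumption)

X <- draw.mvn(n, Sigma.X)                  # step 1: predictors
Beta <- matrix(rnorm(p * q, 0, 1), p, q)   # step 2: coefficients (sparsity masks omitted)
E <- draw.mvn(n, solve(Theta))             # step 3: errors with covariance Theta^{-1}
Y <- X %*% Beta + E                        # step 4: complete responses
dim(Y)
```

The resulting 'Y' is complete; step 5 is what introduces the 'NA's.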

To better illustrate step 5 above, suppose that for all i = 1,...,n and j = 1,...,q, 'Y[i, j]' is replaced with 'NA' if 'M[i, j] == 1', where 'M' is an indicator matrix of missingness with the same dimensions as 'Y'. The value of 'M[i, j]' is partially controlled by the arguments 'missing.type' and 'rho'. Below we sum up the three built-in missing mechanisms supported by the ‘generateData’ function:

Of the three missing mechanisms, "MCAR" is random and the other two are systematic. Under "MCAR", 'M[i, j]' is not related to 'Y' or to 'X'; under "MAR", 'M[i, j]' is related to 'X', but not related to 'Y' after 'X' is controlled; under "MNAR", 'M[i, j]' is related to 'Y' itself, even after 'X' is controlled.
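
The qualitative definitions above can be made concrete with a small base R sketch. Only the "MCAR" rule (an independent Bernoulli draw with probability 'rho[j]') follows directly from the description. The "MAR" and "MNAR" rules shown here are hypothetical examples that merely satisfy the stated dependence on 'X' and 'Y', and should not be read as the package's actual formulas.

```r
# Illustrative only: masks that satisfy the qualitative definitions above.
set.seed(42)
n <- 200; q <- 3
X1 <- rnorm(n)                          # a single observed predictor (assumption)
Y <- matrix(rnorm(n * q), n, q) + X1    # toy complete responses
rho <- rep(0.1, q)                      # a scalar rho expanded to length q

# MCAR: M[i, j] is an independent Bernoulli(rho[j]) draw.
M.mcar <- sapply(rho, function(r) rbinom(n, 1, r))

# MAR (hypothetical rule): the missingness probability depends on X only.
M.mar <- sapply(rho, function(r) rbinom(n, 1, 2 * r * plogis(X1)))

# MNAR (hypothetical rule): the smallest rho[j] share of each column goes missing.
M.mnar <- sapply(seq_len(q), function(j) {
  as.integer(Y[, j] < quantile(Y[, j], rho[j]))
})

# Apply one mask to obtain the corrupted response matrix.
Z <- Y
Z[M.mcar == 1] <- NA
colMeans(is.na(Z))                      # per-column missing rate, close to rho
```

Comparing 'colMeans(is.na(Z))' with 'rho' is a quick way to confirm that a simulated dataset has roughly the intended missing rate in each column.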

Value

This function returns a 'list' consisting of the following components:

X

A simulated (or the user-supplied) predictor matrix (n\times p).

Y

A simulated response matrix without missing values (n\times q).

Z

A simulated response matrix with missing values coded as 'NA's (n\times q).

Beta

The regression coefficient matrix \mathbf{B} for the generative model (p\times q).

Theta

The precision matrix \mathbf{\Theta} for the generative model (q\times q).

rho

A vector of length q storing the specified missing rate for each column of the response matrix.

missing.type

Character string: the type of missing mechanism used to generate missing values in the response matrix.

Author(s)

Yixiao Zeng <yixiao.zeng@mail.mcgill.ca>, Celia M.T. Greenwood and Archer Yi Yang.

Examples

## Simulate a dataset with response values missing completely at random (MCAR);
## the overall missing rate is around 10%.
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, missing.type = "MCAR")
## -------------------------------------------------------------------------------
## Fit a missoNet model using the simulated dataset.
X <- sim.dat$X  # predictor matrix
Y <- sim.dat$Z  # corrupted response matrix
fit <- missoNet(X = X, Y = Y, lambda.Beta = 0.1, lambda.Theta = 0.1)


## Simulate a dataset with response values missing at random (MAR); the approximate
## missing rate for each column of the response matrix is specified through a vector 'rho'.
##
## The row sparsity and element sparsity of the auto-generated 'Beta' can be
## adjusted through 'Beta.row.sparsity' and 'Beta.elm.sparsity'.
n <- 300; p <- 50; q <- 20
rho <- runif(q, min = 0, max = 0.2)
sim.dat <- generateData(n = n, p = p, q = q, rho = rho, missing.type = "MAR",
                        Beta.row.sparsity = 0.3, Beta.elm.sparsity = 0.2)


## Simulate a dataset with response values missing not at random (MNAR),
## using the user-supplied 'Beta' and 'Theta'.
n <- 300; p <- 50; q <- 20
Beta <- matrix(rnorm(p*q, 0, 1), p, q)  # a nonsparse 'Beta' (p x q)
Theta <- diag(q)  # a diagonal 'Theta' (q x q)
sim.dat <- generateData(Beta = Beta, Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## ---------------------------------------------------------------------
## Specifying just one of 'Beta' and 'Theta' is also allowed.
sim.dat <- generateData(Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")


## User-supplied 'X', 'Beta' and 'E', in which case 'Y' is deterministic.
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p, 0, 1), n, p)
Beta <- matrix(rnorm(p*q, 0, 1), p, q)
E <- mvtnorm::rmvnorm(n, rep(0, q), sigma = diag(q))
sim.dat <- generateData(X = X, Beta = Beta, E = E, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MCAR")

[Package missoNet version 1.2.0 Index]