R: Quickly generate synthetic data for simulation studies

generateData {missoNet}

R Documentation

Quickly generate synthetic data for simulation studies

Description

The ‘generateData’ function simulates synthetic data with randomly or systematically missing values from a conditional Gaussian graphical model. It supports three user-selectable missing mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).

Usage

generateData(
  X = NULL,
  Beta = NULL,
  E = NULL,
  Theta = NULL,
  Sigma.X = NULL,
  n,
  p,
  q,
  rho,
  missing.type = "MCAR",
  Beta.row.sparsity = 0.2,
  Beta.elm.sparsity = 0.2,
  with.seed = NULL
)

Arguments

`X`	(Optional) a user-supplied predictor matrix (`n\times p`). The default is `'NULL'` and the program simulates the rows of `'X'` independently from `\mathcal{MVN}`(`0_p`, `\mathbf{\Sigma}_X`). A user-supplied matrix overrides this default, and the argument `'Sigma.X'` for `\mathbf{\Sigma}_X` will be ignored.
`Beta`	(Optional) a user-supplied regression coefficient matrix `\mathbf{B}` (`p\times q`). The default is `'NULL'` and the program will generate a sparse `\mathbf{B}` in which the nonzero elements are independently drawn from `\mathcal{N}(0, 1)`; the row sparsity and element sparsity of `\mathbf{B}` are controlled by the arguments `'Beta.row.sparsity'` and `'Beta.elm.sparsity'`, respectively. A user-supplied matrix overrides this default, and `'Beta.row.sparsity'` and `'Beta.elm.sparsity'` will be ignored.
`E`	(Optional) a user-supplied error matrix (`n\times q`). The default is `'NULL'` and the program simulates the rows of `'E'` independently from `\mathcal{MVN}`(`0_q`, `\mathbf{\Theta}^{-1}`). A response matrix `'Y'` without missing values is given by `'Y = X %*% Beta + E'`. A user-supplied matrix overrides this default, and the argument `'Theta'` for `\mathbf{\Theta}` will be ignored.
`Theta`	(Optional) a user-supplied positive definite precision (inverse covariance) matrix `\mathbf{\Theta}` (`q\times q`) for the response variables. The default is `'NULL'` and the program will generate a block-structured matrix having four blocks corresponding to four types of network structures: independent, weak graph, strong graph and chain. This is only needed when `'E = NULL'`.
`Sigma.X`	(Optional) a user-supplied positive definite covariance matrix `\mathbf{\Sigma}_X` (`p\times p`) for the predictor variables. The samples of `'X'` are independently drawn from a multivariate Gaussian distribution `\mathcal{MVN}`(`0_p`, `\mathbf{\Sigma}_X`). If `'Sigma.X = NULL'` (default), the program uses an AR(1) covariance with 0.7 autocorrelation (i.e., `[\mathbf{\Sigma}_X]_{jk} = 0.7^{|j-k|}`). This is only needed when `'X = NULL'`.
`n`	Sample size.
`p`	The dimensionality of the predictors.
`q`	The dimensionality of the responses.
`rho`	A scalar or a numeric vector of length `q` specifying the approximate proportion of missing values in each column of the response matrix.
`missing.type`	Character string: can be "`MCAR`" (default), "`MAR`" or "`MNAR`".
`Beta.row.sparsity`	A Bernoulli parameter between 0 and 1 controlling the approximate proportion of rows where at least one element could be nonzero in `\mathbf{B}`; the default is `'Beta.row.sparsity = 0.2'`. This is only needed when `'Beta = NULL'`.
`Beta.elm.sparsity`	A Bernoulli parameter between 0 and 1 controlling the approximate proportion of nonzero elements among the rows of `\mathbf{B}` where not all elements are zeros; the default is `'Beta.elm.sparsity = 0.2'`. This is only needed when `'Beta = NULL'`.
`with.seed`	A random number seed for the generative process.
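
The two-level sparsity controlled by `'Beta.row.sparsity'` and `'Beta.elm.sparsity'` can be pictured with a minimal base-R sketch. It assumes that each row is switched on by one Bernoulli draw and that each element is kept by a second, independent Bernoulli draw; the package's internal implementation may differ, and the names `row.active` and `elm.active` are illustrative only.

```r
set.seed(1)
p <- 50; q <- 20
Beta.row.sparsity <- 0.2
Beta.elm.sparsity <- 0.2

# Start from a dense (p x q) matrix of N(0, 1) draws
Beta <- matrix(rnorm(p * q, 0, 1), p, q)

# Row-level mask (assumed): a row may hold nonzero entries
# with probability Beta.row.sparsity
row.active <- rbinom(p, 1, Beta.row.sparsity)

# Element-level mask (assumed): each entry is kept
# with probability Beta.elm.sparsity
elm.active <- matrix(rbinom(p * q, 1, Beta.elm.sparsity), p, q)

# An entry survives only if both its row and the entry itself are active;
# the length-p vector 'row.active' is recycled down each column
Beta <- Beta * row.active * elm.active

mean(rowSums(Beta != 0) > 0)  # share of rows with at least one nonzero entry
```

Under these assumptions an active row can still end up entirely zero by chance, which is one reason the proportions are described as approximate.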

Details

The dataset is simulated through the following steps:

1. If 'X = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma.X)’ is used to simulate 'n' samples from a 'p'-variate Gaussian distribution for generating a predictor matrix 'X';
2. If 'Beta = NULL' (default), the function ‘stats::rnorm(p*q, 0, 1)’ is used to fill an empty (p \times q) dimensional matrix 'Beta', of which the row sparsity and element sparsity are later controlled by the auxiliary arguments 'Beta.row.sparsity' and 'Beta.elm.sparsity', respectively;
3. If 'E = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, q), Sigma = solve(Theta))’ is used to simulate 'n' samples from a 'q'-variate Gaussian distribution for generating an error matrix 'E';
4. A complete response matrix 'Y' without missing values is then generated by the command 'Y = X %*% Beta + E';
5. To get a response matrix 'Z' := f('Y') corrupted by missing data, the values in 'Y' are partially replaced with 'NA's following the strategy specified by the arguments 'missing.type' and 'rho'.
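
The first four steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's source: multivariate normal rows are drawn with a Cholesky factor instead of a dedicated sampler, 'Beta' is left dense (sparsity masking is omitted), and an identity 'Theta' stands in for the block-structured default.

```r
set.seed(1)
n <- 300; p <- 50; q <- 20

# AR(1) predictor covariance: Sigma.X[j, k] = 0.7^|j - k|
Sigma.X <- 0.7^abs(outer(1:p, 1:p, "-"))

# Step 1: rows of X ~ MVN(0_p, Sigma.X).
# chol() returns upper-triangular R with t(R) %*% R = Sigma.X,
# so rows of Z %*% R have covariance Sigma.X when Z is i.i.d. N(0, 1)
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma.X)

# Step 2 (simplified): dense Beta with N(0, 1) entries
Beta <- matrix(rnorm(p * q, 0, 1), p, q)

# Step 3: rows of E ~ MVN(0_q, solve(Theta)); identity Theta assumed here
Theta <- diag(q)
E <- matrix(rnorm(n * q), n, q) %*% chol(solve(Theta))

# Step 4: complete response matrix without missing values
Y <- X %*% Beta + E
dim(Y)
```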

To illustrate the last step above (step 5): for all i = 1,...,n and j = 1,...,q, 'Y[i, j]' is replaced with 'NA' if 'M[i, j] == 1', where 'M' is an indicator matrix of missingness with the same dimensions as 'Y'. The value of 'M[i, j]' is governed by the arguments 'missing.type' and 'rho'. The three built-in missing mechanisms supported by the ‘generateData’ function are summarized below:

'missing.type' == "MCAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j])';
'missing.type' == "MAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j] * c / (1 + exp(-(X %*% Beta)[i, j])))', in which c is a constant correcting the missing rate of the jth column of 'Y' to 'rho[j]';
'missing.type' == "MNAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = 1 * (Y[i, j] < Tj)', in which 'Tj = quantile(Y[ , j], rho[j])'.

Of the aforementioned missing mechanisms, "MCAR" is random, and the other two are systematic. Under "MCAR", 'M[i, j]' is not related to 'Y' or to 'X'; under "MAR", 'M[i, j]' is related to 'X', but not related to 'Y' after 'X' is controlled; under "MNAR", 'M[i, j]' is related to 'Y' itself, even after 'X' is controlled.
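
The three mechanisms can be compared side by side with a small base-R sketch. The helper `make_mask` is hypothetical and not part of the package. For "MAR", the correcting constant is assumed to be the reciprocal of the column mean of the logistic term, and probabilities are capped at 1, which can pull the realized rate slightly below 'rho[j]'; the package may compute the constant differently.

```r
set.seed(1)
n <- 200; p <- 5; q <- 3
X <- matrix(rnorm(n * p), n, p)
Beta <- matrix(rnorm(p * q), p, q)
Y <- X %*% Beta + matrix(rnorm(n * q), n, q)
rho <- rep(0.1, q)
XB <- X %*% Beta

# Hypothetical helper: build the (n x q) missingness indicator matrix M
make_mask <- function(type) {
  M <- matrix(0L, n, q)
  for (j in seq_len(q)) {
    if (type == "MCAR") {
      # Bernoulli(rho[j]), unrelated to X and Y
      M[, j] <- rbinom(n, 1, rho[j])
    } else if (type == "MAR") {
      # Probability depends on X through the logistic term
      s <- 1 / (1 + exp(-XB[, j]))
      c.j <- 1 / mean(s)  # assumed correction: mean probability equals rho[j]
      M[, j] <- rbinom(n, 1, pmin(rho[j] * c.j * s, 1))
    } else {
      # MNAR: values below the rho[j]-quantile of the column go missing
      M[, j] <- 1L * (Y[, j] < quantile(Y[, j], rho[j]))
    }
  }
  M
}

M <- make_mask("MNAR")
Z <- Y
Z[M == 1] <- NA
colMeans(is.na(Z))  # realized missing rate per column, to compare with rho
```

Swapping the argument to "MCAR" or "MAR" gives masks with a similar overall rate but a different dependence on 'X' and 'Y'.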

Value

This function returns a 'list' consisting of the following components:

`X`	A simulated (or the user-supplied) predictor matrix (`n\times p`).
`Y`	A simulated response matrix without missing values (`n\times q`).
`Z`	A simulated response matrix with missing values coded as `'NA'`s (`n\times q`).
`Beta`	The regression coefficient matrix `\mathbf{B}` for the generative model (`p\times q`).
`Theta`	The precision matrix `\mathbf{\Theta}` for the generative model (`q\times q`).
`rho`	A vector of length `q` storing the specified missing rate for each column of the response matrix.
`missing.type`	Character string: the type of missing mechanism used to generate missing values in the response matrix.
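
A quick sanity check on the returned components is to compare the realized missing rate of each column of 'Z' with 'rho', and to confirm that the observed entries of 'Z' agree with 'Y'. The sketch below builds a hypothetical stand-in list with only the components it uses, so that it runs without the package; with a real result, `sim.dat` would simply be the output of ‘generateData’.

```r
set.seed(1)
n <- 300; q <- 4
rho <- c(0.05, 0.1, 0.15, 0.2)

# Hypothetical stand-in for the returned list (MCAR-style masking)
Y <- matrix(rnorm(n * q), n, q)
Z <- Y
for (j in seq_len(q)) Z[rbinom(n, 1, rho[j]) == 1, j] <- NA
sim.dat <- list(Y = Y, Z = Z, rho = rho)

# Realized versus requested missing rate for each response column
realized <- colMeans(is.na(sim.dat$Z))
round(cbind(requested = sim.dat$rho, realized = realized), 3)

# Entries that are observed in Z should be identical to those in Y
obs <- !is.na(sim.dat$Z)
all(sim.dat$Z[obs] == sim.dat$Y[obs])
```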

Author(s)

Yixiao Zeng yixiao.zeng@mail.mcgill.ca, Celia M.T. Greenwood and Archer Yi Yang.

Examples

## Simulate a dataset with response values missing completely at random (MCAR);
## the overall missing rate is around 10%.
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, missing.type = "MCAR")
## -------------------------------------------------------------------------------
## Fit a missoNet model using the simulated dataset.
X <- sim.dat$X  # predictor matrix
Y <- sim.dat$Z  # corrupted response matrix
fit <- missoNet(X = X, Y = Y, lambda.Beta = 0.1, lambda.Theta = 0.1)


## Simulate a dataset with response values missing at random (MAR); the approximate
## missing rate for each column of the response matrix is specified through a vector 'rho'.
##
## The row sparsity and element sparsity of the auto-generated 'Beta' can be
## adjusted using 'Beta.row.sparsity' and 'Beta.elm.sparsity'.
n <- 300; p <- 50; q <- 20
rho <- runif(q, min = 0, max = 0.2)
sim.dat <- generateData(n = n, p = p, q = q, rho = rho, missing.type = "MAR",
                        Beta.row.sparsity = 0.3, Beta.elm.sparsity = 0.2)


## Simulate a dataset with response values missing not at random (MNAR), 
## using the user-supplied 'Beta' and 'Theta'.
n <- 300; p <- 50; q <- 20
Beta <- matrix(rnorm(p*q, 0, 1), p, q)  # a nonsparse 'Beta' (p x q)
Theta <- diag(q)  # a diagonal 'Theta' (q x q)
sim.dat <- generateData(Beta = Beta, Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## ---------------------------------------------------------------------
## Specifying just one of 'Beta' and 'Theta' is also allowed.
sim.dat <- generateData(Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")


## User-supplied 'X', 'Beta' and 'E', in which case 'Y' is deterministic.
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p, 0, 1), n, p)
Beta <- matrix(rnorm(p*q, 0, 1), p, q)
E <- mvtnorm::rmvnorm(n, rep(0, q), sigma = diag(q))
sim.dat <- generateData(X = X, Beta = Beta, E = E, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MCAR")

[Package missoNet version 1.2.0 Index]