| generateData {missoNet} | R Documentation |
Quickly generate synthetic data for simulation studies
Description
The ‘generateData’ function produces synthetic data with randomly or systematically missing values from a conditional Gaussian graphical model.
It supports three user-selectable missing-data mechanisms: missing completely at random (MCAR), missing at random (MAR) and
missing not at random (MNAR).
Usage
generateData(
  X = NULL,
  Beta = NULL,
  E = NULL,
  Theta = NULL,
  Sigma.X = NULL,
  n,
  p,
  q,
  rho,
  missing.type = "MCAR",
  Beta.row.sparsity = 0.2,
  Beta.elm.sparsity = 0.2,
  with.seed = NULL
)
Arguments
| X | (Optional) a user-supplied predictor matrix (n x p). |
| Beta | (Optional) a user-supplied regression coefficient matrix (p x q). |
| E | (Optional) a user-supplied error matrix (n x q). |
| Theta | (Optional) a user-supplied positive definite precision (inverse covariance) matrix (q x q). |
| Sigma.X | (Optional) a user-supplied positive definite covariance matrix for the predictors (p x p). |
| n | Sample size. |
| p | The dimensionality of the predictors. |
| q | The dimensionality of the responses. |
| rho | A scalar or a numeric vector of length 'q' giving the approximate missing rate for each column of the response matrix. |
| missing.type | Character string: can be "MCAR" (default), "MAR" or "MNAR". |
| Beta.row.sparsity | A Bernoulli parameter between 0 and 1 controlling the approximate proportion of rows where at least one element could be nonzero in 'Beta'. |
| Beta.elm.sparsity | A Bernoulli parameter between 0 and 1 controlling the approximate proportion of nonzero elements among the rows of 'Beta'. |
| with.seed | A random number seed for the generative process. |
Details
The dataset is simulated through the following steps:
1. If 'X = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma.X)’ is used to simulate 'n' samples from a 'p'-variate Gaussian distribution, which form the predictor matrix 'X';
2. If 'Beta = NULL' (default), the function ‘stats::rnorm(p*q, 0, 1)’ is used to fill an empty (p x q) matrix 'Beta', whose row sparsity and element sparsity are then controlled by the auxiliary arguments 'Beta.row.sparsity' and 'Beta.elm.sparsity', respectively;
3. If 'E = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, q), Sigma = solve(Theta))’ is used to simulate 'n' samples from a 'q'-variate Gaussian distribution, which form the error matrix 'E';
4. A complete response matrix 'Y' without missing values is then generated by the command 'Y = X %*% Beta + E';
5. To get a response matrix 'Z' := f('Y') corrupted by missing data, the values in 'Y' are partially replaced with 'NA's following the strategy specified by the arguments 'missing.type' and 'rho'.
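The steps above can be sketched in base R as follows. This is a minimal illustration and not the package's actual code: the helper 'rmvn' is hypothetical (a Cholesky-based sampler swapped in for the multivariate normal sampler so that the sketch runs without extra packages), the identity 'Sigma.X' and 'Theta' are placeholder inputs, and the Bernoulli masks used to sparsify 'Beta' are an assumption about how the two sparsity arguments might act.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n <- 300; p <- 50; q <- 20
rho <- 0.1

# Placeholder inputs (assumption): identity covariance and precision matrices.
Sigma.X <- diag(p)
Theta   <- diag(q)

# Hypothetical helper: draw n rows from N(0, Sigma). chol() returns the upper
# triangular R with t(R) %*% R == Sigma, so Z %*% R has covariance Sigma.
rmvn <- function(n, Sigma) {
  matrix(rnorm(n * ncol(Sigma)), n, ncol(Sigma)) %*% chol(Sigma)
}

# Step 1: predictor matrix X (n x p).
X <- rmvn(n, Sigma.X)

# Step 2: dense Gaussian Beta (p x q), then sparsified by two Bernoulli masks
# (assumed mechanism): one per row, one per element.
Beta     <- matrix(rnorm(p * q, 0, 1), p, q)
row.keep <- rbinom(p, 1, 0.2)                    # rows allowed to be nonzero
elm.keep <- matrix(rbinom(p * q, 1, 0.2), p, q)  # elements kept within those rows
Beta     <- Beta * row.keep * elm.keep           # row.keep recycles down each column

# Step 3: error matrix E (n x q) with covariance solve(Theta).
E <- rmvn(n, solve(Theta))

# Step 4: complete response matrix Y (n x q).
Y <- X %*% Beta + E

# Step 5 (MCAR case only): replace entries of Y by NA with probability rho.
M <- matrix(rbinom(n * q, 1, rho), n, q)
Z <- Y
Z[M == 1] <- NA
```

Inspecting 'mean(is.na(Z))' afterwards gives the realized overall missing rate, which should be close to 'rho'.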
To illustrate step 5 above, suppose that for all i = 1,...,n and j = 1,...,q, 'Y[i, j]' is replaced with 'NA'
if 'M[i, j] == 1', where 'M' is an indicator matrix of missingness with the same dimensions as 'Y'.
The value of 'M[i, j]' is partially controlled by the arguments 'missing.type' and 'rho'.
The three built-in missing mechanisms supported by the ‘generateData’ function are summarized below:
- 'missing.type' == "MCAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j])';
- 'missing.type' == "MAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j] * c / (1 + exp(-(X %*% Beta)[i, j])))', in which c is a constant correcting the missing rate of the jth column of 'Y' to 'rho[j]';
- 'missing.type' == "MNAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = 1 * (Y[i, j] < Tj)', in which 'Tj = quantile(Y[ , j], rho[j])'.
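The three mechanisms above can be compared side by side with a small base R sketch. It is illustrative only: the toy dimensions and the names 'M.mcar', 'M.mar', 'M.mnar' and 'c.j' are hypothetical, and the formula used for the correcting constant (the reciprocal of the mean logistic weight in column j) is one plausible choice, not necessarily the one used inside the package.

```r
set.seed(2)                      # arbitrary seed
n <- 500; p <- 5; q <- 3
rho <- c(0.05, 0.10, 0.20)       # target missing rate per response column

X    <- matrix(rnorm(n * p), n, p)
Beta <- matrix(rnorm(p * q), p, q)
XB   <- X %*% Beta
Y    <- XB + matrix(rnorm(n * q), n, q)

M.mcar <- M.mar <- M.mnar <- matrix(0L, n, q)
for (j in seq_len(q)) {
  # MCAR: every entry is missing with the same probability rho[j].
  M.mcar[, j] <- rbinom(n, 1, rho[j])

  # MAR: the missing probability depends on X through a logistic weight.
  w   <- plogis(XB[, j])         # 1 / (1 + exp(-XB[i, j]))
  c.j <- 1 / mean(w)             # assumed constant so the mean probability is rho[j]
  M.mar[, j] <- rbinom(n, 1, pmin(rho[j] * c.j * w, 1))

  # MNAR: the smallest values of Y[, j] are the ones that go missing.
  M.mnar[, j] <- 1L * (Y[, j] < quantile(Y[, j], rho[j]))
}

# Realized missing rates per column under each mechanism.
rbind(MCAR = colMeans(M.mcar), MAR = colMeans(M.mar), MNAR = colMeans(M.mnar))
```

Under "MNAR" the realized rates match 'rho' almost exactly because the threshold is an empirical quantile, whereas under "MCAR" and "MAR" they fluctuate around 'rho' from one draw to the next.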
Of these missing mechanisms, "MCAR" is random and the other two are systematic:
- under "MCAR", 'M[i, j]' is not related to 'Y' or to 'X';
- under "MAR", 'M[i, j]' is related to 'X', but not related to 'Y' after 'X' is controlled;
- under "MNAR", 'M[i, j]' is related to 'Y' itself, even after 'X' is controlled.
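The practical difference between random and systematic missingness can be checked numerically. The following single-column sketch (hypothetical variable names, no predictors, a standard normal response) compares how strongly the missingness indicator correlates with the response under "MCAR" and "MNAR", and what that does to the mean of the observed values.

```r
set.seed(3)                      # arbitrary seed
n   <- 5000
rho <- 0.1
y   <- rnorm(n)                  # a single complete response column

M.mcar <- rbinom(n, 1, rho)                # missingness unrelated to y
M.mnar <- 1 * (y < quantile(y, rho))       # missingness driven by y itself

# Correlation between the missingness indicator and the response.
cor.mcar <- cor(M.mcar, y)
cor.mnar <- cor(M.mnar, y)

# Mean of the values that remain observed under each mechanism.
obs.mean.mcar <- mean(y[M.mcar == 0])
obs.mean.mnar <- mean(y[M.mnar == 0])

c(cor.mcar = cor.mcar, cor.mnar = cor.mnar,
  obs.mean.mcar = obs.mean.mcar, obs.mean.mnar = obs.mean.mnar)
```

Under "MCAR" the indicator is essentially uncorrelated with 'y' and the observed mean stays close to the full-data mean. Under "MNAR" the correlation is clearly negative and the observed mean is shifted upward, because the lowest values are the ones removed.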
Value
This function returns a 'list' consisting of the following components:
| X | A simulated (or the user-supplied) predictor matrix (n x p). |
| Y | A simulated response matrix without missing values (n x q). |
| Z | A simulated response matrix with missing values coded as 'NA's (n x q). |
| Beta | The regression coefficient matrix (p x q). |
| Theta | The precision matrix (q x q). |
| rho | A vector of length 'q' giving the missing rate for each column of the response matrix. |
| missing.type | Character string: the type of missing mechanism used to generate missing values in the response matrix. |
Author(s)
Yixiao Zeng yixiao.zeng@mail.mcgill.ca, Celia M.T. Greenwood and Archer Yi Yang.
Examples
## Simulate a dataset with response values missing completely at random (MCAR),
## the overall missing rate is around 10%.
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, missing.type = "MCAR")
## -------------------------------------------------------------------------------
## Fit a missoNet model using the simulated dataset.
X <- sim.dat$X # predictor matrix
Y <- sim.dat$Z # corrupted response matrix
fit <- missoNet(X = X, Y = Y, lambda.Beta = 0.1, lambda.Theta = 0.1)
## Simulate a dataset with response values missing at random (MAR), the approximate
## missing rate for each column of the response matrix is specified through a vector 'rho'.
##
## The row sparsity and element sparsity of the auto-generated 'Beta' could be
## adjusted correspondingly by using 'Beta.row.sparsity' and 'Beta.elm.sparsity'.
n <- 300; p <- 50; q <- 20
rho <- runif(q, min = 0, max = 0.2)
sim.dat <- generateData(n = n, p = p, q = q, rho = rho, missing.type = "MAR",
                        Beta.row.sparsity = 0.3, Beta.elm.sparsity = 0.2)
## Simulate a dataset with response values missing not at random (MNAR),
## using the user-supplied 'Beta' and 'Theta'.
n <- 300; p <- 50; q <- 20
Beta <- matrix(rnorm(p*q, 0, 1), p, q) # a nonsparse 'Beta' (p x q)
Theta <- diag(q) # a diagonal 'Theta' (q x q)
sim.dat <- generateData(Beta = Beta, Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## ---------------------------------------------------------------------
## Specifying just one of 'Beta' and 'Theta' is also allowed.
sim.dat <- generateData(Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## User-supplied 'X', 'Beta' and 'E', in which case 'Y' is deterministic.
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p, 0, 1), n, p)
Beta <- matrix(rnorm(p*q, 0, 1), p, q)
E <- mvtnorm::rmvnorm(n, rep(0, q), sigma = diag(q))
sim.dat <- generateData(X = X, Beta = Beta, E = E, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MCAR")