generateData {missoNet} | R Documentation |
Quickly generate synthetic data for simulation studies
Description
The ‘generateData’ function produces synthetic data with randomly or systematically missing values from a conditional Gaussian graphical model.
It supports three user-selectable missing mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
Usage
generateData(
X = NULL,
Beta = NULL,
E = NULL,
Theta = NULL,
Sigma.X = NULL,
n,
p,
q,
rho,
missing.type = "MCAR",
Beta.row.sparsity = 0.2,
Beta.elm.sparsity = 0.2,
with.seed = NULL
)
Arguments
X |
(Optional) a user-supplied predictor matrix ( |
Beta |
(Optional) a user-supplied regression coefficient matrix |
E |
(Optional) a user-supplied error matrix ( |
Theta |
(Optional) a user-supplied positive definite precision (inverse covariance) matrix |
Sigma.X |
(Optional) a user-supplied positive definite covariance matrix |
n |
Sample size. |
p |
The dimensionality of the predictors. |
q |
The dimensionality of the responses. |
rho |
A scalar or a numeric vector of length |
missing.type |
Character string: can be " |
Beta.row.sparsity |
A Bernoulli parameter between 0 and 1 controlling the approximate proportion of rows where at least one element could be nonzero in |
Beta.elm.sparsity |
A Bernoulli parameter between 0 and 1 controlling the approximate proportion of nonzero elements among the rows of |
with.seed |
A random number seed for the generative process. |
Details
The dataset is simulated through the following steps:
1. If 'X = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma.X)’ is used to simulate 'n' samples from a 'p'-variate Gaussian distribution for generating a predictor matrix 'X';
2. If 'Beta = NULL' (default), the function ‘stats::rnorm(p*q, 0, 1)’ is used to fill an empty (p x q)-dimensional matrix 'Beta', of which the row sparsity and element sparsity are later controlled by the auxiliary arguments 'Beta.row.sparsity' and 'Beta.elm.sparsity', respectively;
3. If 'E = NULL' (default), the function ‘MASS::mvrnorm(n, mu = rep(0, q), Sigma = solve(Theta))’ is used to simulate 'n' samples from a 'q'-variate Gaussian distribution for generating an error matrix 'E';
4. A complete response matrix 'Y' without missing values is then generated by the command 'Y = X %*% Beta + E';
5. To get a response matrix 'Z' := f('Y') corrupted by missing data, the values in 'Y' are partially replaced with 'NA's following the strategy specified by the arguments 'missing.type' and 'rho'.
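The five steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: the helper 'rmvn_chol' is hypothetical and draws Gaussian samples through a Cholesky factor instead of calling 'MASS::mvrnorm', 'Sigma.X' and 'Theta' are taken to be identity matrices, the sparsity of 'Beta' is imposed with simple Bernoulli masks ('row.keep' and 'elm.keep' are illustrative names), and the final step uses the MCAR rule only.

```r
set.seed(1)
n <- 300; p <- 50; q <- 20; rho <- 0.1

# Hypothetical helper: n draws from N(0, Sigma) via the Cholesky factor of Sigma.
rmvn_chol <- function(n, Sigma) {
  d <- nrow(Sigma)
  matrix(rnorm(n * d), n, d) %*% chol(Sigma)
}

# Step 1: predictors X ~ N(0, Sigma.X); identity covariance assumed here.
Sigma.X <- diag(p)
X <- rmvn_chol(n, Sigma.X)

# Step 2: dense Gaussian Beta, then sparsified (assumed masking scheme).
Beta <- matrix(rnorm(p * q, 0, 1), p, q)
row.keep <- rbinom(p, 1, 0.2)                    # rows allowed to be nonzero
elm.keep <- matrix(rbinom(p * q, 1, 0.2), p, q)  # elements kept within those rows
Beta <- Beta * row.keep * elm.keep

# Step 3: errors E ~ N(0, solve(Theta)); identity precision assumed here.
Theta <- diag(q)
E <- rmvn_chol(n, solve(Theta))

# Step 4: complete responses.
Y <- X %*% Beta + E

# Step 5: corrupt Y with NAs (MCAR rule only in this sketch).
M <- matrix(rbinom(n * q, 1, rho), n, q)
Z <- Y
Z[M == 1] <- NA
```

After running the sketch, 'colMeans(is.na(Z))' gives the empirical missing rate of each response column, which can be compared with 'rho'.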
To better illustrate step 5 above, suppose that, for all i = 1,...,n and j = 1,...,q, 'Y[i, j]' is replaced with 'NA' if 'M[i, j] == 1', where 'M' is an indicator matrix of missingness having the same dimensions as 'Y'.
The value of 'M[i, j]' is partially controlled by the arguments 'missing.type' and 'rho'.
Below we summarize the three built-in missing mechanisms supported by the ‘generateData’ function:
- 'missing.type' == "MCAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j])';
- 'missing.type' == "MAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = rbinom(1, 1, rho[j] * c / (1 + exp(-(X %*% Beta)[i, j])))', in which c is a constant correcting the missing rate of the j-th column of 'Y' to 'rho[j]';
- 'missing.type' == "MNAR": 'Y[i, j] <- NA' if 'M[i, j] == 1', where 'M[i, j] = 1 * (Y[i, j] < Tj)', in which 'Tj = quantile(Y[ , j], rho[j])'.
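The three rules can be written as indicator matrices in a few lines of base R. The sketch below works on toy inputs and makes assumptions that this help page does not confirm: the constant c is taken to be the reciprocal of the column mean of the logistic term, so that the average missing probability in column j equals 'rho[j]', and probabilities are capped at 1. The names 'rho.mat', 'lin', 'sig', 'c.j', and 'prob.mar' are illustrative only.

```r
set.seed(2)
n <- 500; p <- 5; q <- 3
rho <- c(0.05, 0.10, 0.20)                  # target missing rate per column

X <- matrix(rnorm(n * p), n, p)
Beta <- matrix(rnorm(p * q), p, q)
Y <- X %*% Beta + matrix(rnorm(n * q), n, q)
rho.mat <- matrix(rho, n, q, byrow = TRUE)  # rho[j] repeated down column j

# MCAR: independent Bernoulli(rho[j]) draws.
M.mcar <- matrix(rbinom(n * q, 1, rho.mat), n, q)

# MAR: probability depends on X through a logistic term, rescaled per column.
lin <- X %*% Beta
sig <- 1 / (1 + exp(-lin))
c.j <- 1 / colMeans(sig)                    # assumed form of the constant c
prob.mar <- pmin(rho.mat * sig * matrix(c.j, n, q, byrow = TRUE), 1)
M.mar <- matrix(rbinom(n * q, 1, prob.mar), n, q)

# MNAR: values below the rho[j]-quantile of column j go missing.
Tj <- sapply(seq_len(q), function(j) quantile(Y[, j], rho[j]))
M.mnar <- 1 * (Y < matrix(Tj, n, q, byrow = TRUE))

# Empirical missing rates per column, to compare with rho.
rates <- rbind(MCAR = colMeans(M.mcar),
               MAR  = colMeans(M.mar),
               MNAR = colMeans(M.mnar))
round(rates, 3)
```

The MCAR and MAR rates fluctuate around 'rho' from run to run, while the MNAR rate is fixed by the quantile threshold.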
Of the aforementioned missing mechanisms, "MCAR" is random, and the other two are systematic:
- under "MCAR", 'M[i, j]' is not related to 'Y' or to 'X';
- under "MAR", 'M[i, j]' is related to 'X', but not related to 'Y' after 'X' is controlled for;
- under "MNAR", 'M[i, j]' is related to 'Y' itself, even after 'X' is controlled for.
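A small experiment on a single toy response column makes the distinction concrete. The sketch below assumes nothing about the package internals; it applies the MCAR and MNAR rules to the same simulated column and compares the mean of the observed values with the mean of the complete column. Under MNAR the lowest values are removed by construction, so the observed mean is shifted upward, whereas under MCAR any difference is sampling noise. All variable names are illustrative.

```r
set.seed(3)
n <- 1000; rho <- 0.2
x <- rnorm(n)
y <- 2 * x + rnorm(n)                  # complete response column

m.mcar <- rbinom(n, 1, rho)            # missingness unrelated to x or y
m.mnar <- 1 * (y < quantile(y, rho))   # missingness driven by y itself

# Mean of the complete column versus means of the observed entries.
c(full          = mean(y),
  observed.mcar = mean(y[m.mcar == 0]),
  observed.mnar = mean(y[m.mnar == 0]))
```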
Value
This function returns a 'list'
consisting of the following components:
X |
A simulated (or the user-supplied) predictor matrix ( |
Y |
A simulated response matrix without missing values ( |
Z |
A simulated response matrix with missing values coded as |
Beta |
The regression coefficient matrix |
Theta |
The precision matrix |
rho |
A vector of length |
missing.type |
Character string: the type of missing mechanism used to generate missing values in the response matrix. |
Author(s)
Yixiao Zeng yixiao.zeng@mail.mcgill.ca, Celia M.T. Greenwood and Archer Yi Yang.
Examples
## Simulate a dataset with response values missing completely at random (MCAR),
## the overall missing rate is around 10%.
sim.dat <- generateData(n = 300, p = 50, q = 20, rho = 0.1, missing.type = "MCAR")
## -------------------------------------------------------------------------------
## Fit a missoNet model using the simulated dataset.
X <- sim.dat$X # predictor matrix
Y <- sim.dat$Z # corrupted response matrix
fit <- missoNet(X = X, Y = Y, lambda.Beta = 0.1, lambda.Theta = 0.1)
## Simulate a dataset with response values missing at random (MAR), the approximate
## missing rate for each column of the response matrix is specified through a vector 'rho'.
##
## The row sparsity and element sparsity of the auto-generated 'Beta' could be
## adjusted correspondingly by using 'Beta.row.sparsity' and 'Beta.elm.sparsity'.
n <- 300; p <- 50; q <- 20
rho <- runif(q, min = 0, max = 0.2)
sim.dat <- generateData(n = n, p = p, q = q, rho = rho, missing.type = "MAR",
                        Beta.row.sparsity = 0.3, Beta.elm.sparsity = 0.2)
## Simulate a dataset with response values missing not at random (MNAR),
## using the user-supplied 'Beta' and 'Theta'.
n <- 300; p <- 50; q <- 20
Beta <- matrix(rnorm(p*q, 0, 1), p, q) # a nonsparse 'Beta' (p x q)
Theta <- diag(q) # a diagonal 'Theta' (q x q)
sim.dat <- generateData(Beta = Beta, Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## ---------------------------------------------------------------------
## Specifying just one of 'Beta' and 'Theta' is also allowed.
sim.dat <- generateData(Theta = Theta, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MNAR")
## User-supplied 'X', 'Beta' and 'E', in which case 'Y' is deterministic.
n <- 300; p <- 50; q <- 20
X <- matrix(rnorm(n*p, 0, 1), n, p)
Beta <- matrix(rnorm(p*q, 0, 1), p, q)
E <- mvtnorm::rmvnorm(n, rep(0, q), sigma = diag(q))
sim.dat <- generateData(X = X, Beta = Beta, E = E, n = n, p = p, q = q,
                        rho = 0.1, missing.type = "MCAR")