rdata.frame {mi} | R Documentation |
Generate a random data.frame with tunable characteristics
Description
This function generates a random data.frame with a missingness mechanism
that is used to impose a missingness pattern. The primary purpose of this
function is for use in simulations.
Usage
rdata.frame(N = 1000,
restrictions = c("none", "MARish", "triangular", "stratified", "MCAR"),
last_CPC = NA_real_, strong = FALSE, pr_miss = .25, Sigma = NULL,
alpha = NULL, experiment = FALSE,
treatment_cor = c(rep(0, n_full - 1), rep(NA, 2 * n_partial)),
n_full = 1, n_partial = 1, n_cat = NULL,
eta = 1, df = Inf, types = "continuous", estimate_CPCs = TRUE)
Arguments
N |
integer indicating the number of observations |
restrictions |
character string indicating what restrictions to impose on the missing data mechanism, see the Details section |
last_CPC |
a numeric scalar between |
strong |
Integer among 0, 1, and 2 indicating how strong to
make the instruments with multiple partially observed variables,
in which case the missingness indicators for each partially observed variable
can be used as instruments when predicting missingness on other partially
observed variables. Only applies when |
pr_miss |
numeric scalar on the (0,1) interval or vector
of length |
Sigma |
Either |
alpha |
Either |
experiment |
logical indicating whether to simulate a randomized experiment |
treatment_cor |
Numeric vector of appropriate length indicating the
correlations between the treatment variable and the other variables, which
is only relevant if |
n_full |
integer indicating the number of fully observed variables |
n_partial |
integer indicating the number of partially observed variables |
n_cat |
Either |
eta |
Positive numeric scalar which serves as a hyperparameter in the data-generating process. The default value of 1 implies that the correlation matrix among the variables is jointly uniformly distributed, using essentially the same logic as in the clusterGeneration package |
df |
positive numeric scalar indicating the degrees of freedom for the
(possibly skewed) multivariate t distribution, which defaults to
|
types |
a character vector (possibly of length one, in which case it
is recycled) indicating the type for each fully observed and partially
observed variable, which currently can be among |
estimate_CPCs |
A logical indicating whether the canonical partial correlations
between the partially observed variables and the latent missingnesses should
be estimated. The default is |
Details
By default, the correlation matrix among the variables and missingness indicators
is intended to be close to uniform, although it is often not possible to achieve
exactly. If restrictions = "none", the data will be Not Missing At Random
(NMAR). If restrictions = "MARish", the departure from Missing At Random
(MAR) will be minimized via a call to optim, but generally will
not fully achieve MAR. If restrictions = "triangular", the MAR assumption
will hold but the missingness of each partially observed variable will only
depend on the fully observed variables and the other latent missingness indicators.
If restrictions = "stratified", the MAR assumption will hold but the
missingness of each partially observed variable will only depend on the fully
observed variables. If restrictions = "MCAR", the Missing Completely At
Random (MCAR) assumption holds, which is much more restrictive than MAR.
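The difference between the MCAR case and a MAR case in which missingness depends only on a fully observed variable can be sketched in base R. This is a minimal illustration under stated assumptions, not the mechanism that rdata.frame itself uses: the helper impose_missingness, the correlation rho, and the variables x and y are all hypothetical names introduced here, and a single thresholded standard-normal latent variable stands in for the latent missingness.

```r
set.seed(123)
N <- 1000
pr_miss <- 0.25
x <- rnorm(N)            # stand-in for a fully observed variable
y <- 0.5 * x + rnorm(N)  # stand-in for a variable that becomes partially observed

# Hypothetical helper (not part of mi): flag y as missing when a latent
# standard-normal variable falls below the pr_miss quantile. rho is the
# assumed correlation between that latent variable and x.
impose_missingness <- function(y, x, rho, pr_miss) {
  latent <- rho * as.numeric(scale(x)) + sqrt(1 - rho^2) * rnorm(length(y))
  y[latent < qnorm(pr_miss)] <- NA
  y
}

y_mcar <- impose_missingness(y, x, rho = 0,   pr_miss = pr_miss)  # ignores x
y_mar  <- impose_missingness(y, x, rho = 0.7, pr_miss = pr_miss)  # depends on x only

# Proportion missing is near pr_miss in both cases
mean(is.na(y_mcar))
mean(is.na(y_mar))

# The missingness indicator is unrelated to x under MCAR but not under MAR
cor(as.numeric(is.na(y_mcar)), x)
cor(as.numeric(is.na(y_mar)), x)
```

In this sketch the MCAR indicator carries no information about x, whereas the MAR indicator is clearly correlated with it, which is why MCAR is the more restrictive assumption.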
There are some rules to follow, particularly when specifying types.
First, if experiment = TRUE, there must be exactly one treatment
variable (taken to be binary) and it must come first to ensure that the
elements of treatment_cor are handled properly. Second, if there are any
partially observed nominal variables, they must come last; this is to ensure
that they are conditionally uncorrelated with each other. Third, fully observed
nominal variables are not supported, but they can be made into ordinal variables
and then converted to nominal after the fact. Fourth, including both ordinal and
nominal partially observed variables is not supported yet. Finally, if any
variable is specified as a count, it will not be exactly consistent with the
data-generating process. Essentially, a count variable is constructed from a
continuous variable by evaluating pt on it and passing that to
qpois with an intensity parameter of 5. The other non-continuous
variables are constructed via some transformation or discretization of a continuous
variable.
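The count construction described above can be sketched in base R as follows. This is only an illustration of the pt-then-qpois idea, not the package's internal code: the variable names and the choice of df = 5 are assumptions made for the example.

```r
set.seed(42)
df <- 5                          # assumed degrees of freedom for this illustration
z <- rt(1000, df = df)           # stand-in for one continuous t-distributed variable
u <- pt(z, df = df)              # map each value to its cumulative probability in (0, 1)
counts <- qpois(u, lambda = 5)   # Poisson quantile with an intensity parameter of 5

table(counts)                    # non-negative integer counts
mean(counts)                     # close to the intensity parameter
```

Because the counts are a deterministic, discretizing transformation of the continuous variable, they preserve its rank order but not its exact correlations with the other variables.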
If some partially observed variables are either ordinal or nominal (but not both),
then the n_cat argument governs how many categories there are. If n_cat
is NULL, then the number of categories defaults to three. If
n_cat has length one, then that number of categories will be used for all
categorical variables but must be greater than two. Otherwise, the length of
n_cat must match the number of partially observed categorical variables and
the number of categories for the i-th such variable will be the i-th element
of n_cat.
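The n_cat rules can be written down as a small helper. This is a sketch of the rules as stated in this section, not the package's own validation code; resolve_n_cat and its argument n_categorical (the number of partially observed categorical variables) are hypothetical names.

```r
# Hypothetical helper (not part of mi): expand n_cat to one entry per
# partially observed categorical variable, following the rules stated above.
resolve_n_cat <- function(n_cat, n_categorical) {
  if (is.null(n_cat)) {
    return(rep(3L, n_categorical))            # default: three categories each
  }
  if (length(n_cat) == 1) {
    if (n_cat <= 2) stop("a single n_cat value must be greater than two")
    return(rep(as.integer(n_cat), n_categorical))  # recycle for every variable
  }
  if (length(n_cat) != n_categorical) {
    stop("length of n_cat must match the number of categorical variables")
  }
  as.integer(n_cat)                           # i-th element applies to the i-th variable
}

resolve_n_cat(NULL, 2)
resolve_n_cat(4, 3)
resolve_n_cat(c(3, 5), 2)
```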
Value
A list with the following elements:
true |
a data.frame containing no NA values |
obs |
a data.frame derived from the previous with some NA values that represents a dataset that could be observed |
empirical_CPCs |
a numeric vector of empirical Canonical Partial Correlations, which should differ only randomly from zero iff MAR = TRUE and the data-generating process is multivariate normal |
L |
a Cholesky factor of the correlation matrix used to generate the true data |
In addition, if alpha is not NULL, then the following
elements are also included:
alpha |
the alpha vector utilized |
sn_skewness |
the skewness of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the skewness when df < Inf |
sn_kurtosis |
the kurtosis of the multivariate skewed normal distribution in the population; note that this value is only an approximation of the kurtosis when df < Inf |
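The relationship between a Cholesky factor such as L and the correlation matrix it encodes can be checked in base R. The sketch below uses a made-up 3 x 3 correlation matrix and assumes a lower-triangular factor, which is consistent with the use of tcrossprod on L in the Examples; it does not call rdata.frame.

```r
# A made-up correlation matrix used only for illustration
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.0), nrow = 3, byrow = TRUE)

L <- t(chol(Sigma))            # chol() returns the upper factor; transpose to get lower
Sigma_back <- tcrossprod(L)    # L %*% t(L) recovers the correlation matrix
all.equal(Sigma_back, Sigma)

# Correlated draws: multiply independent standard normals by t(L)
set.seed(1)
Z <- matrix(rnorm(3 * 5000), ncol = 3)
X <- Z %*% t(L)
round(cor(X), 2)               # approximately Sigma
```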
Author(s)
Ben Goodrich and Jonathan Kropko, for this version, based on earlier versions written by Yu-Sung Su, Masanao Yajima, Maria Grazia Pittau, Jennifer Hill, and Andrew Gelman.
See Also
data.frame, missing_data.frame
Examples
rdf <- rdata.frame(n_partial = 2, df = 5, alpha = rnorm(5))
print(rdf$empirical_CPCs) # not zero
rdf <- rdata.frame(n_partial = 2, restrictions = "triangular", alpha = NA)
print(rdf$empirical_CPCs) # only randomly different from zero
print(rdf$L == 0) # some are exactly zero by construction
mdf <- missing_data.frame(rdf$obs)
show(mdf)
hist(mdf)
image(mdf)
# a randomized experiment
rdf <- rdata.frame(n_full = 2, n_partial = 2,
                   restrictions = "triangular", experiment = TRUE,
                   types = c("t", "ord", "con", "pos"),
                   treatment_cor = c(0, 0, NA, 0, NA))
Sigma <- tcrossprod(rdf$L)
rownames(Sigma) <- colnames(Sigma) <- c("treatment", "X_2", "y_1", "Y_2",
                                        "missing_y_1", "missing_Y_2")
print(round(Sigma, 3))