questionnaire_gen {lsasim} | R Documentation |
Generation of ordinal and continuous variables
Description
Creates a data frame of discrete and continuous variables based on several arguments.
Usage
questionnaire_gen(
  n_obs,
  cat_prop = NULL,
  n_vars = NULL,
  n_X = NULL,
  n_W = NULL,
  cor_matrix = NULL,
  cov_matrix = NULL,
  c_mean = NULL,
  c_sd = NULL,
  theta = FALSE,
  family = NULL,
  full_output = FALSE,
  verbose = TRUE
)
Arguments
n_obs |
number of observations to generate. |
cat_prop |
list of cumulative proportions for each item. If |
n_vars |
total number of variables in the questionnaire, including the
continuous and the discrete covariates ( |
n_X |
number of continuous background variables. If not provided, a random number of continuous variables will be generated. |
n_W |
either a scalar corresponding to the number of categorical background variables or a list of scalars representing the number of categories for each categorical variable. If not provided, a random number of categorical variables will be generated. |
cor_matrix |
latent correlation matrix. The first row/column corresponds
to the latent trait ( |
cov_matrix |
latent covariance matrix, formatted as |
c_mean |
is a vector of population means for each continuous variable
( |
c_sd |
is a vector of population standard deviations for each continuous
variable ( |
theta |
if |
family |
distribution of the background variables. Can be NULL (default) or 'gaussian'. |
full_output |
if |
verbose |
if 'FALSE', output messages will be suppressed (useful for simulations). Defaults to 'TRUE' |
Details
In essence, this function begins by checking the validity of the
arguments provided and randomly generating those that were not provided.
Then, it calls one of two internal functions, questionnaire_gen_polychoric
or questionnaire_gen_family. The former corresponds to the exact
functionality of questionnaire_gen in lsasim 1.0.1, where the polychoric
correlations are used to generate the background questionnaire data. If
family != NULL, however, questionnaire_gen_family is called to generate
data based on a joint probability distribution. Additionally, if
full_output == TRUE, the external function beta_gen is called to generate
the correlation coefficients based on the true covariance matrix. The
latter argument also changes the class of the output of this function.
What follows are some notes on the input parameters.
cat_prop is a list where length(cat_prop) is the number of items to be
generated. Each element of the list is a vector containing the marginal
cumulative proportions for each category, so its last value is 1. For
continuous items, the associated element in the list should be 1.
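As a minimal base R sketch of the format described above (the values and the helper names is_valid, cat_prob and is_continuous are illustrative assumptions, not part of lsasim), cumulative proportions can be checked and converted to per-category probabilities as follows:

```r
# Cumulative proportions for three hypothetical items:
# one continuous item (1), one 3-category item, one 2-category item.
cat_prop <- list(1, c(0.25, 0.60, 1), c(0.40, 1))

# A well-formed entry is non-decreasing and ends at 1.
is_valid <- vapply(
  cat_prop,
  function(p) !is.unsorted(p) && isTRUE(all.equal(p[length(p)], 1)),
  logical(1)
)

# Differencing the cumulative proportions recovers the marginal
# probability of each category.
cat_prob <- lapply(cat_prop, function(p) diff(c(0, p)))

# Entries with a single element (1) denote continuous items.
is_continuous <- lengths(cat_prop) == 1
```

Here the second item has category probabilities of roughly 0.25, 0.35 and 0.40, which is what the cumulative vector c(0.25, 0.60, 1) encodes.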
cor_matrix and cov_matrix are the correlation and covariance matrices,
whose dimensions match length(cat_prop). The correlations refer to the
relations between variables on the latent scale.
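The relation between the two matrices can be sketched in base R. The matrix values and standard deviations below are hypothetical, and the final check only mirrors the size requirement stated above; it is not a reproduction of the package's own validation.

```r
# Hypothetical latent correlation matrix for three variables
cor_mat <- matrix(c(1.0, 0.5, 0.3,
                    0.5, 1.0, 0.2,
                    0.3, 0.2, 1.0), nrow = 3)

# Hypothetical standard deviations on the latent scale
sds <- c(2, 1, 1)

# Covariance matrix implied by the correlations and standard deviations
cov_mat <- diag(sds) %*% cor_mat %*% diag(sds)

# stats::cov2cor() goes back from covariance to correlation
cor_back <- cov2cor(cov_mat)

# The matrix size should match the number of items in cat_prop
cat_prop <- list(1, c(0.25, 0.60, 1), c(0.40, 1))
size_ok <- nrow(cov_mat) == length(cat_prop)
```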
c_mean and c_sd are each vectors whose length is equal to the number of
continuous variables as specified by cat_prop. The default is to keep the
continuous variables with mean zero and standard deviation of one.
theta is a logical indicator that determines if the first continuous item
should be labeled theta. If theta == TRUE but there are no continuous
variables generated, a random number of background variables will be
generated.
If cat_prop is a named list, those names will be used as variable names
for the returned data.frame. Generic names will be provided to the
variables if cat_prop is not named.
As an alternative to providing cat_prop, the user can call this function
by specifying the total number of variables using n_vars or the specific
number of continuous and categorical variables through n_X and n_W. All
three arguments should be provided as scalars; n_W may also be provided as
a list, where each element contains the number of categories for one
background variable. Alternatively, n_W may be provided as a one-element
list, in which case it will be interpreted as all the categorical
variables having the same number of categories.
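The calling patterns described above might look like the following sketch. The argument names come from the Usage section, while the specific values are illustrative assumptions. The call is guarded so that the snippet still runs when lsasim is not installed.

```r
# Three ways to describe the questionnaire without cat_prop
# (values are illustrative, not recommendations).
by_total  <- list(n_obs = 10, n_vars = 4, verbose = FALSE)
by_type   <- list(n_obs = 10, n_X = 2, n_W = 2, verbose = FALSE)
# n_W as a list: one entry per categorical variable,
# giving its number of categories
by_levels <- list(n_obs = 10, n_X = 2, n_W = list(2, 3), verbose = FALSE)

if (requireNamespace("lsasim", quietly = TRUE)) {
  bg <- do.call(lsasim::questionnaire_gen, by_levels)
  str(bg)
}
```

Any argument left unspecified is generated randomly, as explained at the start of this section, so repeated calls can return differently shaped data unless the structure is pinned down.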
If family == "gaussian", the questionnaire will be generated assuming that
all the variables are jointly distributed as a multivariate normal. The
default behavior is family == NULL, where the data is generated using the
polychoric correlation matrix, with no distributional assumptions.
When data is generated using the Gaussian distribution, the matrices
provided correspond to the relations between the latent variable theta,
the continuous covariates X and the continuous covariates Z ~ N(0, 1)
that will later be discretized into categorical covariates W. That is why
there will be a difference in labels and lengths between cov_matrix and
vcov_YXW.
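The discretization of Z into W can be illustrated with base R alone. This is a sketch under the assumption that the cut points are the standard normal quantiles of the cumulative proportions; it is not taken from the lsasim source, and the names cum_prop, cuts, z and w are illustrative.

```r
set.seed(1)

# Cumulative proportions for one hypothetical categorical variable W
cum_prop <- c(0.2, 0.8, 1)

# Latent continuous covariate Z ~ N(0, 1)
z <- rnorm(1000)

# Thresholds on the latent scale: normal quantiles of the cumulative
# proportions, with the final proportion (1) mapped to +Inf
cuts <- c(-Inf, qnorm(cum_prop[-length(cum_prop)]), Inf)

# Discretize Z into the categories of W
w <- cut(z, breaks = cuts, labels = seq_along(cum_prop))

# Observed category shares should be close to 0.2, 0.6 and 0.2
prop.table(table(w))
```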
For more information, check the references cited later in this document.
Value
By default, the function returns a data.frame object where the first
column ("subject") is a 1, ..., n ordered list of the n observations and
the other columns correspond to the questionnaire answers. If
theta = TRUE, the first column after "subject" will be the latent variable
theta; in any case, the continuous variables always come before the
categorical ones.
If full_output = TRUE, the output will be a list containing the following
objects:
bg |
a data frame containing the background questionnaire answers (i.e., the same object as described above). |
c_mean |
identical to the input argument of the same name. Read the Details section for more information. |
c_sd |
identical to the input argument of the same name. Read the Details section for more information. |
cat_prop |
identical to the input argument of the same name. Read the Details section for more information. |
cat_prop_W_p |
a list containing the probabilities for each category
of the categorical variables ( |
cor_matrix |
identical to the input argument of the same name. Read the Details section for more information. |
cov_matrix |
identical to the input argument of the same name. Read the Details section for more information. |
family |
identical to the input argument of the same name. |
n_obs |
identical to the input argument of the same name. |
n_tot |
named vector containing the number of total variables, the
number of continuous background variables (i.e., the total number of
background variables except |
n_W |
vector containing the number of categorical variables. |
n_X |
vector containing the number of continuous variables (except
|
sd_YXW |
vector with the standard deviations of all the variables |
sd_YXZ |
vector containing the standard deviations of |
theta |
identical to the input argument of the same name. |
var_W |
list containing the variances of the categorical variables. |
var_YX |
list containing the variances of the continuous variables
(including |
linear_regression |
This list is printed only if 'theta = TRUE',
'family = "gaussian"' and 'full_output = TRUE'. It contains one vector
named 'betas' and one table named 'cov_YXW'. The former displays the true
linear regression coefficients of |
Note
If family == NULL, the number of levels for each categorical variable
will be determined by the number of categories observed in the generated
data. This means it might be smaller than the number of categories
determined by cat_prop, which is more likely to happen with small values
of n_obs. If family == "gaussian", however, the number of levels for the
categorical variables will always be equivalent to the number of possible
categories, even if they are not observed in the data.
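The difference between the two behaviors can be mimicked with plain base R factors. This sketch does not call lsasim, and the category codes are made up for illustration.

```r
# Hypothetical draws for a 3-category variable in a very small sample:
# category 3 was possible but never drawn.
codes <- c(1, 2, 2, 1)

# Levels inferred from the observed data only
observed_only <- factor(codes)

# Levels fixed in advance to all possible categories
all_possible <- factor(codes, levels = 1:3)

nlevels(observed_only)  # 2
nlevels(all_possible)   # 3
table(all_possible)     # category 3 appears with a count of 0
```

When downstream code expects a fixed number of levels, setting them explicitly as in all_possible avoids depending on which categories happened to be sampled.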
It is important to note that all arguments directly related to variable parameters (e.g. 'cat_prop', 'cov_matrix', 'cor_matrix', 'c_mean', 'c_sd') have the following order: Y, X, W (missing variables are skipped). This must be kept in mind when using real-life data as input to 'questionnaire_gen', as the input might need to be reordered to fit the expectations of the function.
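When the input comes from real data in a different order, a named matrix can be rearranged by indexing rows and columns together. The variable names and values below are hypothetical.

```r
# Hypothetical covariance matrix estimated from real data,
# with the variables in the order W1, theta, X1
vars <- c("W1", "theta", "X1")
S <- matrix(c(1.0, 0.4, 0.3,
              0.4, 1.0, 0.5,
              0.3, 0.5, 1.0),
            nrow = 3, dimnames = list(vars, vars))

# Order expected by the function: Y (theta), then X, then W
target <- c("theta", "X1", "W1")

# Reorder rows and columns at the same time to keep the matrix symmetric
S_ordered <- S[target, target]
```

The same target vector can be used to reorder the other inputs, for example a named cat_prop list, so that all arguments stay aligned.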
By definition, the expected order of the variables is theta, followed by
X and then W. The reference category of the categorical variables W is
always the first one.
For very small means/sigmas (e.g. 0.005) and multiple levels, estimates may have differing levels of accuracy (e.g. school-level estimates will not be as accurate as student-level ones). In general, one should expect less accurate estimation at higher levels of the hierarchy.
References
Matta, T. H., Rutkowski, L., Rutkowski, D., & Liaw, Y. L. (2018). lsasim: an R package for simulating large-scale assessment data. Large-scale Assessments in Education, 6(1), 15.
See Also
beta_gen
Examples
# Using polychoric correlations
props <- list(c(1), c(.25, .6, 1)) # one continuous, one with 3 categories
questionnaire_gen(n_obs = 10, cat_prop = props,
                  cor_matrix = matrix(c(1, .6, .6, 1), nrow = 2),
                  c_mean = 2, c_sd = 1.5, theta = TRUE)
# Using the multivariate normal distribution (family = "gaussian")
# two categorical variables W: one has 2 categories, the other has 3
props <- list(1, c(.25, 1), c(.2, .8, 1))
yw_cov <- matrix(c(1, .5, .5, .5, 1, .8, .5, .8, 1), nrow = 3)
questionnaire_gen(n_obs = 10, cat_prop = props, cov_matrix = yw_cov,
                  family = "gaussian")
# Not providing covariance matrix
questionnaire_gen(n_obs = 10,
                  cat_prop = list(c(.25, 1), c(.6, 1), c(.2, 1)),
                  family = "gaussian")