simContinuous {simPop} | R Documentation |
Simulate continuous variables of population data
Description
Simulate continuous variables of population data using multinomial log-linear models combined with random draws from the resulting categories or (two-step) regression models combined with random error terms. The household structure of the population data and any other categorical predictors need to be simulated beforehand.
Usage
simContinuous(
simPopObj,
additional = "netIncome",
method = c("multinom", "lm", "poisson", "xgboost"),
zeros = TRUE,
breaks = NULL,
lower = NULL,
upper = NULL,
equidist = TRUE,
probs = NULL,
gpd = TRUE,
threshold = NULL,
est = "moments",
limit = NULL,
censor = NULL,
log = TRUE,
const = NULL,
alpha = 0.01,
residuals = TRUE,
keep = TRUE,
maxit = 500,
MaxNWts = 1500,
tol = .Machine$double.eps^0.5,
nr_cpus = NULL,
eps = NULL,
regModel = "basic",
byHousehold = NULL,
imputeMissings = FALSE,
seed,
verbose = FALSE,
by = "strata",
model_params = NULL
)
Arguments
simPopObj |
a |
additional |
a character string specifying the additional continuous
variable of |
method |
a character string specifying the method to be used for
simulating the continuous variable. Accepted values are |
zeros |
a logical indicating whether the variable specified by
|
breaks |
an optional numeric vector; if multinomial models are
computed, this can be used to supply two or more break points for
categorizing the variable specified by |
lower , upper |
optional numeric values; if multinomial models are
computed and |
equidist |
logical; if |
probs |
numeric vector with values in |
gpd |
logical; if |
threshold |
a numeric value; if |
est |
a character string; if |
limit |
an optional named list of lists; if multinomial models are computed, this can be used to account for structural zeros. The names of the list components specify the predictor variables for which to limit the possible outcomes of the response. For each predictor, a list containing the possible outcomes of the response for each category of the predictor can be supplied. The probabilities of other outcomes conditional on combinations that contain the specified categories of the supplied predictors are set to 0. Currently, this is only implemented for more than two categories in the response. |
censor |
an optional named list of lists or |
log |
logical; if |
const |
numeric; if |
alpha |
numeric; if |
residuals |
logical; if |
keep |
logical; if multinomial models are computed, this indicates
whether the simulated categories should be stored as a variable in the
resulting population data. If |
maxit , MaxNWts |
control parameters to be passed to
|
tol |
if |
nr_cpus |
if specified, an integer number defining the number of cpus that should be used for parallel processing. |
eps |
a small positive numeric value, or |
regModel |
specifies the model to be used for simulating the additional continuous variable. The following choices are possible:
|
byHousehold |
if NULL, simulated values are used as is. If either |
imputeMissings |
if TRUE, missing values in variables that are used for the underlying model are imputed using hot-deck imputation. |
seed |
optional; an integer value to be used as the seed of the random number generator, or an integer vector containing the state of the random number generator to be restored. |
verbose |
(logical) if |
by |
the variable used to split the data for estimation. Defaults to the strata variable. |
model_params |
optional additional parameters passed to the model; currently only implemented for xgboost hyperparameters
Details
If method is "lm", the behavior for two-step models is as follows.
If zeros is TRUE and log is not TRUE or the variable specified by additional does not contain negative values, a log-linear model is used to predict whether an observation is zero or not. Then a linear model is used to predict the non-zero values.
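The two-step idea can be illustrated with a small base-R sketch. This is not the code simContinuous runs internally; the data, the variable names (survey, pop, x, y) and the choice to draw error terms by resampling residuals are all assumptions made for illustration.

```r
# Base-R sketch of a two-step model for a variable with many exact zeros.
# All data and object names are made up; this is not simPop's implementation.
set.seed(1)
n <- 500
x <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
is_zero <- rbinom(n, 1, ifelse(x == "a", 0.4, 0.1)) == 1
y <- ifelse(is_zero, 0, rnorm(n, mean = 1000 + 200 * (x == "c"), sd = 100))
survey <- data.frame(x = x, y = y)

# Step 1: binary model for zero vs. non-zero observations
fit_zero <- glm(I(y == 0) ~ x, family = binomial(), data = survey)

# Step 2: linear model fitted to the non-zero observations only
fit_pos <- lm(y ~ x, data = survey, subset = y != 0)

# Hypothetical population for which only the predictor is known
pop <- data.frame(
  x = factor(sample(levels(x), 2000, replace = TRUE), levels = levels(x))
)

# Simulate zeros from the predicted probabilities
p_zero <- predict(fit_zero, newdata = pop, type = "response")
sim_zero <- runif(nrow(pop)) < p_zero

# Simulate non-zero values: fitted value plus a resampled residual
sim_val <- predict(fit_pos, newdata = pop) +
  sample(residuals(fit_pos), nrow(pop), replace = TRUE)

pop$y <- ifelse(sim_zero, 0, sim_val)
```

Resampling residuals is only one way to obtain the random error term; drawing from a fitted normal distribution would be another.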
If zeros is TRUE, log is TRUE and const is specified, again a log-linear model is used to predict whether an observation is zero or not. In the linear model to predict the non-zero values, const is added to the variable specified by additional before the logarithms are taken.
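As a rough illustration of the shifted log transformation, the following base-R sketch adds a constant before taking logarithms and subtracts it again after back-transforming. The data and the value of const are made up, and the back-transformation step is an assumption about how such a model would typically be used, not a statement about simPop's internals.

```r
# Sketch: linear model on log(y + const) for non-zero values, some negative.
# Data and the constant are hypothetical.
set.seed(2)
y <- c(-50, -10, rlnorm(200, meanlog = 7, sdlog = 1))  # non-zero values
x <- rnorm(length(y))                                  # made-up predictor
const <- 100                                           # must make y + const > 0
stopifnot(all(y + const > 0))

# Fit the linear model on the shifted, log-transformed response
fit <- lm(log(y + const) ~ x)

# Simulate on the log scale, then undo the log and the shift
sim_log <- fitted(fit) + sample(residuals(fit), length(y), replace = TRUE)
sim <- exp(sim_log) - const
```

Because exp() is always positive, simulated values from this sketch can never fall below -const.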
If zeros is TRUE, log is TRUE, const is NULL and there are negative values, a multinomial log-linear model is used to predict negative, zero and positive observations. Categories for the negative values are thereby defined by breaks. In the second step, a linear model is used to predict the positive values and negative values are drawn from uniform distributions in the respective classes.
If zeros is FALSE, log is TRUE and const is NULL, a two-step model is used if there are non-positive values in the variable specified by additional. Whether a log-linear or a multinomial log-linear model is used depends on the number of categories to be used for the non-positive values, as defined by breaks. Again, positive values are then predicted with a linear model and non-positive values are drawn from uniform distributions.
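Drawing values from uniform distributions within classes can be sketched in base R as follows. The break points and data are hypothetical, and plain relative frequencies stand in for the class probabilities that a (multinomial) log-linear model would predict.

```r
# Sketch: simulate non-positive values by first drawing a class and then
# drawing uniformly within that class. All values are made up.
set.seed(3)
y <- runif(300, min = -1000, max = 0)   # observed non-positive values
breaks <- c(-1000, -500, -100, 0)       # hypothetical break points

# Assign each observation to a class defined by the break points
cls <- cut(y, breaks = breaks, include.lowest = TRUE)

# Class probabilities (stand-in for model-based predicted probabilities)
p <- as.numeric(prop.table(table(cls)))

# Draw a class index per simulated unit, then a value within its bounds
n_sim <- 1000
sim_cls <- sample(seq_along(p), size = n_sim, replace = TRUE, prob = p)
sim_y <- runif(n_sim, min = breaks[sim_cls], max = breaks[sim_cls + 1])
```

In a model-based version, p would differ per unit depending on its predictor values; the second step would stay the same.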
The number of CPUs is selected automatically as follows. By default, it equals the number of strata. However, if fewer CPUs are available than there are strata, the number of available CPUs minus 1 is used. This should be the best strategy, but the user can override it by setting nr_cpus.
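One possible reading of this selection rule is sketched below. The helper choose_cpus is hypothetical and not part of simPop, and the lower bound of one CPU is an added assumption so that the result is never zero.

```r
# Hypothetical helper illustrating the CPU selection rule; not a simPop function.
choose_cpus <- function(n_strata, n_available) {
  if (n_available < n_strata) {
    # fewer CPUs than strata: leave one CPU free (assumed floor of 1)
    max(n_available - 1L, 1L)
  } else {
    # enough CPUs: one per stratum
    n_strata
  }
}

choose_cpus(n_strata = 9, n_available = 4)  # 3
choose_cpus(n_strata = 3, n_available = 8)  # 3
```

In practice, the number of available CPUs could be obtained with parallel::detectCores(), although whether simPop does so is not stated here.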
Value
An object of class simPopObj containing survey data as well as the simulated population data including the continuous variable specified by additional and possibly simulated categories for the desired continuous variable.
Note
The basic household structure and any other categorical predictors need to be simulated beforehand with the functions simStructure and simCategorical, respectively.
Author(s)
Bernhard Meindl, Andreas Alfons, Alexander Kowarik (based on code by Stefan Kraft), Siro Fritzmann
References
B. Meindl, M. Templ, A. Kowarik, O. Dupriez (2017) Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information. Journal of Statistical Software, 79 (10), 1–38. doi:10.18637/jss.v079.i10
A. Alfons, M. Templ (2011) Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20 (3), 383–407. doi:10.1080/02664763.2013.859237
See Also
simStructure, simCategorical, simComponents, simEUSILC
Examples
data(eusilcS)
## Not run:
## approx. 20 seconds computation time
inp <- specifyInput(data=eusilcS, hhid="db030", hhsize="hsize", strata="db040", weight="db090")
simPop <- simStructure(data=inp, method="direct",
  basicHHvars=c("age", "rb090", "hsize", "pl030", "pb220a"))
regModel <- ~rb090+hsize+pl030+pb220a

# multinomial model with random draws
eusilcM <- simContinuous(simPop, additional="netIncome",
  regModel = regModel,
  upper=200000, equidist=FALSE, nr_cpus=1)
class(eusilcM)

# two-step regression
eusilcT <- simContinuous(simPop, additional="netIncome",
  regModel = "basic",
  method = "lm", nr_cpus=1)
class(eusilcT)
## End(Not run)