| simulate_dataset {WoodSimulatR} | R Documentation |
Generate an artificial dataset with correlated variables
Description
Generate an artificial dataset with correlated variables and defined means and standard deviations.
Usage
simulate_dataset(
n = 5000,
subsets = 4,
random_seed = NULL,
simbase = WoodSimulatR::ws_t_logf,
loadtype = NULL,
...,
RNGversion = "3.6.0"
)
Arguments
n |
Number of rows in the dataset |
subsets |
Either |
random_seed |
Allows setting an integer seed value for the random number
generator to achieve reproducible results
(see also |
simbase |
An object of class |
loadtype |
For passing on to |
... |
arguments passed on to |
RNGversion |
In |
Details
In the package WoodSimulatR, a number of predefined base values for simulation
are stored – see simbase.
Using a character vector for the argument subsets yields subsets
that are as equal in size as possible.
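As a rough illustration of "as equal in size as possible", the following base R sketch splits n rows over k groups so that group sizes differ by at most one. The helper split_evenly is hypothetical and not part of WoodSimulatR; the package may allocate rows differently internally.

```r
# Hypothetical helper (not from WoodSimulatR): distribute n rows over k groups
# so that group sizes differ by at most one row.
split_evenly <- function(n, k) {
  base  <- n %/% k          # rows every group receives
  extra <- n %% k           # leftover rows, given to the first `extra` groups
  base + as.integer(seq_len(k) <= extra)
}

sizes <- split_evenly(10, 3)
print(sizes)
```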
The argument subsets enables differing means and standard deviations
for different subsamples. There are several possible usages:
- If subsets = NULL, the information about means and standard deviations is taken from the simbase. There can still be different means and standard deviations if simbase is an object of class simbase_list.
- If a numeric vector or a character vector, it is used as argument country in an internal call to get_subsample_definitions.
- If a dataset, there are the following requirements:
  - identifier columns: The dataset has to have one or more discrete-valued identifier columns (usually character vectors or factors) which uniquely identify each row. These identifier columns are named "country" and "subsample" in the standard case as yielded by get_subsample_definitions. In the general case, the identifier columns are detected as those columns which are not named share, species, loadtype or literature and which do not end in _mean or _sd. If the argument simbase is of class simbase_list, further restrictions apply (see below).
  - means and standard deviations: For at least one of the variables defined in the simbase, also the mean and the standard deviation need to be given in each row; the column names for this data must be the name of the respective variable(s) from the simbase, suffixed by _mean and _sd, respectively.
  - optional: A column share can be used to create subsamples of different sizes proportional to the values in share.
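The dataset requirements above can be sketched in base R. The example below builds a small subsets data frame, applies the stated naming rule to find the identifier columns, and derives subsample sizes proportional to share. The variable name f, all numeric values, and the use of round() are assumptions made for illustration; the package's own detection and allocation code may differ.

```r
# Illustrative subsets definition; the variable name `f` and all values are assumed.
subsets_df <- data.frame(
  country   = c("at", "de"),
  subsample = c("at_1", "de_1"),
  f_mean    = c(40, 38),
  f_sd      = c(12, 11),
  share     = c(3, 2),
  stringsAsFactors = FALSE
)

# Identifier columns per the rule in the text: not one of the reserved names
# and not ending in _mean or _sd.
reserved <- c("share", "species", "loadtype", "literature")
is_stat  <- grepl("_(mean|sd)$", names(subsets_df))
id_cols  <- names(subsets_df)[!is_stat & !(names(subsets_df) %in% reserved)]

# The identifier columns must uniquely identify each row.
stopifnot(anyDuplicated(subsets_df[id_cols]) == 0)

# Subsample sizes proportional to `share` for n = 10 rows (rounding is an assumption).
n <- 10
rows_per_subsample <- round(n * subsets_df$share / sum(subsets_df$share))
print(id_cols)
print(rows_per_subsample)
```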
The argument simbase can be either an object of class
simbase_covar or of class simbase_list.
- Various predefined simbase_covar objects are available in WoodSimulatR – see simbase.
- For objects of class simbase_list, additional restrictions apply:
  - The object may only have grouping variable(s) which are also identifier columns according to the subsets definition above – if the subsets argument is not a data frame, the identifier columns are "country" and "subsample".
  - The value combinations in the identifier columns have to match those which the subsets argument leads to (see also get_subsample_definitions).
Both the means and standard deviations in the subsample definitions
(see get_subsample_definitions) as well as the values in the
simbase depend on the way the destructive testing of the sawn timber was
done. If the simbase has a field loadtype
(see also simbase_covar), this value is used in the call to
get_subsample_definitions. Otherwise, the loadtype has to be
passed directly to the present function unless no call to
get_subsample_definitions is necessary (this depends on the
value of subsets – see above). If a loadtype has been defined, a variable
loadtype is also created in the resulting dataset for reference.
Negative values in any numeric column of the result dataset are forced to zero.
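The clamping of negative values can be pictured with a minimal base R sketch. The data frame and its column names are made up for illustration, and pmax() is used here as one way to get the described effect; it is not necessarily how the package implements it.

```r
# Made-up result dataset with a few negative values in numeric columns.
sim <- data.frame(
  id = c("a", "b", "c"),
  f  = c(35.2, -1.4, 41.0),
  E  = c(11000, 9500, -20),
  stringsAsFactors = FALSE
)

# Force negative values to zero in every numeric column; other columns are untouched.
num_cols <- vapply(sim, is.numeric, logical(1))
sim[num_cols] <- lapply(sim[num_cols], function(x) pmax(x, 0))
print(sim)
```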
If random_seed is not NULL, reproducibility of results
is enforced by using set.seed with arguments
kind='Mersenne-Twister' and normal.kind='Inversion',
and by calling RNGversion with argument RNGversion.
If random_seed is not NULL, the random number generator
is reset at the end of the function using set.seed(NULL) and
RNGversion(toString(getRversion())).
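The seeding and reset behaviour described above can be sketched with base R alone. The wrapper with_fixed_seed is a hypothetical helper, not a WoodSimulatR function, and the order of the RNGversion and set.seed calls is an assumption.

```r
# Hypothetical wrapper (not from WoodSimulatR) mimicking the described RNG handling.
with_fixed_seed <- function(seed, fun, rng_version = "3.6.0") {
  # Pin the RNG behaviour to a given R version, then seed with explicit kinds.
  RNGversion(rng_version)
  set.seed(seed, kind = "Mersenne-Twister", normal.kind = "Inversion")
  # Reset the generator when leaving the function, as the text describes.
  on.exit({
    set.seed(NULL)
    RNGversion(toString(getRversion()))
  })
  fun()
}

a <- with_fixed_seed(1, function() rnorm(3))
b <- with_fixed_seed(1, function() rnorm(3))
print(identical(a, b))
```

Because the generator is reset on exit, random numbers drawn after such a call are not tied to the seed used inside it.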
Examples
simulate_dataset(n = 10, subsets = 1, random_seed = 1)
# As the loadtype is defined in the simbase, the argument loadtype is ignored
# with a warning
simulate_dataset(n = 10, subsets = 1, random_seed = 1, loadtype = 'be')
# Two subsamples
simulate_dataset(n = 10, subsets = 2, random_seed = 1)
# Two subsamples from pre-defined countries
simulate_dataset(n = 10, subsets = c('at', 'de'), random_seed = 1)
# Two subsamples from pre-defined countries with different sample sizes
simulate_dataset(n = 10, subsets = c(at = 3, de = 2), random_seed = 1)