data_sim {gratia} | R Documentation |
Simulate example data for fitting GAMs
Description
A tidy reimplementation of the functions implemented in mgcv::gamSim()
that can be used to fit GAMs. A new feature is that the sampling
distribution can be applied to all the example types.
Usage
data_sim(
model = "eg1",
n = 400,
scale = NULL,
theta = 3,
power = 1.5,
dist = c("normal", "poisson", "binary", "negbin", "tweedie", "gamma", "ocat",
"ordered categorical"),
n_cat = 4,
cuts = c(-1, 0, 5),
seed = NULL,
gfam_families = c("binary", "tweedie", "normal")
)
Arguments
model |
character; either |
n |
numeric; the number of observations to simulate. |
scale |
numeric; the level of noise to use. |
theta |
numeric; the dispersion parameter |
power |
numeric; the Tweedie power parameter. |
dist |
character; a sampling distribution for the response
variable. |
n_cat |
integer; the number of categories for a categorical response.
Currently only used for |
cuts |
numeric; vector of cut points on the latent variable, excluding
the end points |
seed |
numeric; the seed for the random number generator. Passed to
|
gfam_families |
character; a vector of distributions to use in
generating data with grouped families for use with |
Details
data_sim() can simulate data from several underlying models of
known true functions. The available options currently are:
- "eg1": a four term additive true model. This is the classic Gu & Wahba four univariate term test model. See gw_functions for more details of the underlying four functions.
- "eg2": a bivariate smooth true model.
- "eg3": an example containing a continuous by smooth (varying coefficient) true model. The model is \hat{y}_i = f_2(x_{1i})x_{2i} where the function f_2() is f_2(x) = 0.2 * x^{11} * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^{10}.
- "eg4": a factor by smooth true model. The true model contains a factor with 3 levels, where the response for the nth level follows the nth Gu & Wahba function (for n \in {1, 2, 3}).
- "eg5": an additive plus factor true model. The response is a linear combination of the Gu & Wahba functions 2, 3, 4 (the latter is a null function) plus a factor term with four levels.
- "eg6": an additive plus random effect term true model.
- "eg7": a version of the model in "eg1", but where the covariates are correlated.
- "gwf2": a model where the response is Gu & Wahba's f_2(x_i) plus noise.
- "lwf6": a model where the response is Luo & Wahba's "example 6" function sin(2(4x-2)) + 2 exp(-256(x-0.5)^2) plus noise.
- "gfam": simulates data for use with GAMs with family = gfam(families). See the example in mgcv::gfam(). If this model is specified then dist is ignored and gfam_families is used to specify which distributions are included in the simulated data. gfam_families can be a vector of any of the families allowed by dist. For "ocat" %in% gfam_families (or "ordered categorical"), 4 classes are assumed, which can't be changed. Link functions used are "identity" for "normal", "logit" for "binary", "ocat", and "ordered categorical", and "exp" elsewhere.
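The formulas quoted for "eg3" and "lwf6" can be checked directly in base R. The following is a minimal sketch, not the package's implementation: the helper names f2 and lwf6, the uniform covariates on [0, 1], and the Gaussian noise with unit standard deviation are all assumptions made for illustration.

```r
# Gu & Wahba f_2, transcribed from the formula given for "eg3"
f2 <- function(x) {
  0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
}

# Luo & Wahba "example 6" function, transcribed from the "lwf6" entry
lwf6 <- function(x) {
  sin(2 * (4 * x - 2)) + 2 * exp(-256 * (x - 0.5)^2)
}

set.seed(42)
n <- 200
x1 <- runif(n)  # assumption: covariates are uniform on [0, 1]
x2 <- runif(n)

# Varying coefficient truth: f_2(x1) acts as the coefficient of x2
mu <- f2(x1) * x2
y <- mu + rnorm(n, sd = 1)  # assumption: Gaussian noise with unit sd

sim <- data.frame(y = y, x1 = x1, x2 = x2)
head(sim)
```

Both functions are plain vectorised expressions, so they can be evaluated on a grid, for example lwf6(seq(0, 1, length.out = 100)), to inspect the true curve before noise is added.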
The random component providing noise or sampling variation can follow one
of the following distributions, specified via argument dist:
- "normal": Gaussian,
- "poisson": Poisson,
- "binary": Bernoulli,
- "negbin": Negative binomial,
- "tweedie": Tweedie,
- "gamma": gamma, and
- "ordered categorical": ordered categorical
Other arguments provide the parameters for the distribution.
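As a rough illustration of how a sampling distribution turns a true function into a response, the base R sketch below draws Poisson, Bernoulli, and ordered categorical responses from a made-up linear predictor eta. It is not taken from data_sim() itself: eta, the logistic noise on the latent scale, and all object names are assumptions. Only the default cuts and the logit and exp links come from this page.

```r
set.seed(1)
n <- 200

# Hypothetical linear predictor standing in for the true function values
eta <- rnorm(n)

# Poisson response: mean obtained by exponentiating the linear predictor
y_pois <- rpois(n, lambda = exp(eta))

# Bernoulli response: success probability via the inverse logit
y_bin <- rbinom(n, size = 1, prob = plogis(eta))

# Ordered categorical response: interior cut points on a latent variable.
# Three cut points give four categories, matching n_cat = 4 in Usage.
cuts <- c(-1, 0, 5)
latent <- eta + rlogis(n)  # assumption: logistic noise on the latent scale
y_ocat <- findInterval(latent, cuts) + 1L  # categories 1, ..., length(cuts) + 1

table(y_ocat)
```

Because cuts excludes the end points, the number of categories is always one more than the number of cut points, which is why the defaults pair n_cat = 4 with three cuts.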
References
Gu, C., Wahba, G., (1993). Smoothing Spline ANOVA with Component-Wise Bayesian "Confidence Intervals." J. Comput. Graph. Stat. 2, 97–117.
Luo, Z., Wahba, G., (1997). Hybrid adaptive splines. J. Am. Stat. Assoc. 92, 107–116.
Examples
data_sim("eg1", n = 100, seed = 1)
# an ordered categorical response
data_sim("eg1", n = 100, dist = "ocat", n_cat = 4, cuts = c(-1, 0, 5))