| data_sim {gratia} | R Documentation |
Simulate example data for fitting GAMs
Description
A tidy reimplementation of the example simulations in mgcv::gamSim(),
producing data that can be used to fit GAMs. A new feature is that the
sampling distribution can be applied to all the example types.
Usage
data_sim(
model = "eg1",
n = 400,
scale = NULL,
theta = 3,
power = 1.5,
dist = c("normal", "poisson", "binary", "negbin", "tweedie", "gamma", "ocat",
"ordered categorical"),
n_cat = 4,
cuts = c(-1, 0, 5),
seed = NULL,
gfam_families = c("binary", "tweedie", "normal")
)
Arguments
model |
character; one of "eg1", "eg2", "eg3", "eg4", "eg5", "eg6", "eg7", "gwf2", "lwf6", or "gfam". See Details. |
n |
numeric; the number of observations to simulate. |
scale |
numeric; the level of noise to use. |
theta |
numeric; the dispersion parameter |
power |
numeric; the Tweedie power parameter. |
dist |
character; a sampling distribution for the response
variable. |
n_cat |
integer; the number of categories for a categorical response.
Currently only used for dist = "ocat" (or "ordered categorical"). |
cuts |
numeric; vector of cut points on the latent variable, excluding
the end points |
seed |
numeric; the seed for the random number generator. Passed to
|
gfam_families |
character; a vector of distributions to use in
generating data with grouped families for use with mgcv::gfam(). |
Details
data_sim() can simulate data from several underlying models of
known true functions. The available options currently are:
- "eg1": a four term additive true model. This is the classic Gu & Wahba four univariate term test model. See gw_functions for more details of the underlying four functions.
- "eg2": a bivariate smooth true model.
- "eg3": an example containing a continuous by smooth (varying coefficient) true model. The model is \hat{y}_i = f_2(x_{1i}) x_{2i}, where the function f_2() is f_2(x) = 0.2 * x^{11} * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^{10}.
- "eg4": a factor by smooth true model. The true model contains a factor with 3 levels, where the response for the nth level follows the nth Gu & Wahba function (for n \in \{1, 2, 3\}).
- "eg5": an additive plus factor true model. The response is a linear combination of the Gu & Wahba functions 2, 3, 4 (the latter is a null function) plus a factor term with four levels.
- "eg6": an additive plus random effect term true model.
- "eg7": a version of the model in "eg1", but where the covariates are correlated.
- "gwf2": a model where the response is Gu & Wahba's f_2(x_i) plus noise.
- "lwf6": a model where the response is Luo & Wahba's "example 6" function sin(2(4x-2)) + 2 exp(-256(x-0.5)^2) plus noise.
- "gfam": simulates data for use with GAMs with family = gfam(families). See the example in mgcv::gfam(). If this model is specified then dist is ignored and gfam_families is used to specify which distributions are included in the simulated data. gfam_families can be a vector of any of the families allowed by dist. For "ocat" %in% gfam_families (or "ordered categorical"), 4 classes are assumed, which can't be changed. Link functions used are "identity" for "normal", "logit" for "binary", "ocat", and "ordered categorical", and "exp" elsewhere.
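As a rough illustration of the "gwf2" style of model, the sketch below uses base R only to evaluate the f_2() formula quoted for "eg3" and add noise to it. It is not the code used by data_sim(): the uniform covariate on [0, 1], the Gaussian noise with standard deviation 1, the seed, and the object names f2, x and y are all assumptions made for illustration.

```r
# Gu & Wahba's f_2, transcribed from the formula given for "eg3" above
f2 <- function(x) {
  0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
}

set.seed(1)                    # arbitrary seed, for reproducibility only
n <- 200
x <- runif(n)                  # assumption: covariate is uniform on [0, 1]
y <- f2(x) + rnorm(n, sd = 1)  # assumption: Gaussian noise with sd = 1

# The function vanishes at both ends of the unit interval
f2(c(0, 1))
```

For actual use, data_sim("gwf2", ...) should be preferred, as it also handles the other sampling distributions.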
The random component providing noise or sampling variation can follow one
of the following distributions, specified via argument dist:
- "normal": Gaussian,
- "poisson": Poisson,
- "binary": Bernoulli,
- "negbin": negative binomial,
- "tweedie": Tweedie,
- "gamma": gamma, and
- "ocat" or "ordered categorical": ordered categorical.
Other arguments provide the parameters for the distribution.
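A short sketch of how the distribution arguments from the Usage section might be combined with dist is shown below. It assumes gratia is installed; the object names nb and tw, the sample size and the seed are arbitrary choices for illustration, and the values of theta and power are simply the defaults listed in Usage.

```r
library(gratia)

# Negative binomial response; theta is the dispersion parameter
nb <- data_sim("eg1", n = 100, dist = "negbin", theta = 3, seed = 1)

# Tweedie response; power is the Tweedie power parameter
tw <- data_sim("eg1", n = 100, dist = "tweedie", power = 1.5, seed = 1)

# Each call simulates n observations
nrow(nb)
nrow(tw)
```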
References
Gu, C., Wahba, G. (1993). Smoothing Spline ANOVA with Component-Wise Bayesian "Confidence Intervals." J. Comput. Graph. Stat. 2, 97–117.
Luo, Z., Wahba, G. (1997). Hybrid adaptive splines. J. Am. Stat. Assoc. 92, 107–116.
Examples
data_sim("eg1", n = 100, seed = 1)
# an ordered categorical response
data_sim("eg1", n = 100, dist = "ocat", n_cat = 4, cuts = c(-1, 0, 5))