vgam {VGAM}    R Documentation
Fitting Vector Generalized Additive Models
Description
Fit a vector generalized additive model (VGAM). Both 1st-generation VGAMs (based on backfitting) and 2nd-generation VGAMs (based on P-splines, with automatic smoothing parameter selection) are implemented. This is a large class of models that includes generalized additive models (GAMs) and vector generalized linear models (VGLMs) as special cases.
Usage
vgam(formula,
     family = stop("argument 'family' needs to be assigned"),
     data = list(), weights = NULL, subset = NULL,
     na.action, etastart = NULL, mustart = NULL,
     coefstart = NULL, control = vgam.control(...),
     offset = NULL, method = "vgam.fit", model = FALSE,
     x.arg = TRUE, y.arg = TRUE, contrasts = NULL,
     constraints = NULL, extra = list(), form2 = NULL,
     qr.arg = FALSE, smart = TRUE, ...)
Arguments
formula: a symbolic description of the model to be fit. The RHS of the formula is applied to each linear/additive predictor, and should include at least one s, sm.os or sm.ps term.
family: Same as for vglm.
data: an optional data frame containing the variables in the model. By default the variables are taken from environment(formula), typically the environment from which vgam is called.
weights, subset, na.action: Same as for vglm.
etastart, mustart, coefstart: Same as for vglm.
control: a list of parameters for controlling the fitting process. See vgam.control for details.
method: the method to be used in fitting the model. The default (and presently only) method vgam.fit() uses iteratively reweighted least squares (IRLS).
constraints, model, offset: Same as for vglm.
x.arg, y.arg: logical values indicating whether the model matrix and response vector/matrix used in the fitting process should be assigned in the x and y slots.
contrasts, extra, form2, qr.arg, smart: Same as for vglm.
...: further arguments passed into vgam.control.
Details
A vector generalized additive model (VGAM) is loosely defined as a statistical model that is a function of M additive predictors. The central formula is given by

  \eta_j = \sum_{k=1}^p f_{(j)k}(x_k)

where x_k is the kth explanatory variable (almost always x_1 = 1 for the intercept term), and f_{(j)k} are smooth functions of x_k that are estimated by smoothers. The first term in the summation is just the intercept.
Currently two types of smoothers are implemented: s represents the older and more traditional one, called a vector (cubic smoothing spline) smoother, and is based on Yee and Wild (1996); it is more similar to the R package gam. The newer smoothers are represented by sm.os and sm.ps; these are based on O-splines and P-splines, allow automatic smoothing parameter selection, and are more similar to the R package mgcv.
In the above, j = 1, \ldots, M where M is finite. If all the functions are constrained to be linear then the resulting model is a vector generalized linear model (VGLM). VGLMs are best fitted with vglm.
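As an illustrative sketch of this special case (using the built-in pneumo data; whether vgam() accepts a purely linear formula, and the exact numerical agreement, are assumptions here rather than documented guarantees, and vglm() remains the recommended function for this case), a vgam() call containing only linear terms should essentially reproduce the corresponding vglm() fit:

pneumo <- transform(pneumo, let = log(exposure.time))
fit.add <- vgam(cbind(normal, mild, severe) ~ let,  # no smooth terms at all
                cumulative(parallel = TRUE), data = pneumo)
fit.lin <- vglm(cbind(normal, mild, severe) ~ let,
                cumulative(parallel = TRUE), data = pneumo)
max(abs(coef(fit.add) - coef(fit.lin)))  # expected to be essentially zero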
Vector (cubic smoothing spline) smoothers are represented by s() (see s). Local regression via lo() is not supported. The results of vgam will differ from gam() (in the gam package) because vgam() uses a different knot selection algorithm. In general, fewer knots are chosen because the computation becomes expensive when the number of additive predictors M is large.
Second-generation VGAMs are based on O-splines and P-splines; the latter are due to Eilers and Marx (1996). Backfitting is not required, and estimation is performed using IRLS. The function sm.os represents a smart implementation of O-splines, and the function sm.ps represents a smart implementation of P-splines.
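A brief sketch contrasting the two generations of smoothers, using the built-in hunua data (the df value is an illustrative choice, not a recommendation):

fit.g1 <- vgam(agaaus ~ s(altitude, df = 3),  # G1-VGAM: vector spline, backfitting
               binomialff, data = hunua)
fit.g2 <- vgam(agaaus ~ sm.ps(altitude),      # G2-VGAM: P-spline, automatic
               binomialff, data = hunua)      # smoothing parameter selection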
This methodology, written G2-VGAMs or P-VGAMs, should not be used unless the sample size is reasonably large. Usually a UBRE predictive criterion is optimized (at each IRLS iteration) because the scale parameter for VGAMs is usually assumed to be known. This search for optimal smoothing parameters does not always converge, nor is it totally reliable. G2-VGAMs implicitly set criterion = "coefficients" so that convergence occurs when the change in the regression coefficients between two IRLS iterations is sufficiently small. Otherwise the search for the optimal smoothing parameters might cause the log-likelihood to decrease between two IRLS iterations.
Currently outer iteration is implemented by default, rather than performance iteration, because the latter is more prone to converging to a local solution; see Wood (2004) for details. One can use performance iteration by setting Maxit.outer = 1 in vgam.control.
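A minimal sketch of requesting performance iteration for a G2-VGAM (the hunua data and the sm.ps term are illustrative choices):

fit.perf <- vgam(agaaus ~ sm.ps(altitude), binomialff, data = hunua,
                 control = vgam.control(Maxit.outer = 1))  # performance iteration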
The underlying algorithm of VGAMs is IRLS. First-generation VGAMs (called G1-VGAMs) are estimated by modified vector backfitting using vector splines. O-splines are used as the basis functions for the vector (smoothing) splines, which are a lower-dimensional version of natural B-splines. The function vgam.fit() actually does the work. The smoothing code is based on F. O'Sullivan's BART code.
A closely related methodology based on VGAMs, called constrained additive ordination (CAO), first forms a linear combination of the explanatory variables (called latent variables) and then fits a GAM to these. This is implemented in the function cao for a very limited choice of family functions.
Value
For G1-VGAMs and G2-VGAMs, an object of class "vgam" or "pvgam" respectively (see vgam-class and pvgam-class for further information).
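As an illustrative check (the hunua data and the df value are arbitrary; the classes shown simply follow from the description above), the class of the returned object depends on the type of smooth term used:

class(vgam(agaaus ~ s(altitude, df = 2), binomialff, data = hunua))  # "vgam"
class(vgam(agaaus ~ sm.ps(altitude), binomialff, data = hunua))      # "pvgam"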
WARNING
For G1-VGAMs, vgam can currently only handle constraint matrices cmat, say, such that crossprod(cmat) is diagonal. Whether a fitted model violates this can be detected by is.buggy. VGAMs with constraint matrices that have non-orthogonal columns should be fitted with sm.os or sm.ps terms instead of s. See the warnings in vglm.control.
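A short sketch of this check applied to a G1-VGAM (the model below is simply the one from the Examples section):

pneumo <- transform(pneumo, let = log(exposure.time))
fit <- vgam(cbind(normal, mild, severe) ~ s(let),
            cumulative(parallel = TRUE), data = pneumo)
is.buggy(fit)  # FALSE suggests the constraint matrices satisfy the restriction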
Note
This function can fit a wide variety of statistical models. Some of these are harder to fit than others because of inherent numerical difficulties associated with some of them. Successful model fitting benefits from cumulative experience. Varying the values of arguments in the VGAM family function itself is a good first step if difficulties arise, especially if initial values can be inputted. A second, more general, step is to vary the values of arguments in vgam.control. A third step is to make use of arguments such as etastart, coefstart and mustart.
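As a hedged sketch of the third step, initial values for the linear predictors can be taken from a simpler parametric fit and supplied through etastart (the models below and the use of predict() for this purpose are illustrative assumptions, not a prescribed recipe):

pneumo <- transform(pneumo, let = log(exposure.time))
fit0 <- vglm(cbind(normal, mild, severe) ~ let,     # simpler parametric fit
             cumulative(parallel = TRUE), data = pneumo)
fit1 <- vgam(cbind(normal, mild, severe) ~ s(let),  # additive fit, warm-started
             cumulative(parallel = TRUE), data = pneumo,
             etastart = predict(fit0))              # n x M matrix of linear predictors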
Some VGAM family functions end in "ff" to avoid interference with other functions, e.g., binomialff, poissonff. This is because VGAM family functions are incompatible with glm (and also with gam() in gam and gam in mgcv).
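For example (a small sketch; the commented-out call is assumed to fail because stats::binomial() does not return a VGAM family function):

fit.ok <- vgam(agaaus ~ s(altitude, df = 2), binomialff, data = hunua)  # VGAM family: works
# vgam(agaaus ~ s(altitude, df = 2), binomial, data = hunua)            # glm-style family: fails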
The smart prediction (smartpred) library is bundled with the VGAM package. The theory behind the scaling parameter is currently being made more rigorous, but it should give the same value as the scale parameter for GLMs.
Author(s)
Thomas W. Yee
References
Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. J. Amer. Statist. Assoc., 99(467): 673–686.
Yee, T. W. and Wild, C. J. (1996). Vector generalized additive models. Journal of the Royal Statistical Society, Series B, Methodological, 58, 481–493.
Yee, T. W. (2008). The VGAM Package. R News, 8, 28–39.
Yee, T. W. (2015). Vector Generalized Linear and Additive Models: With an Implementation in R. New York, USA: Springer.
Yee, T. W. (2016). Comments on “Smoothing parameter and model selection for general smooth models” by Wood, S. N., Pya, N. and Säfken, B. J. Amer. Statist. Assoc., 111(516).
See Also
is.buggy, vgam.control, vgam-class, vglmff-class, plotvgam, summaryvgam, summarypvgam, sm.os, sm.ps, s, magic, vglm, vsmooth.spline, cao.
Examples
# Nonparametric proportional odds model
pneumo <- transform(pneumo, let = log(exposure.time))
vgam(cbind(normal, mild, severe) ~ s(let),
cumulative(parallel = TRUE), data = pneumo, trace = TRUE)
# Nonparametric logistic regression
hfit <- vgam(agaaus ~ s(altitude, df = 2), binomialff, hunua)
## Not run: plot(hfit, se = TRUE)
phfit <- predict(hfit, type = "terms", raw = TRUE, se = TRUE)
names(phfit)
head(phfit$fitted)
head(phfit$se.fit)
phfit$df
phfit$sigma
# Fit two species simultaneously
hfit2 <- vgam(cbind(agaaus, kniexc) ~ s(altitude, df = c(2, 3)),
binomialff(multiple.responses = TRUE), data = hunua)
coef(hfit2, matrix = TRUE) # Not really interpretable
## Not run:
plot(hfit2, se = TRUE, overlay = TRUE, lcol = 3:4, scol = 3:4)
ooo <- with(hunua, order(altitude))
with(hunua, matplot(altitude[ooo], fitted(hfit2)[ooo, ],
     ylim = c(0, 0.8), las = 1, type = "l", lwd = 2,
     xlab = "Altitude (m)", ylab = "Probability of presence",
     main = "Two plant species' response curves"))
with(hunua, rug(altitude))
## End(Not run)
# The 'subset' argument does not work here. Use subset() instead.
set.seed(1)
zdata <- data.frame(x2 = runif(nn <- 500))
zdata <- transform(zdata, y = rbinom(nn, 1, 0.5))
zdata <- transform(zdata, subS = runif(nn) < 0.7)
sub.zdata <- subset(zdata, subS) # Use this instead
if (FALSE)
fit4a <- vgam(cbind(y, y) ~ s(x2, df = 2),
binomialff(multiple.responses = TRUE),
data = zdata, subset = subS) # This fails!!!
fit4b <- vgam(cbind(y, y) ~ s(x2, df = 2),
binomialff(multiple.responses = TRUE),
data = sub.zdata) # This succeeds!!!
fit4c <- vgam(cbind(y, y) ~ sm.os(x2),
binomialff(multiple.responses = TRUE),
data = sub.zdata) # This succeeds!!!
## Not run: par(mfrow = c(2, 2))
plot(fit4b, se = TRUE, shade = TRUE, shcol = "pink")
plot(fit4c, se = TRUE, shade = TRUE, shcol = "pink")
## End(Not run)