geoGAM {geoGAM}    R Documentation
Select sparse geoadditive model
Description
Selects a parsimonious geoadditive model from a large set of covariates with the aim of (spatial) prediction.
Usage
geoGAM(response, covariates = names(data)[!(names(data) %in% c(response, coords))],
       data, coords = NULL, weights = rep(1, nrow(data)),
       offset = TRUE, max.stop = 300, non.stationary = FALSE,
       sets = NULL, seed = NULL, validation.data = NULL,
       verbose = 0, cores = min(detectCores(), 10))
Arguments
response
    name of the response as character. Response types currently supported: gaussian, binary, ordered.

covariates
    character vector of all covariates (factor, continuous). If not given, all columns of data except response and coords are used.

data
    data frame containing response, coordinates and covariates.

coords
    character vector of column names indicating the spatial coordinates.

weights
    weights used for model fitting.

offset
    logical, use an offset for the componentwise gradient boosting algorithm.

max.stop
    maximal number of boosting iterations.

non.stationary
    logical, include non-stationary effects in model selection. This allows for spatially varying coefficients of continuous covariates, but increases the computational effort.

sets
    predefined cross-validation sets.

seed
    random seed for splitting the cross-validation sets, used if no sets are given.

validation.data
    data frame containing response, coordinates and covariates, used to compute independent validation statistics. This data set is used only to calculate predictive performance at the end of model selection.

verbose
    should screen output be generated? 0 = none, > 0 creates output.

cores
    number of cores to be used for parallel computing.
Details
Summary
geoGAM models smooth nonlinear relations between responses and single covariates and combines these model terms additively. Residual spatial autocorrelation is captured by a smooth function of the spatial coordinates, and nonstationary effects are included through interactions between covariates and smooth spatial functions. The core of the fully automated model building in geoGAM is componentwise gradient boosting. The model selection procedure aims at obtaining sparse models whose modelled relationships can be checked for plausibility (Nussbaum et al. 2017).
geoGAM currently models continuous, binary and ordinal responses and copes with numerous continuous and categorical covariates.
Generic model representation
GAMs expand the (possibly transformed) conditional expectation of a response Y(\mathbf{s}) at locations \mathbf{s}, given covariates \mathbf{x}(\mathbf{s}), as the additive series

g\left(\mathrm{E}[Y(\mathbf{s})\,|\,\mathbf{x}(\mathbf{s})]\right) = \nu + f(\mathbf{x}(\mathbf{s})) = \nu + \sum_{j} f_{j}(x_{j}(\mathbf{s})),

where \nu is a constant, the f_{j}(x_{j}(\mathbf{s})) are linear terms or unspecified "smooth" nonlinear functions of single covariates x_{j}(\mathbf{s}) (e.g. smoothing spline, kernel or any other scatterplot smoother) and g(\cdot) is a link function. A generalized additive model (GAM) is based on the following components (Hastie and Tibshirani 1990, Chapt. 6):
- Response distribution: Given \mathbf{x}(\mathbf{s}) = (x_1(\mathbf{s}), x_2(\mathbf{s}), \ldots, x_p(\mathbf{s})), the Y(\mathbf{s}) are conditionally independent observations from simple exponential family distributions.

- Link function: g(\cdot) relates the expectation \mu(\mathbf{x}(\mathbf{s})) = \mathrm{E}[Y(\mathbf{s})\,|\,\mathbf{x}(\mathbf{s})] of the response distribution to the additive predictor \nu + \sum_{j} f_{j}(x_{j}(\mathbf{s})).
geoGAM extends GAM by allowing a more complex form of the additive predictor (Kneib et al. 2009, Hothorn et al. 2011): First, one can add a smooth function f_{\mathbf{s}}(\mathbf{s}) of the spatial coordinates (a smooth spatial surface) to the additive predictor to account for residual autocorrelation. More complex relationships between Y and \mathbf{x} can be modelled by adding terms like f_{j}(x_{j}(\mathbf{s})) \cdot f_{k}(x_{k}(\mathbf{s})) – capturing the effect of interactions between covariates – and f_{\mathbf{s}}(\mathbf{s}) \cdot f_{j}(x_{j}(\mathbf{s})) – accounting for spatially changing dependence between Y and \mathbf{x}. Hence, in its full generality, a generalized additive model for spatial data is represented by
g(\mu(\mathbf{x}(\mathbf{s}))) = \nu + f(\mathbf{x}(\mathbf{s}))
= \nu + \underbrace{\sum_{u} f_{j_{u}}(x_{j_{u}}(\mathbf{s})) + \sum_{v} f_{j_{v}}(x_{j_{v}}(\mathbf{s})) \cdot f_{k_{v}}(x_{k_{v}}(\mathbf{s}))}_{\text{global marginal and interaction effects}}
+ \underbrace{\sum_{w} f_{\mathbf{s}_{w}}(\mathbf{s}) \cdot f_{j_{w}}(x_{j_{w}}(\mathbf{s}))}_{\text{nonstationary effects}}
+ \underbrace{f_{\mathbf{s}}(\mathbf{s})}_{\text{autocorrelation}}.
Kneib et al. (2009) called the above equation a geoadditive model, a name coined earlier by Kammann and Wand (2003) for a combination of additive models with a geostatistical error model.
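In model formula notation such a geoadditive predictor can be sketched directly with mgcv. The following minimal sketch uses simulated stand-in data (all names are assumptions for illustration); it shows the model structure only, not the automated selection performed by geoGAM:

library(mgcv)

## simulated stand-in data: response y, covariates x1, x2, coordinates cx, cy
set.seed(1)
d <- data.frame(x1 = runif(200), x2 = runif(200),
                cx = runif(200), cy = runif(200))
d$y <- sin(2 * pi * d$x1) + 0.5 * d$x2 + 0.3 * d$cx * d$x1 + rnorm(200, sd = 0.2)

fit <- gam(y ~ s(x1) + s(x2)        # global marginal effects
             + te(cx, cy)           # autocorrelation: smooth spatial surface
             + te(cx, cy, by = x1), # nonstationary (spatially varying) effect of x1
           data = d, family = gaussian)
summary(fit)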
It remains to specify which response distributions and link functions should be used for the various response types: For (possibly transformed) continuous responses one often uses a normal response distribution combined with the identity link

g\left(\mu(\mathbf{x}(\mathbf{s}))\right) = \mu(\mathbf{x}(\mathbf{s})).

For binary data (coded as 0 and 1), one assumes a Bernoulli distribution and often uses a logit link

g\left(\mu(\mathbf{x}(\mathbf{s}))\right) = \log\left(\frac{\mu(\mathbf{x}(\mathbf{s}))}{1-\mu(\mathbf{x}(\mathbf{s}))}\right),

where

\mu(\mathbf{x}(\mathbf{s})) = \mathrm{Prob}[Y(\mathbf{s})=1\,|\,\mathbf{x}(\mathbf{s})] = \frac{\exp(\nu + f(\mathbf{x}(\mathbf{s})))}{1+\exp(\nu + f(\mathbf{x}(\mathbf{s})))}.
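A quick numerical check of the last identity in R, with eta standing for a hypothetical value of \nu + f(\mathbf{x}(\mathbf{s})):

eta <- 0.7                                         # hypothetical nu + f(x(s))
all.equal(plogis(eta), exp(eta) / (1 + exp(eta)))  # TRUE: plogis is the inverse logit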
For ordinal data, with ordered response levels 1, 2, \ldots, k, the cumulative logit or proportional odds model (Tutz 2012, Sect. 9.1) is used. For any given level r \in \{1, 2, \ldots, k\}, the logarithm of the odds of the event Y(\mathbf{s}) \leq r \,|\, \mathbf{x}(\mathbf{s}) is then modelled by

\log\left(\frac{\mathrm{Prob}[Y(\mathbf{s}) \leq r \,|\, \mathbf{x}(\mathbf{s})]}{\mathrm{Prob}[Y(\mathbf{s}) > r \,|\, \mathbf{x}(\mathbf{s})]}\right) = \nu_{r} + f(\mathbf{x}(\mathbf{s})),

with \nu_{r} a sequence of level-specific constants satisfying \nu_{1} \leq \nu_{2} \leq \ldots \leq \nu_{k}. Conversely,

\mathrm{Prob}[Y(\mathbf{s}) \leq r \,|\, \mathbf{x}(\mathbf{s})] = \frac{\exp(\nu_{r} + f(\mathbf{x}(\mathbf{s})))}{1+\exp(\nu_{r} + f(\mathbf{x}(\mathbf{s})))}.

Note that \mathrm{Prob}[Y(\mathbf{s}) \leq r \,|\, \mathbf{x}(\mathbf{s})] depends on r only through the constant \nu_{r}. Hence, the ratio of the odds of the two events Y(\mathbf{s}) \leq r \,|\, \mathbf{x}(\mathbf{s}) and Y(\mathbf{s}) \leq r \,|\, \tilde{\mathbf{x}}(\mathbf{s}) is the same for all r (Tutz 2012, p. 245).
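geoGAM fits such proportional odds models via MASS::polr (see See Also). A minimal sketch on the built-in housing data from MASS:

library(MASS)

## cumulative logit (proportional odds) fit; Sat has ordered levels Low < Medium < High
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
summary(fit)                        # coefficients f(x) and level-specific intercepts nu_r
head(predict(fit, type = "probs"))  # Prob[Y = r | x] for each level r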
Model building (selection of covariates)

To build parsimonious models that can readily be checked for agreement with the understanding of the analysed subject, the following steps 1–6 are implemented in geoGAM to achieve sparse models in a fully automated way. In several of these steps tuning parameters are optimized by 10-fold cross-validation with fixed subsets, using root mean squared error (RMSE, continuous responses), Brier score (BS, binary responses) or ranked probability score (RPS, ordinal responses) as optimization criterion (see Wilks 2011); a minimal sketch of these criteria follows below. To improve the stability of the algorithm, continuous covariates are first scaled (by the difference of maximum and minimum value) and centred.
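The three criteria as plain R helper functions (own sketch, not geoGAM internals):

## RMSE for continuous responses
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

## Brier score for binary responses: y in {0, 1}, p = predicted Prob[Y = 1]
brier <- function(y, p) mean((p - y)^2)

## ranked probability score for ordinal responses:
## P = n x k matrix of predicted class probabilities, y = integer levels 1..k
rps <- function(y, P) {
  Fhat <- t(apply(P, 1, cumsum))                 # cumulative predicted probabilities
  O    <- outer(y, seq_len(ncol(P)), FUN = "<=") # cumulative observed indicators
  mean(rowSums((Fhat - O)^2))
}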
1. Boosting (see step 2 below) is more stable and converges more quickly when the effects of categorical covariates (factors) are accounted for as a model offset. Therefore, the group lasso (least absolute shrinkage and selection operator, Breheny and Huang 2015, grpreg) – an algorithm that likely excludes non-relevant covariates and treats factors as groups – is used to select important factors for the offset. For ordinal responses, stepwise proportional odds logistic regression in both directions with BIC (e.g. Faraway 2005, p. 126) is used to select the offset covariates, because the lasso cannot be used for such responses.

2. Next, a subset of relevant factors, continuous covariates and spatial effects is selected by componentwise gradient boosting. Boosting is a slow stagewise additive learning algorithm. It expands f(\mathbf{x}(\mathbf{s})) in a set of base procedures (baselearners) and approximates the additive predictor by a finite sum of them as follows (Buehlmann and Hothorn 2007):

   (a) Initialize \hat{f}(\mathbf{x}(\mathbf{s}))^{[m]} with the offset of step 1 above and set m = 0.

   (b) Increase m by 1. Compute the negative gradient vector \mathbf{U}^{[m]} (e.g. residuals) for a loss function l(\cdot).

   (c) Fit all baselearners g(\mathbf{x}(\mathbf{s}))_{1..p} to \mathbf{U}^{[m]} and select the baselearner, say g(\mathbf{x}(\mathbf{s}))_{j}^{[m]}, that minimizes l(\cdot).

   (d) Update \hat{f}(\mathbf{x}(\mathbf{s}))^{[m]} = \hat{f}(\mathbf{x}(\mathbf{s}))^{[m-1]} + \upsilon \cdot g(\mathbf{x}(\mathbf{s}))_{j}^{[m]} with step size \upsilon \leq 1.

   (e) Iterate steps (b) to (d) until m = m_{stop} (the main tuning parameter); a toy implementation of this loop is sketched below.
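To make steps (a)–(e) concrete, here is a toy componentwise L2-boosting loop with simple linear baselearners on simulated data (an illustration only; geoGAM uses penalized spline baselearners via mboost):

## toy componentwise L2-boosting: 5 candidate covariates, only 2 relevant
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
y <- 2 * X[, 1] - X[, 3] + rnorm(n)

fhat    <- rep(mean(y), n) # (a) initialize with constant offset
upsilon <- 0.1             # step size
beta    <- numeric(5)      # accumulated coefficient per covariate
m.stop  <- 200

for (m in seq_len(m.stop)) {
  U <- y - fhat                                          # (b) negative L2 gradient = residuals
  b <- apply(X, 2, function(x) sum(U * x) / sum(x * x))  # (c) least-squares fit per baselearner
  rss <- sapply(1:5, function(j) sum((U - b[j] * X[, j])^2))
  j <- which.min(rss)                                    #     best-fitting baselearner
  fhat <- fhat + upsilon * b[j] * X[, j]                 # (d) damped update
  beta[j] <- beta[j] + upsilon * b[j]
}
round(beta, 2)  # clearly nonzero only for covariates 1 and 3: implicit selection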
The following settings are used in the above algorithm: As loss function l(\cdot), the L_2 loss is used for continuous responses, the negative binomial likelihood for binary responses (Buehlmann and Hothorn 2007) and the proportional odds likelihood for ordinal responses (Schmid et al. 2011). Early stopping of the boosting algorithm is achieved by determining the optimal m_{stop} by cross-validation. The default step size (\upsilon = 0.1) is used; this is not a sensitive parameter as long as it is clearly below 1 (Hofner et al. 2014). For continuous covariates, penalized smoothing spline baselearners (Kneib et al. 2009) are used. Factors are treated as linear baselearners. To capture residual autocorrelation, a bivariate tensor-product P-spline of the spatial coordinates (Wood 2006, pp. 162) is added to the additive predictor. Spatially varying effects are modelled by baselearners formed by multiplication of continuous covariates with tensor-product P-splines of the spatial coordinates (Wood 2006, pp. 168). Unequal degrees of freedom of baselearners bias baselearner selection (Hofner et al. 2011). Therefore, each baselearner is penalized to 5 degrees of freedom (df). Factors with fewer than 6 levels (df < 5) are aggregated to grouped baselearners. By using an offset, effects of important factors with more than 6 levels are implicitly accounted for without penalization.
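A minimal sketch of such a baselearner setup in mboost, with simulated stand-in data (the data and formula are assumptions for illustration, not geoGAM's internal code):

library(mboost)

set.seed(1)
d <- data.frame(x1 = runif(200),
                fac = factor(sample(letters[1:3], 200, replace = TRUE)),
                cx = runif(200), cy = runif(200))
d$y <- sin(2 * pi * d$x1) + (d$fac == "a") + 0.5 * d$cx + rnorm(200, sd = 0.3)

fit <- gamboost(y ~ bbs(x1, df = 5)                    # penalized smoothing spline, 5 df
                  + bols(fac)                          # linear baselearner for a small factor
                  + bspatial(cx, cy, df = 5)           # tensor-product P-spline surface
                  + bspatial(cx, cy, by = x1, df = 5), # spatially varying effect of x1
                data = d, family = Gaussian(),
                control = boost_control(mstop = 100, nu = 0.1))

## early stopping: determine the optimal mstop by cross-validation
cvr <- cvrisk(fit, folds = cv(model.weights(fit), type = "kfold"))
fit[mstop(cvr)]  # set the model to the optimal number of iterations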
3. At m_{stop} (see step 2 above), many of the included baselearners may have only very small effects. To remove these, the effect size e_j of each baselearner f_j(x_j(\mathbf{s})) is computed. For factors, the effect size e_j is the largest difference between the effects of two levels; for continuous covariates it equals the maximum contrast of the estimated partial effects (after removal of extreme values as in boxplots, Frigge et al. 1989). Generalized additive models (GAM, Wood 2011) are fitted, including smooth and factor effects depending on the effect size e_j of the corresponding baselearner j. The procedure iterates through e_j and excludes covariates with e_j smaller than a threshold effect size e_t. The optimal e_t is determined by 10-fold cross-validation of the GAM. In these GAM fits smooth effects are penalized to 5 degrees of freedom, as imposed by componentwise gradient boosting (step 2 above). The factors selected as offset in step 1 are included in the main GAM, which is now fitted without an offset.

4. The GAM is further reduced by stepwise removal of covariates by cross-validation. The candidate covariate to drop is chosen by the largest p value of F tests for linear factors and of approximate F tests (Wood 2011) for smooth terms; a sketch of this choice follows after the list.

5. Factor levels with similar estimated effects are merged stepwise, again by cross-validation, based on the largest p values from two-sample t-tests of partial residuals.

6. The final model (used to compute spatial predictions) is ideally a parsimonious GAM. Because of step 5, factors possibly have a reduced number of coefficients. Effects of continuous covariates are modelled by smooth functions and – if present at all – spatially structured residual variation (autocorrelation) is represented by a smooth spatial surface. To avoid over-fitting, both types of smooth effects are penalized to 5 degrees of freedom (as imposed by step 2).
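The choice of the drop candidate in step 4 can be sketched with mgcv, whose summary provides the approximate tests (simulated data; the cross-validation of each drop performed by geoGAM is omitted):

library(mgcv)

set.seed(1)
d <- data.frame(x1 = runif(200), x2 = runif(200), x3 = runif(200))
d$y <- sin(2 * pi * d$x1) + rnorm(200, sd = 0.3)  # x2 and x3 are pure noise

fit <- gam(y ~ s(x1, k = 6) + s(x2, k = 6) + s(x3, k = 6), data = d)

## approximate tests for smooth terms (Wood 2011); largest p value = drop candidate
p.smooth <- summary(fit)$s.table[, "p-value"]
names(which.max(p.smooth))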
Value

Object of class geoGAM:
offset.grplasso
    Cross-validation for the grouped lasso, object of class cv.grpreg.

offset.factors
    Character vector of factor names chosen for the offset computation. Empty for offset = FALSE.

gamboost
    Gradient boosting with smooth components, object of class gamboost.

gamboost.cv
    Cross-validation for gradient boosting, object of class cvrisk.

gamboost.mstop
    mstop used for gamboost.

gamback.cv
    List of cross-validation errors for the tuning parameter of effect size magnitude.

gamback.backward
    List of cross-validation error paths for the backward selection of covariates.

gamback.aggregation
    List(s) of cross-validation error paths for the aggregation of factor levels.

gam.final
    Final selected geoadditive model fit, object of class gam.

gam.final.cv
    Data frame with original response and cross-validation predictions.

gam.final.extern
    Data frame with original response data and predictions for validation.data.

data
    Original data frame for model calibration.

parameters
    List of parameters handed to geoGAM (used for the subsequent bootstrap of prediction intervals).
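A minimal sketch of accessing these components, reusing the small quakes example from the Examples section:

data(quakes)
set.seed(2)
quakes <- quakes[sample(1:nrow(quakes), 50), ]
fit <- geoGAM(response = "mag", covariates = c("depth", "stations"),
              data = quakes, seed = 2, max.stop = 5, cores = 1)

class(fit$gam.final)    # "gam": final model, usable with mgcv methods
fit$gamboost.mstop      # boosting iterations actually used
head(fit$gam.final.cv)  # response and cross-validation predictions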
Author(s)
M. Nussbaum
References
Breheny, P. and Huang, J., 2015. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Statistics and Computing, 25, 173–187.
Buehlmann, P. and Hothorn, T., 2007. Boosting algorithms: regularization, prediction and model fitting. Statistical Science, 22, 477–505, doi:10.1214/07-sts242.
Faraway, J. J., 2005. Linear Models with R, vol. 63 of Texts in Statistical Science. Chapman & Hall/CRC, Boca Raton.
Frigge, M., Hoaglin, D. C., and Iglewicz, B., 1989. Some implementations of the boxplot. The American Statistician, 43(1), 50–54.
Hastie, T. J. and Tibshirani, R. J., 1990. Generalized Additive Models, vol. 43 of Monographs on Statistics and Applied Probability. Chapman and Hall, London.
Hofner, B., Hothorn, T., Kneib, T., and Schmid, M., 2011. A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, 20(4), 956–971.
Hofner, B., Mayr, A., Robinzonov, N., and Schmid, M., 2014. Model-based boosting in R: a hands-on tutorial using the R package mboost. Computational Statistics, 29, 3–35, doi:10.1007/s00180-012-0382-5.
Hothorn, T., Mueller, J., Schroeder, B., Kneib, T., and Brandl, R., 2011. Decomposing environmental, spatial, and spatiotemporal components of species distributions. Ecological Monographs, 81, 329–347.
Kammann, E. E. and Wand, M. P., 2003. Geoadditive models. Journal of the Royal Statistical Society (C), 52(1), 1–18.
Kneib, T., Hothorn, T., and Tutz, G., 2009. Variable selection and model choice in geoadditive regression models. Biometrics, 65(2), 626–634.
Nussbaum, M., Walthert, L., Fraefel, M., Greiner, L., and Papritz, A., 2017. Mapping of soil properties at high resolution in Switzerland using boosted geoadditive models. SOIL, 3, 191–210, doi:10.5194/soil-3-191-2017.
Schmid, M., Hothorn, T., Maloney, K. O., Weller, D. E., and Potapov, S., 2011. Geoadditive regression modeling of stream biological condition. Environmental and Ecological Statistics, 18, 709–733, doi:10.1007/s10651-010-0158-4.
Tutz, G., 2012. Regression for Categorical Data. Cambridge University Press, doi:10.1017/cbo9780511842061.
Wilks, D. S., 2011. Statistical Methods in the Atmospheric Sciences, 3rd edn. Academic Press.
Wood, S. N., 2006. Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.
Wood, S. N., 2011. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73(1), 3–36.
See Also
The model selection is based on the packages grpreg (function cv.grpreg), MASS (function polr), mboost (functions gamboost, cv, cvrisk) and mgcv (function gam). For further information please see the documentation and vignettes of these packages.
Examples
### small examples with earthquake data
data(quakes)
set.seed(2)
quakes <- quakes[ sample(1:nrow(quakes), 50), ]
quakes.geogam <- geoGAM(response = "mag",
                        covariates = c("depth", "stations"),
                        data = quakes,
                        seed = 2,
                        max.stop = 5,
                        cores = 1)
summary(quakes.geogam)
data(quakes)
# create grouped factor with reduced number of levels
quakes$stations <- factor( cut( quakes$stations, breaks = c(0,15,19,23,30,39,132)) )
quakes.geogam <- geoGAM(response = "mag",
                        covariates = c("stations", "depth"),
                        coords = c("lat", "long"),
                        data = quakes,
                        max.stop = 10,
                        cores = 1)
summary(quakes.geogam)
summary(quakes.geogam, what = "path")
## Use soil data set of soil mapping study area near Berne
data(berne)
set.seed(1)
# Split data sets and
# remove rows with missing values in response and covariates
d.cal <- berne[ berne$dataset == "calibration" & complete.cases(berne), ]
d.val <- berne[ berne$dataset == "validation" & complete.cases(berne), ]
### Model selection for continuous response
ph10.geogam <- geoGAM(response = "ph.0.10",
                      covariates = names(d.cal)[14:ncol(d.cal)],
                      coords = c("x", "y"),
                      data = d.cal,
                      offset = TRUE,
                      sets = mboost::cv(rep(1, nrow(d.cal)), type = "kfold"),
                      validation.data = d.val,
                      cores = 1)
summary(ph10.geogam)
summary(ph10.geogam, what = "path")
### Model selection for binary response
waterlog100.geogam <- geoGAM(response = "waterlog.100",
                             covariates = names(d.cal)[c(14:54, 56:ncol(d.cal))],
                             coords = c("x", "y"),
                             data = d.cal,
                             offset = FALSE,
                             sets = sample( cut(seq(1, nrow(d.cal)), breaks = 10, labels = FALSE) ),
                             validation.data = d.val,
                             cores = 1)
summary(waterlog100.geogam)
summary(waterlog100.geogam, what = "path")
### Model selection for ordered response
dclass.geogam <- geoGAM(response = "dclass",
                        covariates = names(d.cal)[14:ncol(d.cal)],
                        coords = c("x", "y"),
                        data = d.cal,
                        offset = TRUE,
                        non.stationary = TRUE,
                        seed = 1,
                        validation.data = d.val,
                        cores = 1)
summary(dclass.geogam)
summary(dclass.geogam, what = "path")