boral {boral} | R Documentation |
Fitting boral (Bayesian Ordination and Regression AnaLysis) models
Description
Bayesian ordination and regression models for analyzing multivariate data in ecology. Three "types" of models may be fitted: 1) With covariates and no latent variables, boral fits independent response GLMs; 2) With no covariates, boral fits a pure latent variable model; 3) With covariates and latent variables, boral fits correlated response GLMs.
Usage
boral(y, ...)
## Default S3 method:
boral(y, X = NULL, formula.X = NULL, X.ind = NULL,
traits = NULL, which.traits = NULL, family, trial.size = 1,
lv.control = list(num.lv = 0, type = "independent", distmat = NULL),
row.eff = "none", row.ids = NULL, ranef.ids = NULL,
offset = NULL, save.model = FALSE, calc.ics = FALSE,
mcmc.control = list(n.burnin = 10000, n.iteration = 40000,
n.thin = 30, seed = NULL),
prior.control = list(type = c("normal","normal","normal","uniform"),
hypparams = c(10, 10, 10, 30), ssvs.index = -1, ssvs.g = 1e-6,
ssvs.traitsindex = -1), do.fit = TRUE, model.name = NULL, num.lv = NULL, ...)
## S3 method for class 'boral'
print(x, ...)
Arguments
y |
A response matrix of multivariate data e.g., counts, binomial or Bernoulli responses, continuous response, and so on. With multivariate abundance data ecology for instance, rows correspond to sites and columns correspond to species. Any categorical (multinomial) responses must be converted to integer values. For ordinal data, the minimum level of the response matrix must be 1 instead of 0. |
X |
Either a model matrix of covariates (otherwise known as the covariate matrix), which is included as part of the model, or a data frame from which the argument |
X.ind |
An matrix of 1s and 0s, indicating whether a particular covariate should be included (1) or excluded (0) in the mean structure of a particular response. The matrix should the number of rows equal to the number of columns in the response matrix, and the number of columns equal to the number of columns in the covariate matrix Defaults to |
formula.X |
an object of class "formula", which represents a symbolic description of the covariate matrix to be created (based on the this argument along with the Note that if this argument is supplied, then after fitting boral will return an |
x |
An object for class "boral". |
traits |
A model matrix of species traits (otherwise known as the trait matrix), which can be included as part of the model. Defaults to |
which.traits |
A list of length equal to (number of predictor variables in the model as implied by For example, if Defaults to |
family |
Either a single element, or a vector of length equal to the number of columns in the response matrix. The former assumes all columns of the response matrix come from this distribution. The latter option allows for different distributions for each column of the response matrix. Elements can be one of "binomial" (with probit link), "poisson" (with log link), "negative.binomial" (with log link), "normal" (with identity link), "lnormal" for lognormal (with log link), "tweedie" (with log link), "exponential" (with log link), "gamma" (with log link), "beta" (with logit link), "ordinal" (cumulative probit regression), "ztpoisson" (zero truncated Poisson with log link), "ztnegative.binomial" (zero truncated negative binomial with log link). Please see |
trial.size |
Either equal to a single element, or a vector of length equal to the number of columns in y. If a single element, then all columns assumed to be binomially distributed will have trial size set to this. If a vector, different trial sizes are allowed in each column of y. The argument is ignored for all columns not assumed to be binomially distributed. Defaults to 1, i.e. Bernoulli distribution. |
lv.control |
A list (currently) with the following arguments:
Please see |
row.eff |
Single element indicating whether (multiple) row effects are included as fixed effects ("fixed"), random effects ("random") or not included ("none") in the model. If fixed effects, then for parameter identifiability the first row effect is set to zero, which analogous to acting as a reference level when dummy variables are used. If random effects, they are drawn from a normal distribution with mean zero and unknown variance, analogous to a random intercept in mixed models. Defaults to "none". |
row.ids |
A matrix with the number of rows equal to the number of rows in the response matrix, and the number of columns equal to the number of row effects to be included in the model. Element |
ranef.ids |
A matrix with the number of rows equal to the number of rows in the response matrix, and the number of columns equal to the number of random intercepts to be included in the model. Element |
offset |
A matrix with the same dimensions as the response matrix, specifying an a-priori known component to be included in the linear predictor during fitting. Defaults to |
save.model |
If |
calc.ics |
If |
mcmc.control |
A list of parameters for controlling the MCMC sampling. Values not set will assume default values. These include:
|
prior.control |
A list of parameters for controlling the prior distributions. Values not set will assume default values. These include:
|
do.fit |
If |
model.name |
Name of the text file that the JAGS script is written to. Defaults to |
num.lv |
Old argument superceded by |
... |
Not used. |
Details
The boral package is designed to fit three types of models which may be useful in ecology (and probably outside of ecology as well =D).
Independent response models: boral allows explanatory variables to be entered into the model via the covariate matrix \bm{X}
. If factors are to be included, then they should be parameterized using dummy variables. It should NOT include an intercept column. Alternatively, users can make use of the formula.X
function to create the covariate model.
Without latent variables, i.e. lv.control$num.lv = 0
, boral fits separate GLMs to each column of the n \times p
response matrix \bm{Y}
, where the columns are assumed to be independent.
g(\mu_{ij}) = \alpha_i + \beta_{0j} + \bm{x}^\top_i\bm{\beta}_j + \bm{z}^\top_i\bm{b}_j; \quad i = 1,\ldots,n; j = 1,\ldots,p,
where the mean response for element (i,j), denoted as mu_{ij}
, is regressed against the covariates \bm{x}_i
via a link function g(\cdot)
. The quantities beta_{0j}
and \bm{beta}_j
denote the response-specific intercepts and coefficients respectively, while alpha_i
is an optional row effect that may be treated as a fixed or random effect, and \bm{z}^\top_i\bm{b}_j
denote an optional set of response-specific random intercepts. Not all of these components may be included in the model, and the above is just representing the general case. If the included row effects are assumed to be fixed, then the first row effect is constrained to be zero for parameter identifiability reasons. If the include row effects are assumed to be random then they are drawn from a normal distribution with unknown variance \phi^2
. One reason we might want to include row effects is to account differences in sampling intensity between sites: these can lead to differences in site total abundance, and so by including fixed effects they play the same role as an offset to account for these differences.
boral can also handle multiple, hierarchical row effects, which may be useful to account for sampling design. This is controlled using the row.ids
argument. For example, if the first five rows of y
correspond to replications from site 1, the next five rows correspond to replications from site 2, and so on, then one can set row.ids = matrix(c(1,1,1,1,1,2,2,2,2,2,...), ncol = 1)
to take this in account. While this way of coding row effects via the row.ids
argument takes some getting used to, it has been done this way partly to force the user to think more carefully about exactly the structure of the data i.e., with great power comes great responsibility...
Aside from row effects, and another more flexible way to account for sampling design but on a response-specific basis, boral allows users to include response-specific random intercepts. Please see about.ranefs
for more information on this. Note response-specific random intercepts are permitted for all three types of models discussed. Akin to other packages such as lme4
or glmmTMB
, all random intercepts are assumed to be normally distributed with the corresponding variance components are assumed to be response-specific.
If offset
is supplied, then it will be included in the linear predictor below (and all linear predictors below where appropriate).
Without row effects, the above independent response model is basically a Bayesian analog of the manyglm
function in the mvabund
package (Wang et al., 2013). Note that X.ind
argument can be optionally used to manually force certain covariates to be included in (1) or excluded from (0) from the mean structure of specific responses.
Pure latent variable models: If no explanatory variables are included and lv.control$num.lv
> 0, boral fits a pure latent variable model to perform model-based unconstrained ordination (Hui et al., 2014),
g(\mu_{ij}) = \alpha_i + \beta_{0j} + \bm{z}^\top_i\bm{b}_j + \bm{u}\top_i\bm{\theta}_j,
where instead of measured covariates, we now have a vector of latent variables \bm{u}_i
with \bm{\theta}_j
being the response-specific coefficients relating to these latent variables. The response-specific intercept, \beta_{0j}
, accounts for differences between species prevalence, while the row effect, alpha_i
, is included to account for differences in site total abundance (typically assuming a fixed effect, row.eff = "fixed"
, although see Jamil and ter Braak, 2013, for a motivation for using random site effects), so that the ordination is then in terms of species composition. If \alpha_i
is omitted from the model i.e., row.eff = FALSE
, then the ordination will be in terms of relative species abundance. As mentioned previously, one reason for including fixed row effects is to account of any potential differences in sampling intensity between sites.
As with the other types of models, \alpha_i
can be replaced with more sophisticated multiple, hierarchical row effects, and/or response-specific random intercepts given by \bm{z}^\top_i\bm{b}_j
are optional.
Unconstrained ordination is used for visualizing multivariate data in a low-dimensional space, without reference to covariates (Chapter 9, Legendre and Legendre, 2012). Typically, lv.control$num.lv
= 1 to 3 latent variables is used, allowing the latent variables to plotted (using lvsplot
, for instance). The resulting plot can be interpreted in the same manner as plots from Nonmetric Multi-dimensional Scaling (NMDS, Kruskal, 1964) and Correspondence Analysis (CA, Hill, 1974), for example. A biplot can also be constructed by setting biplot = TRUE
when using lvsplot
, so that both the latent variables and their corresponding coefficients are plotted. For instance, with multivariate abundance data, biplots are used to visualize the relationships between sites in terms of species abundance or composition, as well as the indicator species for the sites.
Finally, boral offers a small number of options for allowing the latent variables to be correlated across rows of the responses. This may be useful when one has a-priori information about correlation between sites e.g., spatial correlation, which cannot be systematically accounted for by the inclusion of random effects (Thorson et al., 2015, 2016; Ovaskainen et al., 2016, 2017).. Please see the help file about.lvs
for more information on this. By default, boral assumes the latent variables are independent standard normally distributed across rows. Note the use of a non-independence correlation structure massively increases computation time!
Correlated response models: When one or more latent variables are included in conjunction with covariates i.e., a covariate matrix is supplied and lv.control$num.lv
> 1, boral fits separate GLMs to each column of the response matrix \bm{Y}
while allowing for residual correlation between columns via the latent variables. This is quite useful for multivariate abundance data in ecology, where a separate GLM is fitted to species (modeling its response against environmental covariates), while accounting for the fact species at a site are likely to be correlated for reason other than similarities in environmental responses, e.g. biotic interaction, phylogeny, and so on. Correlated response model take the following form,
g(\mu_{ij}) = \alpha_i + \beta_{0j} + \bm{x}^\top_i\bm{\beta}_j + \bm{z}^\top_i\bm{b}_j + \bm{u}^\top_i\bm{\theta}_j.
This model is thus a combinaton of the first two types of models. The linear predictor \bm{u}^\top_i\bm{\theta}_j
induces a residual covariance between the columns of the response matrix \bm{y}
(which is of rank lv.control$num.lv
). For multivariate abundance data, this leads to a parsimonious method of accounting for correlation between species not due to the shared environmental responses. After fitting the model, the residual correlation matrix then can be obtained via the get.residual.cor
function. Note lv.control$num.lv
> 1 is necessarily in order to flexibly model the residual correlations; see Pollock et al. (2014) for residual correlation matrices in the context of Joint Species Distribution Models, Ovaskainen et al., (2017), and Warton et al. (2015, 2016) for an overview of latent variable models in multivariate ecology.
As with the other types of models, \alpha_i
can be replaced with more sophisticated multiple, hierarchical row effects, and/or response-specific random intercepts given by \bm{z}^\top_i\bm{b}_j
are optional.
Distributions: A variety of families are available in boral, designed to handle multivariate abundance data of varying response types. Please see about.distributions
for more information on this.
Including species traits: When covariates are included i.e. both the independent and correlated response models, one has the option of also including traits to help explain differences in species environmental responses to these covariates. Please see about.traits
for more information on this.
Including response-specific random intercepts: For some types of data e.g. nested designs, it may be appropriate to include response-specific random intercepts. This can be considered as like row effects, but now the effects are specific to each response rather than the same across all responses. Please see about.ranefs
for more information on this. Note as a consequence, it is usually not recommended to have both random row effects and response-specific random intercepts simultaneously in the same model (unless they are were at different levels of of the data).
Estimation: Estimation for all models is performed using Bayesian Markov Chain Monte Carlo (MCMC) methods via JAGS (Plummer, 2003). Please note that only one MCMC chain in run: this point is discussed later in this help file. Regarding prior distributions, the default settings, based on the prior.control
argument, are as follows:
Normal priors with mean zero and variance given by element 1 in
hypparams
are assigned to all intercepts, cutoffs for ordinal responses, and row effects.Normal priors with mean zero and variance given by element 2 in
hypparams
are assigned coefficients relating to latent variables,\bm{\theta}_j
.Normal priors with mean zero and variance given by element 3
hypparams
are assigned to all coefficients relating to covariates in\bm{\beta}_j
. If traits are included, the same normal priors are assigned to the\kappa
's, and the standard deviation\sigma_k
are assigned uniform priors with maximum equal to element 4 inhypparams
.For the negative binomial, normal, lognormal, and tweedie distributions, uniform priors with maximum equal to element 4 in
hypparams
are used on the dispersion parameters. Please note that for the normal and lognormal distributions, these uniform priors are assigned to the standard deviations\phi
(see Gelman, 2006). If there are any variance (standard deviation, to be precise) parameters in the model, then these are also assigned uniform priors with maximum equal to element 4 inhypparams
e.g., standard deviation of the normal random effects if the row effects are assumed to random, the standard deviation of the normal random response-specific intercepts if more than two columns are ordinal responses, and the standard deviation of the normal random response-specific random intercepts whenranef.ids
is supplied etc...
Using information criteria at your own risk: Using information criterion from calc.ics = TRUE
for model selection should be done with extreme caution, for two reasons: 1) The implementation of some of these criteria is heuristic and experimental, 2) Deciding what model to fit should also be driven by the science and questions of interest. For example, it may be the case that a criterion suggests a model with 3 or 4 latent variables is more appropriate. However, if we are interested in visualizing the data for ordination purposes, then models with 1 or 2 latent variables are more appropriate. As another example, whether or not we include row effects when ordinating multivariate abundance data depends on if we are interested in differences between sites in terms of relative species abundance (row.eff = "none"
) or species composition (row.eff = "fixed"
). From version 1.6, the calculation of all information criteria is being gradually phased out!
SSVS: Stochastic search variable selection (SSVS, George and McCulloch, 1993) is also implemented for the response-specific coefficients \bm{\beta}_j
. Please see about.ssvs
for more information on this approach.
Value
An object of class "boral" is returned, being a list containing (but not limited to) the following components where applicable:
call |
The matched call. |
lv.coefs.mean/median/sd/iqr |
Matrices containing the mean/median/standard deviation/interquartile range of the posterior distributions of the latent variable coefficients. This also includes the response-specific intercepts, and dispersion parameters if appropriate. |
lv.mean/median/sd/iqr |
A matrix containing the mean/median/standard deviation/interquartile range of the posterior distributions of the latent variables. |
lv.covparams.mean/median/sd/iqr |
A matrix containing the mean/median/standard deviation/interquartile range of the posterior distributions for the parameters characterizing the correlation structure of the latent variables when they are assumed to be non-independent across rows. |
X.coefs.mean/median/sd/iqr |
Matrices containing the mean/median/standard deviation/interquartile range of the posterior distributions of the response-specific coefficients relating to the covariate matrix. |
traits.coefs.mean/median/sd/iqr |
Matrices containing the mean/median/standard deviation/interquartile range of the posterior distributions of the coefficients and standard deviation relating to the species traits; please see |
cutoffs.mean/median/sd/iqr |
Vectors containing the mean/median/standard deviation/interquartile range of the posterior distributions of the common cutoffs for ordinal responses (please see the not-so-brief tangent on distributions above). |
ordinal.sigma.mean/median/sd/iqr |
Scalars containing the mean/median/standard deviation/interquartile range of the posterior distributions of the standard deviation for the random intercept normal distribution corresponding to the ordinal response columns. |
powerparam.mean/median/sd/iqr |
Scalars for the mean/median/standard deviation/interquartile range of the posterior distributions of the common power parameter for tweedie responses (please see the not-so-brief tangent on distributions above). |
row.coefs.mean/median/sd/iq |
A list with each element containing the vectors of the mean/median/standard deviation/interquartile range of the posterior distributions of the row effects. The length of the list is equal to the number of row effects included i.e., |
row.sigma.mean/median/sd/iqr |
A list with each element containing the mean/median/standard deviation/interquartile range of the posterior distributions of the standard deviation for the row random effects normal distribution. The length of the list is equal to the number of row effects included i.e., |
ranef.coefs.mean/median/sd/iq |
A list with each element containing the matrices of the mean/median/standard deviation/interquartile range of the posterior distributions of the response-specific random intercepts. The length of the list is equal to the number of random intercepts included i.e., |
ranef.sigma.mean/median/sd/iqr |
A matrix containing the mean/median/standard deviation/interquartile range of the posterior distributions of the standard deviation for the response-specific random intercept normal distribution. The number of columns of the matrix is equal to number of random intercepts included i.e., |
ssvs.indcoefs.mean/ssvs.indcoefs.sd |
Matrices containing posterior probabilities and associated standard deviation for individual SSVS of coefficients in the covariate matrix. |
ssvs.gpcoefs.mean/ssvs.gpcoefs.sd |
Matrices containing posterior probabilities and associated standard deviation for group SSVS of coefficients in the covariate matrix. |
ssvs.traitscoefs.mean/ssvs.traitscoefs.sd |
Matrices containing posterior probabilities and associated standard deviation for individual SSVS of coefficients relating to species traits. |
hpdintervals |
A list containing components which correspond to the lower and upper bounds of highest posterior density (HPD) intervals for all the parameters indicated above. Please see |
ics |
If |
jags.model |
If |
geweke.diag |
A list with two elements. The first element is itself a list containing the Geweke convergence diagnostic (Z-scores) for all appropriate parameters in the model. The second element contains the proportion of Z-scores that whose corresponding p-value is less than 0.05. No adjustment is made for multiple comparison on the p-values. Please see the section Why is only one MCMC chain run? for more information on this diagnostic. |
n , p , family , trial.size , num.lv , ... |
Various attributes of the model fitted, including the dimension of the response matrix, the response and model matrix used, distributional assumptions and trial sizes, number of latent variables, the number of covariates and traits, hyperparameters used in the Bayesian estimation, indices for SSVS, the number of levels for ordinal responses, and n.burnin, n.iteration and n.thin. |
Why is only one MCMC chain run?
Much like the MCMCfactanal
function in the MCMCpack
package (Martin et al., 2011) for conducting factor analysis, which is a special case of the pure latent variable model with Gaussian responses, boral deliberately runs only one MCMC chain. This runs contrary to the recommendation of most Bayesian analyses, where the advice is to run multiple MCMC chains and check convergence using (most commonly) the Gelman-Rubin statistic (Gelman et al., 2013). The main reason for this is that, in the context of MCMC sampling, the latent variable model is invariant to a switch of the sign, i.e. \bm{u}^\top_i\bm{\theta}_j = (-\bm{u}^\top_i(-\bm{\theta}_j)
, and so is actually unidentifiable.
As a result of sign-switching, different MCMC chains can produce latent variables and corresponding coefficients values that, while having similar magnitudes, will be different in sign. Consequently, combining MCMC chains and checking Rhats, computing posterior means and medians etc...becomes complicated (in principle, one way to resolve this problem would be to post-process the MCMC chains and deal with sign switching, but this is really hard!). Therefore, to alleviate this issue together, boral chooses to only run one MCMC chain.
What does this mean for the user?
boral automatically calculates the Geweke convergence diagnostic (Geweke, 1992), which is a diagnostic applicable with only one MCMC chain; please see the help file
geweke.diag
in thecoda
package for more information. The output is a list containing Z-scores for the appropriate parameters in the model, and each score can be interpreted in the same manner as the test statistic from conducting a Z-test i.e., if the score exeeds roughly 1.96 then the p-value is less than 0.05, and there is evidence the MCMC chain (for this particular parameter) has not converged.The output from boral also provides the proportion of Z-scores whose corresponding p-values are less than 0.05. Of course, because there are a large number of parameters in the model, then there are large number of Z-scores, and boral does not make any multiple comparison adjustment for this when calculating the number of “significant" Z-scores. If you do indeed want to use this diagnostic to formally check for convergence, then we recommend you conduct some adjustment e.g., using Holm's method, by doing something such as
gew.pvals <- 2*pnorm(abs(unlist(fit$geweke.diag[[1]])), lower.tail = FALSE)
and thenp.adjust(gew.pvals, method = "holm")
.For checking convergence, we recommend you look at trace plots of the MCMC chains. Using the
coda
package, which is automatically loaded when theboral
package is loaded, try something likeplot(get.mcmcsamples(fit))
.If you have a lot of data, e.g. lots of sites compared to species, sign-switching tends to be less of problem and pops up less often.
Warnings
-
No intercept column should be included in the covariate matrix. Column-specific intercepts are estimated automatically and given by the first column of
lv.coefs
. Similarly, no intercept column should be included in the trait matrix, as it is included automatically. As of version 1.6, functions to calculate information criteria along with
calc.marglogLik
are no longer updated, and being phrased out!MCMC with a non-independence correlation structure for the latent variables takes an especially long time to run! Likewise, MCMC with lots of ordinal columns take an especially long time to run! Moreover, estimates for the cutoffs in cumulative probit regression may be poor for levels with little data. Major apologies for this advance =(
Summaries of the coefficients such as posterior medians and HPD intervals may also be problematic when SSVS is being used, since the posterior distribution will be multi-modal.
If
save.model = TRUE
, the raw jags model is also returned. This can be quite very memory-consuming, since it indirectly saves all the MCMC samples.
Author(s)
Francis K.C. Hui [aut, cre], Wade Blanchard [aut]
Maintainer: Francis K.C. Hui <fhui28@gmail.com>
References
Gelman A. (2006) Prior distributions for variance parameters in hierarchical models. Bayesian Analysis 1, 515-533.
Gelman et al. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360-1383.
Gelman et al. (2013) Bayesian Data Analysis. CRC Press.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 85, 398-409.
Geweke, J. (1992) Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bayesian Statistics 4 (editors JM Bernado, JO Berger, AP Dawid and AFM Smith). Clarendon Press.
Hui et al. (2014). Model-based approaches to unconstrained ordination. Methods in Ecology and Evolution, 6, 399-411.
Hill, M. O. (1974). Correspondence analysis: a neglected multivariate method. Applied statistics, 23, 340-354.
Jamil, T., and ter Braak, C.J.F. (2013). Generalized linear mixed models can detect unimodal species-environment relationships. PeerJ 1: e95.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29, 115-129.
Legendre, P. and Legendre, L. (2012). Numerical ecology, Volume 20. Elsevier.
Martin et al. (2011). MCMCpack: Markov Chain Monte Carlo in R. Journal of Statistical Software, 42, 1-21. URL: http://www.jstatsoft.org/v42/i09/.
McLachlan, G., and Peel, D. (2004). Finite Mixture Models. Wiley.
Ovaskainen, et al. (2016). Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models. Methods in Ecology and Evolution, 7, 428-436.
Ovaskainen, et al. (2017). How to make more out of community data? A conceptual framework and its implementation as models and software. Ecology Letters, 20, 561-576.
Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing. March (pp. 20-22).
Pollock, et al. (2014). Understanding co-occurrence by modelling species simultaneously with a Joint Species Distribution Model (JSDM). Methods in Ecology and Evolution, 5, 397-406.
Skrondal, A., and Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal, and structural equation models. CRC Press.
Thorson, et al. (2016). Joint dynamic species distribution models: a tool for community ordination and spatio-temporal monitoring. Global Ecology and Biogeography, 25, 1144-1158
Thorson, et al. (2015). Spatial factor analysis: a new tool for estimating joint species distributions and correlations in species range. Methods in Ecology and Evolution, 6, 627-63
Warton et al. (2012). Distance-based multivariate analyses confound location and dispersion effects. Methods in Ecology and Evolution, 3, 89-101.
Warton et al. (2015). So Many Variables: Joint Modeling in Community Ecology. Trends in Ecology and Evolution, 30, 766-779.
Warton et al. (2016). Extending joint models in community ecology: A response to Beissinger et al. Trends in ecology & evolution, 31, 737-738.
Wang et al. (2013).
mvabund
: statistical methods for analysing multivariate abundance data. R package version 3.8.4.
See Also
calc.varpart
to calculate variance partitioning of the covariates,
coefsplot
for horizontal line or "caterpillar plot" of the regression coefficients corresponding to the covariate matrix (if applicable),
get.enviro.cor
and get.residual.cor
for calculating the correlation matrix between the explanatory variables in the covariate matrix and the residual correlation matrix respectively,
lvsplot
for a scatter plot of the latent variables (and their coefficients if applicable),
predict.boral
for calculating predictions from a fitted model.
ranefsplot
for horizontal line or "caterpillar plot" of the response-specific random effects predictons (if applicable),
summary.boral
for a summary of the fitted model,
Examples
library(mvabund) ## Load a dataset from the mvabund package
data(spider)
y <- spider$abun
X <- scale(spider$x)
n <- nrow(y)
p <- ncol(y)
## NOTE: The values below MUST NOT be used in a real application;
## they are only used here to make the examples run quick!!!
example_mcmc_control <- list(n.burnin = 10, n.iteration = 100,
n.thin = 1)
testpath <- file.path(tempdir(), "jagsboralmodel.txt")
## Example 1 - model with two latent variables, site effects,
## and no environmental covariates
spiderfit_nb <- boral(y, family = "negative.binomial",
lv.control = list(num.lv = 2), row.eff = "fixed",
mcmc.control = example_mcmc_control, model.name = testpath)
summary(spiderfit_nb)
par(mfrow = c(2,2))
plot(spiderfit_nb) ## Plots used in residual analysis,
## Used to check if assumptions such an mean-variance relationship
## are adequately satisfied.
lvsplot(spiderfit_nb) ## Biplot of the latent variables,
## which can be interpreted in the same manner as an ordination plot.
## Not run:
## Example 2a - model with no latent variables, no site effects,
## and environmental covariates
spiderfit_nb <- boral(y, X = X, family = "negative.binomial",
mcmc.control = example_mcmc_control, model.name = testpath)
## Alternatively, you can use the formula.X argument for more custom
## creation of a covariate matrix. For example, for have a covariate
## including all predictors except bare sand:
# spiderfit_nb <- boral(y, X = X, formula.X = ~ . - bare.sand,
# family = "negative.binomial", mcmc.control = example_mcmc_control,
# model.name = testpath)
summary(spiderfit_nb)
## The results can be compared with the default example from
## the manyglm() function in mvabund.
## Caterpillar plots for the coefficients
par(mfrow=c(2,3), mar = c(5,6,1,1))
sapply(colnames(spiderfit_nb$X), coefsplot, object = spiderfit_nb)
## Example 2b - suppose now, for some reason, the 28 rows were
## sampled such into four replications of seven sites
## Let us account for this as a fixed effect
spiderfit_nb <- boral(y, X = X, family = "negative.binomial",
row.eff = "fixed", row.ids = matrix(rep(1:7,each=4),ncol=1),
mcmc.control = example_mcmc_control, model.name = testpath)
spiderfit_nb$row.coefs
## Example 2c - suppose now, for some reason, the 28 rows reflected
## a nested design with seven regions, each with four sub-regions
## We can account for this nesting as a random effect
spiderfit_nb <- boral(y, X = X, family = "negative.binomial",
row.eff = "random",
row.ids = cbind(1:n, rep(1:7,each=4)),
mcmc.control = example_mcmc_control, model.name = testpath)
spiderfit_nb$row.coef.median
## Example 2d - model with environmental covariates and
## two structured latent variables using fake distance matrix
fakedistmat <- as.matrix(dist(1:n))
spiderfit_lvstruc <- boral(y, X = X, family = "negative.binomial",
lv.control = list(num.lv = 2, type = "exponential", distmat = fakedistmat),
mcmc.control = example_mcmc_control, model.name = testpath)
summary(spiderfit_lvstruc)
## Example 2e - Similar to 2c, but we will species-specific random intercepts
## for the seven regions (with row effects in the model)
spiderfit_nb <- boral(y, X = X, family = "negative.binomial",
ranef.ids = data.frame(region = rep(1:7,each=4)),
mcmc.control = example_mcmc_control, model.name = testpath)
spiderfit_nb$ranef.coefs.median
spiderfit_nb$ranef.sigma.median
## Example 3a - Extend example 2 to demonstrate grouped covariate selection
## on the last three covariates.
example_prior_control <- list(type = c("normal","normal","normal","uniform"),
ssvs.index = c(-1,-1,-1,1,2,3))
spiderfit_nb2 <- boral(y, X = X, family = "negative.binomial",
mcmc.control = example_mcmc_control,
prior.control = example_prior_control, model.name = testpath)
summary(spiderfit_nb2)
## Example 3b - Extend example 2 to demonstrate individual covariate selection
## on the last three covariates.
example_prior_control <- list(type = c("normal","normal","normal","uniform"),
ssvs.index = c(-1,-1,-1,0,0,0))
spiderfit_nb3 <- boral(y, X = X, family = "negative.binomial",
mcmc.control = example_mcmc_control, prior.control = example_prior_control,
model.name = testpath)
summary(spiderfit_nb3)
## Example 4 - model fitted to presence-absence data, no site effects, and
## two latent variables
data(tikus)
y <- tikus$abun
y[y > 0] <- 1
y <- y[1:20,] ## Consider only years 1981 and 1983
y <- y[,apply(y > 0,2,sum) > 2] ## Consider only spp with more than 2 presences
tikusfit <- boral(y, family = "binomial",
lv.control = list(num.lv = 2), mcmc.control = example_mcmc_control,
model.name = testpath)
lvsplot(tikusfit, biplot = FALSE)
## A strong location between the two sampling years
## Example 5a - model fitted to count data, no site effects, and
## two latent variables, plus traits included to explain environmental responses
data(antTraits)
y <- antTraits$abun
X <- as.matrix(scale(antTraits$env))
## Include only traits 1, 2, and 5
traits <- as.matrix(antTraits$traits[,c(1,2,5)])
example_which_traits <- vector("list",ncol(X)+1)
for(i in 1:length(example_which_traits))
example_which_traits[[i]] <- 1:ncol(traits)
## Just for fun, the regression coefficients for the second column of X,
## corresponding to the third element in the list example_which_traits,
## will be estimated separately and not regressed against traits.
example_which_traits[[3]] <- 0
fit_traits <- boral(y, X = X, traits = traits,
lv.control = list(num.lv = 2),
which.traits = example_which_traits, family = "negative.binomial",
mcmc.control = example_mcmc_control, model.name = testpath,
save.model = TRUE)
summary(fit_traits)
## Example 5b - perform selection on trait coefficients
ssvs_traitsindex <- vector("list",ncol(X)+1)
for(i in 1:length(ssvs_traitsindex))
ssvs_traitsindex[[i]] <- rep(0,ncol(traits))
ssvs_traitsindex[[3]] <- -1
fit_traits <- boral(y, X = X, traits = traits, which.traits = example_which_traits,
family = "negative.binomial",
lv.control = list(num.lv = 2), mcmc.control = example_mcmc_control,
save.model = TRUE, prior.control = list(ssvs.traitsindex = ssvs_traitsindex),
model.name = testpath)
summary(fit_traits)
## Example 6 - simulate Bernoulli data, based on a model with two latent variables,
## no site variables, with two traits and one environmental covariates
## This example is a proof of concept that traits can used to
## explain environmental responses
library(mvtnorm)
n <- 100; s <- 50
X <- as.matrix(scale(1:n))
colnames(X) <- c("elevation")
traits <- cbind(rbinom(s,1,0.5), rnorm(s))
## one categorical and one continuous variable
colnames(traits) <- c("thorns-dummy","SLA")
simfit <- list(true.lv = rmvnorm(n, mean = rep(0,2)),
lv.coefs = cbind(rnorm(s), rmvnorm(s, mean = rep(0,2))),
traits.coefs = matrix(c(0.1,1,-0.5,1,0.5,0,-1,1), 2, byrow = TRUE))
rownames(simfit$traits.coefs) <- c("beta0","elevation")
colnames(simfit$traits.coefs) <- c("kappa0","thorns-dummy","SLA","sigma")
simy = create.life(true.lv = simfit$true.lv, lv.coefs = simfit$lv.coefs, X = X,
traits = traits, traits.coefs = simfit$traits.coefs, family = "binomial")
example_which_traits <- vector("list",ncol(X)+1)
for(i in 1:length(example_which_traits))
example_which_traits[[i]] <- 1:ncol(traits)
fit_traits <- boral(y = simy, X = X, traits = traits,
which.traits = example_which_traits, family = "binomial",
lv.control = list(num.lv = 2), save.model = TRUE,
mcmc.control = example_mcmc_control, model.name = testpath)
## End(Not run)