sjSDM {sjSDM} | R Documentation |
Fitting scalable joint Species Distribution Models (sjSDM)
Description
sjSDM
is used to fit joint Species Distribution models (jSDMs) using the central processing unit (CPU) or the graphical processing unit (GPU).
The default is a multivariate probit model based on a Monte-Carlo approximation of the joint likelihood.
sjSDM
can be used to fit linear but also deep neural networks and supports the well known formula syntax.
Usage
sjSDM(
Y = NULL,
env = NULL,
biotic = bioticStruct(),
spatial = NULL,
family = stats::binomial("probit"),
iter = 100L,
step_size = NULL,
learning_rate = 0.01,
se = FALSE,
sampling = 100L,
parallel = 0L,
control = sjSDMControl(),
device = "cpu",
dtype = "float32",
seed = 758341678
)
sjSDM.tune(object)
Arguments
Y |
matrix of species occurrences/responses in range |
env |
matrix of environmental predictors, object of type |
biotic |
defines biotic (species-species associations) structure, object of type |
spatial |
|
family |
error distribution with link function, see details for supported distributions |
iter |
number of fitting iterations |
step_size |
batch size for stochastic gradient descent, if |
learning_rate |
learning rate for Adamax optimizer |
se |
calculate standard errors for environmental coefficients |
sampling |
number of sampling steps for Monte Carlo integration |
parallel |
number of cpu cores for the data loader, only necessary for large datasets |
control |
control parameters for optimizer, see |
device |
which device to be used, "cpu" or "gpu" |
dtype |
which data type, most GPUs support only 32 bit floats. |
seed |
seed for random operations |
object |
object of type |
Details
The function fits per default a multivariate probit model via Monte-Carlo integration (see Chen et al., 2018) of the joint likelihood for all species.
Model description
The most common jSDM structure describes the site (\(i = 1, ..., I\)) by species (\(j = 1, ..., J\)) matrix \(Y_{ij}\) as a function of environmental covariates \(X_{in}\)(\(n=1,...,N\) covariates), and the species-species covariance matrix \(\Sigma\) accounts for correlations in \(e_{ij}\):
\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + e_{ij}\]with \(g(.)\) as link function. For the multivariate probit model, the link function is:
\[Y_{ij}=1(Z_{ij} > 0)\]The probability to observe the occurrence vector \(\bf{Y_i}\) is:
\[Pr(\bf{Y}_i|\bf{X}_i\beta, \Sigma) = \int_{A_{iJ}}...\int_{A_{i1}} \phi_J(\bf{Y}_i^{\ast};\bf{X}_i\beta, \Sigma) dY_{i1}^{\ast}... dY_{iJ}^{\ast}\]in the interval \(A_{ij}\) with \((-\inf, 0]\) if \(Y_{ij}=0\) and \( [0, +\inf) \) if \(Y_{ij}=1\).
and \(\phi\) being the density function of the multivariate normal distribution.
The probability of \(\bf{Y_i}\) requires to integrate over \(\bf{Y_i^{\ast}}\) which has no closed analytical expression for more than two species which makes the evaluation of the likelihood computationally costly and needs a numerical approximation. The previous equation can be expressed more generally as:
\[ \mathcal{L}(\beta, \Sigma; \bf{Y}_i, \bf{X}_i) = \int_{\Omega} \prod_{j=1}^J Pr(Y_{ij}|\bf{X}_i\beta+\zeta) Pr(\zeta|\Sigma) d\zeta \]sjSDM
approximates this integral by \(M\) Monte-Carlo samples from the multivariate normal species-species covariance.
After integrating out the covariance term, the remaining part of the likelihood can be calculated as in an univariate case and the average
of the \(M\) samples are used to get an approximation of the integral:
with \( \zeta_m \sim MVN(0, \Sigma)\).
sjSDM
uses 'PyTorch' to run optionally the model on the graphical processing unit (GPU). Python dependencies needs to be
installed before being able to use the sjSDM
function. We provide a function which installs automatically python and the python dependencies.
See install_sjSDM
, vignette("Dependencies", package = "sjSDM")
See Pichler and Hartig, 2020 for benchmark results.
Supported distributions
Currently supported distributions and link functions:
Space
We can extend the model to account for spatial auto-correlation between the sites by:
\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + \Sigma^{M}_{m=1}S_{im}\alpha_{mj} + e_{ij}\]There are two ways to generate spatial predictors \(S\):
trend surface model - using spatial coordinates in a polynomial:
linear(data=Coords, ~0+poly(X, Y, degree = 2))
eigenvector spatial filtering - using spatial eigenvectors. Spatial eigenvectors can be generated by the
generateSpatialEV
function:SPV = generateSpatialEV(Coords)
Then we use, for example, the first 20 spatial eigenvectors:
linear(data=SPV[ ,1:20], ~0+.)
It is important to set the intercept to 0 in the spatial term (e.g. via ~0+.
) because the intercept is already set in the environmental object.
Installation
install_sjSDM
should be theoretically able to install conda and 'PyTorch' automatically. If sjSDM
still does not work after reloading RStudio, you can try to solve this on your following our trouble shooting guide installation_help
.
If the problem remains, please create an issue on issue tracker with a copy of the install_diagnostic
output as a quote.
Value
An S3 class of type 'sjSDM' including the following components:
cl |
Model call |
formula |
Formula object for environmental covariates. |
names |
Names of environmental covariates. |
species |
Names of species (can be |
get_model |
Method which builds and returns the underlying 'python' model. |
logLik |
negative log-Likelihood of the model and the regularization loss. |
model |
The actual model. |
settings |
List of model settings, see arguments of |
family |
Response family. |
time |
Runtime. |
data |
List of Y, X (and spatial) model matrices. |
sessionInfo |
Output of |
weights |
List of model coefficients (environmental (and spatial)). |
sigma |
Lower triangular weight matrix for the covariance matrix. |
history |
History of iteration losses. |
se |
Matrix of standard errors, if |
Implemented S3 methods include summary.sjSDM
, plot.sjSDM
, print.sjSDM
, predict.sjSDM
, and coef.sjSDM
. For other methods, see section 'See Also'.
sjSDM.tune
returns an S3 object of class 'sjSDM', see above for information about values.
Author(s)
Maximilian Pichler
References
Chen, D., Xue, Y., & Gomes, C. P. (2018). End-to-end learning for the deep multivariate probit model. arXiv preprint arXiv:1803.08591.
Pichler, M., & Hartig, F. (2021). A new joint species distribution model for faster and more accurate inference of species associations from big community data. Methods in Ecology and Evolution, 12(11), 2159-2173.
See Also
getCor
, getCov
, update.sjSDM
, sjSDM_cv
, DNN
, plot.sjSDM
, print.sjSDM
, predict.sjSDM
, coef.sjSDM
, summary.sjSDM
, simulate.sjSDM
, getSe
, anova.sjSDM
, importance
Examples
## Not run:
# Basic workflow:
## simulate community:
com = simulate_SDM(env = 3L, species = 7L, sites = 100L)
## fit model:
model = sjSDM(Y = com$response,env = com$env_weights, iter = 50L)
# increase iter for your own data
coef(model)
summary(model)
getCov(model)
## plot results
species=c("sp1","sp2","sp3","sp4","sp5","sp6","sp7")
group=c("mammal","bird","fish","fish","mammal","amphibian","amphibian")
group = data.frame(species=species,group=group)
plot(model,group=group)
## calculate post-hoc p-values:
p = getSe(model)
summary(p)
## or turn on the option in the sjSDM function:
model = sjSDM(Y = com$response, env = com$env_weights, se = TRUE,
family = binomial("probit"),
iter = 2L)
summary(model)
## fit model with interactions:
model = sjSDM(Y = com$response,
env = linear(data = com$env_weights, formula = ~X1:X2 + X3),
se = TRUE,
iter = 2L) # increase iter for your own data
summary(model)
## without intercept:
model = update(model, env_formula = ~0+X1:X2 + X3)
summary(model)
## predict with model:
preds = predict(model, newdata = com$env_weights)
## calculate R-squared:
R2 = Rsquared(model)
print(R2)
# With spatial terms:
## linear spatial model
XY = matrix(rnorm(200), 100, 2)
model = sjSDM(Y = com$response, env = linear(com$env_weights),
spatial = linear(XY, ~0+X1:X2),
iter = 50L) # increase iter for your own data
summary(model)
predict(model, newdata = com$env_weights, SP = XY)
R2 = Rsquared(model)
print(R2)
## Using spatial eigenvectors as predictors to account
## for spatial autocorrelation is a common approach:
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights),
spatial = linear(SPV, ~0+., lambda = 0.1),
iter = 50L) # increase iter for your own data
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)
## Visualize internal meta-community structure
an = anova(model)
plot(an, internal=TRUE)
## non-linear(deep neural network) model
model = sjSDM(Y = com$response, env = linear(com$env_weights),
spatial = DNN(SPV,hidden = c(5L, 5L), ~0+.),
iter = 2L) # increase iter for your own data
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)
# Regularization
## lambda is the regularization strength
## alpha weights the lasso or ridge penalty:
## - alpha = 0 --> pure lasso
## - alpha = 1.0 --> pure ridge
model = sjSDM(Y = com$response,
# mix of lasso and ridge
env = linear(com$env_weights, lambda = 0.01, alpha = 0.5),
# we can do the same for the species-species associations
biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
iter = 2L) # increase iter for your own data
summary(model)
coef(model)
getCov(model)
# Anova
com = simulate_SDM(env = 3L, species = 15L, sites = 200L, correlation = TRUE)
XY = matrix(rnorm(400), 200, 2)
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights),
spatial = linear(SPV, ~0+.),
iter = 50L) # increase iter for your own data
result = anova(model)
print(result)
plot(result)
## visualize meta-community structure
plot(result, internal=TRUE)
# Deep neural network
## we can fit also a deep neural network instead of a linear model:
model = sjSDM(Y = com$response,
env = DNN(com$env_weights, hidden = c(10L, 10L, 10L)),
iter = 2L) # increase iter for your own data
summary(model)
getCov(model)
pred = predict(model, newdata = com$env_weights)
## extract weights
weights = getWeights(model)
## we can also assign weights:
setWeights(model, weights)
## with regularization:
model = sjSDM(Y = com$response,
# mix of lasso and ridge
env = DNN(com$env_weights, lambda = 0.01, alpha = 0.5),
# we can do the same for the species-species associations
biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
iter = 2L) # increase iter for your own data
getCov(model)
getWeights(model)
## End(Not run)