sjSDM {sjSDM}R Documentation

Fitting scalable joint Species Distribution Models (sjSDM)

Description

sjSDM is used to fit joint Species Distribution models (jSDMs) using the central processing unit (CPU) or the graphical processing unit (GPU). The default is a multivariate probit model based on a Monte-Carlo approximation of the joint likelihood. sjSDM can be used to fit linear but also deep neural networks and supports the well known formula syntax.

Usage

sjSDM(
  Y = NULL,
  env = NULL,
  biotic = bioticStruct(),
  spatial = NULL,
  family = stats::binomial("probit"),
  iter = 100L,
  step_size = NULL,
  learning_rate = 0.01,
  se = FALSE,
  sampling = 100L,
  parallel = 0L,
  control = sjSDMControl(),
  device = "cpu",
  dtype = "float32",
  seed = 758341678
)

sjSDM.tune(object)

Arguments

Y

matrix of species occurrences/responses in range

env

matrix of environmental predictors, object of type linear or DNN

biotic

defines biotic (species-species associations) structure, object of type bioticStruct

spatial

defines spatial structure, object of type linear or DNN

family

error distribution with link function, see details for supported distributions

iter

number of fitting iterations

step_size

batch size for stochastic gradient descent, if NULL then step_size is set to: step_size = 0.1*nrow(X)

learning_rate

learning rate for Adamax optimizer

se

calculate standard errors for environmental coefficients

sampling

number of sampling steps for Monte Carlo integration

parallel

number of cpu cores for the data loader, only necessary for large datasets

control

control parameters for optimizer, see sjSDMControl

device

which device to be used, "cpu" or "gpu"

dtype

which data type, most GPUs support only 32 bit floats.

seed

seed for random operations

object

object of type sjSDM_cv

Details

The function fits per default a multivariate probit model via Monte-Carlo integration (see Chen et al., 2018) of the joint likelihood for all species.

Model description

The most common jSDM structure describes the site (\(i = 1, ..., I\)) by species (\(j = 1, ..., J\)) matrix \(Y_{ij}\) as a function of environmental covariates \(X_{in}\)(\(n=1,...,N\) covariates), and the species-species covariance matrix \(\Sigma\) accounts for correlations in \(e_{ij}\):

\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + e_{ij}\]

with \(g(.)\) as link function. For the multivariate probit model, the link function is:

\[Y_{ij}=1(Z_{ij} > 0)\]

The probability to observe the occurrence vector \(\bf{Y_i}\) is:

\[Pr(\bf{Y}_i|\bf{X}_i\beta, \Sigma) = \int_{A_{iJ}}...\int_{A_{i1}} \phi_J(\bf{Y}_i^{\ast};\bf{X}_i\beta, \Sigma) dY_{i1}^{\ast}... dY_{iJ}^{\ast}\]

in the interval \(A_{ij}\) with \((-\inf, 0]\) if \(Y_{ij}=0\) and \( [0, +\inf) \) if \(Y_{ij}=1\).

and \(\phi\) being the density function of the multivariate normal distribution.

The probability of \(\bf{Y_i}\) requires to integrate over \(\bf{Y_i^{\ast}}\) which has no closed analytical expression for more than two species which makes the evaluation of the likelihood computationally costly and needs a numerical approximation. The previous equation can be expressed more generally as:

\[ \mathcal{L}(\beta, \Sigma; \bf{Y}_i, \bf{X}_i) = \int_{\Omega} \prod_{j=1}^J Pr(Y_{ij}|\bf{X}_i\beta+\zeta) Pr(\zeta|\Sigma) d\zeta \]

sjSDM approximates this integral by \(M\) Monte-Carlo samples from the multivariate normal species-species covariance. After integrating out the covariance term, the remaining part of the likelihood can be calculated as in an univariate case and the average of the \(M\) samples are used to get an approximation of the integral:

\[ \mathcal{L}(\beta, \Sigma; \bf{Y}_i, \bf{X}_i) \approx \frac{1}{M} \Sigma_{m=1}^M \prod_{j=1}^J Pr(Y_{ij}|\bf{X}_i\beta+\zeta_m)\]

with \( \zeta_m \sim MVN(0, \Sigma)\).

sjSDM uses 'PyTorch' to run optionally the model on the graphical processing unit (GPU). Python dependencies needs to be installed before being able to use the sjSDM function. We provide a function which installs automatically python and the python dependencies. See install_sjSDM, vignette("Dependencies", package = "sjSDM")

See Pichler and Hartig, 2020 for benchmark results.

Supported distributions

Currently supported distributions and link functions:

Space

We can extend the model to account for spatial auto-correlation between the sites by:

\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + \Sigma^{M}_{m=1}S_{im}\alpha_{mj} + e_{ij}\]

There are two ways to generate spatial predictors \(S\):

It is important to set the intercept to 0 in the spatial term (e.g. via ~0+.) because the intercept is already set in the environmental object.

Installation

install_sjSDM should be theoretically able to install conda and 'PyTorch' automatically. If sjSDM still does not work after reloading RStudio, you can try to solve this on your following our trouble shooting guide installation_help. If the problem remains, please create an issue on issue tracker with a copy of the install_diagnostic output as a quote.

Value

An S3 class of type 'sjSDM' including the following components:

cl

Model call

formula

Formula object for environmental covariates.

names

Names of environmental covariates.

species

Names of species (can be NULL if columns of Y are not named).

get_model

Method which builds and returns the underlying 'python' model.

logLik

negative log-Likelihood of the model and the regularization loss.

model

The actual model.

settings

List of model settings, see arguments of sjSDM.

family

Response family.

time

Runtime.

data

List of Y, X (and spatial) model matrices.

sessionInfo

Output of sessionInfo.

weights

List of model coefficients (environmental (and spatial)).

sigma

Lower triangular weight matrix for the covariance matrix.

history

History of iteration losses.

se

Matrix of standard errors, if se = FALSE the field 'se' is NULL.

Implemented S3 methods include summary.sjSDM, plot.sjSDM, print.sjSDM, predict.sjSDM, and coef.sjSDM. For other methods, see section 'See Also'.

sjSDM.tune returns an S3 object of class 'sjSDM', see above for information about values.

Author(s)

Maximilian Pichler

References

Chen, D., Xue, Y., & Gomes, C. P. (2018). End-to-end learning for the deep multivariate probit model. arXiv preprint arXiv:1803.08591.

Pichler, M., & Hartig, F. (2021). A new joint species distribution model for faster and more accurate inference of species associations from big community data. Methods in Ecology and Evolution, 12(11), 2159-2173.

See Also

getCor, getCov, update.sjSDM, sjSDM_cv, DNN, plot.sjSDM, print.sjSDM, predict.sjSDM, coef.sjSDM, summary.sjSDM, simulate.sjSDM, getSe, anova.sjSDM, importance

Examples

## Not run: 
  
# Basic workflow:
## simulate community:
com = simulate_SDM(env = 3L, species = 7L, sites = 100L)

## fit model:
model = sjSDM(Y = com$response,env = com$env_weights, iter = 50L) 
# increase iter for your own data 

coef(model)
summary(model)
getCov(model)

## plot results
species=c("sp1","sp2","sp3","sp4","sp5","sp6","sp7")
group=c("mammal","bird","fish","fish","mammal","amphibian","amphibian")
group = data.frame(species=species,group=group)
plot(model,group=group)

## calculate post-hoc p-values:
p = getSe(model)
summary(p)

## or turn on the option in the sjSDM function:
model = sjSDM(Y = com$response, env = com$env_weights, se = TRUE, 
              family = binomial("probit"), 
              iter = 2L)
summary(model)

## fit model with interactions:
model = sjSDM(Y = com$response,
              env = linear(data = com$env_weights, formula = ~X1:X2 + X3), 
              se = TRUE,
              iter = 2L) # increase iter for your own data 
summary(model)

## without intercept:
model = update(model, env_formula = ~0+X1:X2 + X3)

summary(model)

## predict with model:
preds = predict(model, newdata = com$env_weights)

## calculate R-squared:
R2 = Rsquared(model)
print(R2)

# With spatial terms:
## linear spatial model
XY = matrix(rnorm(200), 100, 2)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(XY, ~0+X1:X2),
              iter = 50L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = XY)
R2 = Rsquared(model)
print(R2)

## Using spatial eigenvectors as predictors to account 
## for spatial autocorrelation is a common approach:
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(SPV, ~0+., lambda = 0.1),
              iter = 50L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)

## Visualize internal meta-community structure
an = anova(model)
plot(an, internal=TRUE)

## non-linear(deep neural network) model
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = DNN(SPV,hidden = c(5L, 5L), ~0+.),
              iter = 2L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)


# Regularization
## lambda is the regularization strength
## alpha weights the lasso or ridge penalty:
## - alpha = 0 --> pure lasso
## - alpha = 1.0 --> pure ridge
model = sjSDM(Y = com$response, 
              # mix of lasso and ridge
              env = linear(com$env_weights, lambda = 0.01, alpha = 0.5), 
              # we can do the same for the species-species associations
              biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
              iter = 2L) # increase iter for your own data 
summary(model)
coef(model)
getCov(model)



# Anova 
com = simulate_SDM(env = 3L, species = 15L, sites = 200L, correlation = TRUE)

XY = matrix(rnorm(400), 200, 2)
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(SPV, ~0+.), 
              iter = 50L) # increase iter for your own data 
result = anova(model)
print(result)
plot(result)

## visualize meta-community structure
plot(result, internal=TRUE)


# Deep neural network
## we can fit also a deep neural network instead of a linear model:
model = sjSDM(Y = com$response,
              env = DNN(com$env_weights, hidden = c(10L, 10L, 10L)),
              iter = 2L) # increase iter for your own data 
summary(model)
getCov(model)
pred = predict(model, newdata = com$env_weights)

## extract weights
weights = getWeights(model)

## we can also assign weights:
setWeights(model, weights)

## with regularization:
model = sjSDM(Y = com$response, 
              # mix of lasso and ridge
              env = DNN(com$env_weights, lambda = 0.01, alpha = 0.5), 
              # we can do the same for the species-species associations
              biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
              iter = 2L) # increase iter for your own data 
getCov(model)
getWeights(model)

## End(Not run)

[Package sjSDM version 1.0.5 Index]