R: Fitting scalable joint Species Distribution Models (sjSDM)

sjSDM {sjSDM}

R Documentation

Fitting scalable joint Species Distribution Models (sjSDM)

Description

sjSDM is used to fit joint Species Distribution models (jSDMs) using the central processing unit (CPU) or the graphical processing unit (GPU). The default is a multivariate probit model based on a Monte-Carlo approximation of the joint likelihood. sjSDM can be used to fit linear but also deep neural networks and supports the well known formula syntax.

Usage

sjSDM(
  Y = NULL,
  env = NULL,
  biotic = bioticStruct(),
  spatial = NULL,
  family = stats::binomial("probit"),
  iter = 100L,
  step_size = NULL,
  learning_rate = 0.01,
  se = FALSE,
  sampling = 100L,
  parallel = 0L,
  control = sjSDMControl(),
  device = "cpu",
  dtype = "float32",
  seed = 758341678
)

sjSDM.tune(object)

Arguments

`Y`	matrix of species occurrences/responses in range
`env`	matrix of environmental predictors, object of type `linear` or `DNN`
`biotic`	defines biotic (species-species associations) structure, object of type `bioticStruct`
`spatial`	defines spatial structure, object of type `linear` or `DNN`
`family`	error distribution with link function, see details for supported distributions
`iter`	number of fitting iterations
`step_size`	batch size for stochastic gradient descent, if `NULL` then step_size is set to: `step_size = 0.1*nrow(X)`
`learning_rate`	learning rate for Adamax optimizer
`se`	calculate standard errors for environmental coefficients
`sampling`	number of sampling steps for Monte Carlo integration
`parallel`	number of cpu cores for the data loader, only necessary for large datasets
`control`	control parameters for optimizer, see `sjSDMControl`
`device`	which device to be used, "cpu" or "gpu"
`dtype`	which data type, most GPUs support only 32 bit floats.
`seed`	seed for random operations
`object`	object of type `sjSDM_cv`

Details

The function fits per default a multivariate probit model via Monte-Carlo integration (see Chen et al., 2018) of the joint likelihood for all species.

Model description

The most common jSDM structure describes the site (\(i = 1, ..., I\)) by species (\(j = 1, ..., J\)) matrix \(Y_{ij}\) as a function of environmental covariates \(X_{in}\)(\(n=1,...,N\) covariates), and the species-species covariance matrix \(\Sigma\) accounts for correlations in \(e_{ij}\):

\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + e_{ij}\]

with \(g(.)\) as link function. For the multivariate probit model, the link function is:

\[Y_{ij}=1(Z_{ij} > 0)\]

The probability to observe the occurrence vector \(\bf{Y_i}\) is:

\[Pr(\bf{Y}_i|\bf{X}_i\beta, \Sigma) = \int_{A_{iJ}}...\int_{A_{i1}} \phi_J(\bf{Y}_i^{\ast};\bf{X}_i\beta, \Sigma) dY_{i1}^{\ast}... dY_{iJ}^{\ast}\]

in the interval \(A_{ij}\) with \((-\inf, 0]\) if \(Y_{ij}=0\) and \( [0, +\inf) \) if \(Y_{ij}=1\).

and \(\phi\) being the density function of the multivariate normal distribution.

The probability of \(\bf{Y_i}\) requires to integrate over \(\bf{Y_i^{\ast}}\) which has no closed analytical expression for more than two species which makes the evaluation of the likelihood computationally costly and needs a numerical approximation. The previous equation can be expressed more generally as:

\[ \mathcal{L}(\beta, \Sigma; \bf{Y}_i, \bf{X}_i) = \int_{\Omega} \prod_{j=1}^J Pr(Y_{ij}|\bf{X}_i\beta+\zeta) Pr(\zeta|\Sigma) d\zeta \]

sjSDM approximates this integral by \(M\) Monte-Carlo samples from the multivariate normal species-species covariance. After integrating out the covariance term, the remaining part of the likelihood can be calculated as in an univariate case and the average of the \(M\) samples are used to get an approximation of the integral:

\[ \mathcal{L}(\beta, \Sigma; \bf{Y}_i, \bf{X}_i) \approx \frac{1}{M} \Sigma_{m=1}^M \prod_{j=1}^J Pr(Y_{ij}|\bf{X}_i\beta+\zeta_m)\]

with \( \zeta_m \sim MVN(0, \Sigma)\).

sjSDM uses 'PyTorch' to run optionally the model on the graphical processing unit (GPU). Python dependencies needs to be installed before being able to use the sjSDM function. We provide a function which installs automatically python and the python dependencies. See install_sjSDM, vignette("Dependencies", package = "sjSDM")

See Pichler and Hartig, 2020 for benchmark results.

Supported distributions

Currently supported distributions and link functions:

binomial: "probit" or "logit"
poisson: "log"
"nbinom": "log"
gaussian: "identity"

Space

We can extend the model to account for spatial auto-correlation between the sites by:

\[g(Z_{ij}) = \beta_{j0} + \Sigma^{N}_{n=1}X_{in}\beta_{nj} + \Sigma^{M}_{m=1}S_{im}\alpha_{mj} + e_{ij}\]

There are two ways to generate spatial predictors \(S\):

trend surface model - using spatial coordinates in a polynomial:

linear(data=Coords, ~0+poly(X, Y, degree = 2))
eigenvector spatial filtering - using spatial eigenvectors. Spatial eigenvectors can be generated by the generateSpatialEV function:

SPV = generateSpatialEV(Coords)

Then we use, for example, the first 20 spatial eigenvectors:

linear(data=SPV[ ,1:20], ~0+.)

It is important to set the intercept to 0 in the spatial term (e.g. via ~0+.) because the intercept is already set in the environmental object.

Installation

install_sjSDM should be theoretically able to install conda and 'PyTorch' automatically. If sjSDM still does not work after reloading RStudio, you can try to solve this on your following our trouble shooting guide installation_help. If the problem remains, please create an issue on issue tracker with a copy of the install_diagnostic output as a quote.

Value

An S3 class of type 'sjSDM' including the following components:

`cl`	Model call
`formula`	Formula object for environmental covariates.
`names`	Names of environmental covariates.
`species`	Names of species (can be `NULL` if columns of Y are not named).
`get_model`	Method which builds and returns the underlying 'python' model.
`logLik`	negative log-Likelihood of the model and the regularization loss.
`model`	The actual model.
`settings`	List of model settings, see arguments of `sjSDM`.
`family`	Response family.
`time`	Runtime.
`data`	List of Y, X (and spatial) model matrices.
`sessionInfo`	Output of `sessionInfo`.
`weights`	List of model coefficients (environmental (and spatial)).
`sigma`	Lower triangular weight matrix for the covariance matrix.
`history`	History of iteration losses.
`se`	Matrix of standard errors, if `se = FALSE` the field 'se' is `NULL`.

Implemented S3 methods include summary.sjSDM, plot.sjSDM, print.sjSDM, predict.sjSDM, and coef.sjSDM. For other methods, see section 'See Also'.

sjSDM.tune returns an S3 object of class 'sjSDM', see above for information about values.

Author(s)

Maximilian Pichler

References

Chen, D., Xue, Y., & Gomes, C. P. (2018). End-to-end learning for the deep multivariate probit model. arXiv preprint arXiv:1803.08591.

Pichler, M., & Hartig, F. (2021). A new joint species distribution model for faster and more accurate inference of species associations from big community data. Methods in Ecology and Evolution, 12(11), 2159-2173.

Examples

## Not run: 
  
# Basic workflow:
## simulate community:
com = simulate_SDM(env = 3L, species = 7L, sites = 100L)

## fit model:
model = sjSDM(Y = com$response,env = com$env_weights, iter = 50L) 
# increase iter for your own data 

coef(model)
summary(model)
getCov(model)

## plot results
species=c("sp1","sp2","sp3","sp4","sp5","sp6","sp7")
group=c("mammal","bird","fish","fish","mammal","amphibian","amphibian")
group = data.frame(species=species,group=group)
plot(model,group=group)

## calculate post-hoc p-values:
p = getSe(model)
summary(p)

## or turn on the option in the sjSDM function:
model = sjSDM(Y = com$response, env = com$env_weights, se = TRUE, 
              family = binomial("probit"), 
              iter = 2L)
summary(model)

## fit model with interactions:
model = sjSDM(Y = com$response,
              env = linear(data = com$env_weights, formula = ~X1:X2 + X3), 
              se = TRUE,
              iter = 2L) # increase iter for your own data 
summary(model)

## without intercept:
model = update(model, env_formula = ~0+X1:X2 + X3)

summary(model)

## predict with model:
preds = predict(model, newdata = com$env_weights)

## calculate R-squared:
R2 = Rsquared(model)
print(R2)

# With spatial terms:
## linear spatial model
XY = matrix(rnorm(200), 100, 2)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(XY, ~0+X1:X2),
              iter = 50L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = XY)
R2 = Rsquared(model)
print(R2)

## Using spatial eigenvectors as predictors to account 
## for spatial autocorrelation is a common approach:
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(SPV, ~0+., lambda = 0.1),
              iter = 50L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)

## Visualize internal meta-community structure
an = anova(model)
plot(an, internal=TRUE)

## non-linear(deep neural network) model
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = DNN(SPV,hidden = c(5L, 5L), ~0+.),
              iter = 2L) # increase iter for your own data 
summary(model)
predict(model, newdata = com$env_weights, SP = SPV)


# Regularization
## lambda is the regularization strength
## alpha weights the lasso or ridge penalty:
## - alpha = 0 --> pure lasso
## - alpha = 1.0 --> pure ridge
model = sjSDM(Y = com$response, 
              # mix of lasso and ridge
              env = linear(com$env_weights, lambda = 0.01, alpha = 0.5), 
              # we can do the same for the species-species associations
              biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
              iter = 2L) # increase iter for your own data 
summary(model)
coef(model)
getCov(model)



# Anova 
com = simulate_SDM(env = 3L, species = 15L, sites = 200L, correlation = TRUE)

XY = matrix(rnorm(400), 200, 2)
SPV = generateSpatialEV(XY)
model = sjSDM(Y = com$response, env = linear(com$env_weights), 
              spatial = linear(SPV, ~0+.), 
              iter = 50L) # increase iter for your own data 
result = anova(model)
print(result)
plot(result)

## visualize meta-community structure
plot(result, internal=TRUE)


# Deep neural network
## we can fit also a deep neural network instead of a linear model:
model = sjSDM(Y = com$response,
              env = DNN(com$env_weights, hidden = c(10L, 10L, 10L)),
              iter = 2L) # increase iter for your own data 
summary(model)
getCov(model)
pred = predict(model, newdata = com$env_weights)

## extract weights
weights = getWeights(model)

## we can also assign weights:
setWeights(model, weights)

## with regularization:
model = sjSDM(Y = com$response, 
              # mix of lasso and ridge
              env = DNN(com$env_weights, lambda = 0.01, alpha = 0.5), 
              # we can do the same for the species-species associations
              biotic = bioticStruct(lambda = 0.01, alpha = 0.5),
              iter = 2L) # increase iter for your own data 
getCov(model)
getWeights(model)

## End(Not run)

[Package sjSDM version 1.0.5 Index]