fit_hglm_occupancy_models {surveyvoi}R Documentation

Fit hierarchical generalized linear models to predict occupancy

Description

Estimate the probability of occupancy for a set of features in a set of planning units. Models are fitted as hierarchical generalized linear models that account for imperfect detection (following Royle & Link 2006) using JAGS (via runjags::run.jags()). To limit over-fitting, covariate coefficients are sampled using a Laplace prior distribution (Park & Casella 2008), which is equivalent to the L1 regularization used in machine learning contexts.
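
The link between a Laplace prior and L1 regularization can be written out briefly. This is the standard textbook result, not a transcription of the package's JAGS model: the symbols (coefficients \beta_j, rate \lambda, data y) are notation assumed here for illustration, and the way the package parameterizes the prior may differ.

```latex
% Independent Laplace (double-exponential) prior on each coefficient
p(\beta_j \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda \lvert \beta_j \rvert\right)

% Negative log-posterior, up to an additive constant C
-\log p(\beta \mid y) = -\log p(y \mid \beta) + \lambda \sum_{j} \lvert \beta_j \rvert + C
```

The second line has the form of an L1-penalized negative log-likelihood, so the posterior mode under this prior coincides with a lasso-type estimate.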

Usage

fit_hglm_occupancy_models(
  site_data,
  feature_data,
  site_detection_columns,
  site_n_surveys_columns,
  site_env_vars_columns,
  feature_survey_sensitivity_column,
  feature_survey_specificity_column,
  jags_n_samples = rep(10000, length(site_detection_columns)),
  jags_n_burnin = rep(1000, length(site_detection_columns)),
  jags_n_thin = rep(100, length(site_detection_columns)),
  jags_n_adapt = rep(1000, length(site_detection_columns)),
  jags_n_chains = rep(4, length(site_detection_columns)),
  n_folds = rep(5, length(site_detection_columns)),
  n_threads = 1,
  seed = 500,
  verbose = FALSE
)

Arguments

site_data

sf::sf() object with site data.

feature_data

base::data.frame() object with feature data.

site_detection_columns

character names of numeric columns in the argument to site_data that contain the proportion of surveys conducted within each site that detected each feature. Each column should correspond to a different feature, and contain a proportion value (between zero and one). If a site has not previously been surveyed, a value of zero should be used.

site_n_surveys_columns

character names of numeric columns in the argument to site_data that contain the total number of surveys conducted for each feature within each site. Each column should correspond to a different feature, and contain a non-negative integer number (e.g. 0, 1, 2, 3). If a site has not previously been surveyed, a value of zero should be used.

site_env_vars_columns

character names of columns in the argument to site_data that contain environmental information for fitting updated occupancy models based on possible survey outcomes. Each column should correspond to a different environmental variable, and contain numeric, factor, or character data. No missing (NA) values are permitted in these columns.

feature_survey_sensitivity_column

character name of the column in the argument to feature_data that contains probability of future surveys correctly detecting a presence of each feature in a given site (i.e. the sensitivity of the survey methodology). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column.

feature_survey_specificity_column

character name of the column in the argument to feature_data that contains probability of future surveys correctly detecting an absence of each feature in a given site (i.e. the specificity of the survey methodology). This column should have numeric values that are between zero and one. No missing (NA) values are permitted in this column.

jags_n_samples

integer number of samples to generate per chain for MCMC analyses. See documentation for the sample parameter in runjags::run.jags() for more information. Defaults to 10,000 for each feature.

jags_n_burnin

integer number of warm-up iterations for MCMC analyses. See documentation for the burnin parameter in runjags::run.jags() for more information. Defaults to 1,000 for each feature.

jags_n_thin

integer number of thinning iterations for MCMC analyses. See documentation for the thin parameter in runjags::run.jags() for more information. Defaults to 100 for each feature.

jags_n_adapt

integer number of adapting iterations for MCMC analyses. See documentation for the adapt parameter in runjags::run.jags() for more information. Defaults to 1,000 for each feature.

jags_n_chains

integer total number of chains for MCMC analyses. See documentation for the n.chains parameter in runjags::run.jags() for more information. Defaults to 4 for each fold for each feature.

n_folds

numeric number of folds to split the training data into when fitting models for each feature. Defaults to 5 for each feature.

n_threads

integer number of threads to use for parameter tuning. Defaults to 1.

seed

integer initial random number generator state for model fitting. Defaults to 500.

verbose

logical indicating if information should be printed during computations. Defaults to FALSE.
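
The detection and survey-count columns described above can be derived from raw survey records. The following is a minimal sketch in base R, assuming a hypothetical long-format table (surveys) with one row per survey visit. The table layout and all object names here are illustrative assumptions and not part of the package.

```r
# hypothetical raw survey records for one feature: one row per survey visit
# (the table layout and column names are assumptions for illustration)
surveys <- data.frame(
  site = c(1, 1, 1, 2, 2, 3),
  detected = c(1, 0, 1, 0, 0, 1)
)
all_sites <- 1:4 # site 4 has never been surveyed

# total number of surveys per site (zero for unsurveyed sites)
n_surveys <- vapply(
  all_sites, function(i) sum(surveys$site == i), integer(1)
)

# number of surveys per site that detected the feature
n_detect <- vapply(
  all_sites, function(i) sum(surveys$detected[surveys$site == i]), numeric(1)
)

# proportion of surveys with a detection, using zero for unsurveyed sites
prop_detect <- ifelse(n_surveys > 0, n_detect / n_surveys, 0)

# columns in the layout expected for one feature (names "f1", "n1" are
# arbitrary; they would be passed via site_detection_columns and
# site_n_surveys_columns)
site_cols <- data.frame(site = all_sites, f1 = prop_detect, n1 = n_surveys)
print(site_cols)
```

In practice these columns would be attached to the sf object supplied as site_data, with one pair of columns per feature.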

Details

This function (i) prepares the data for model fitting, (ii) fits the models, and (iii) assesses the performance of the models. These analyses are performed separately for each feature. For a given feature:

  1. The data are prepared for model fitting by partitioning the data using k-fold cross-validation (set via argument to n_folds). The training and evaluation folds are constructed in such a manner as to ensure that each training and evaluation fold contains at least one presence and one absence observation.

  2. A model is fit separately for each fold (see inst/jags/model.jags for model code). To assess convergence, the multi-variate potential scale reduction factor (MPSRF) statistic is calculated for each model.

  3. The performance of the cross-validation models is evaluated. Specifically, the TSS, sensitivity, and specificity statistics are calculated (if relevant, weighted by the argument to site_weights_data). These performance values are calculated using the models' training and evaluation folds. To assess convergence, the maximum MPSRF statistic for the models fit for each feature is calculated.
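
Steps 1 and 3 can be illustrated with a minimal sketch in base R. This is not the package's internal implementation: the helpers assign_folds() and calc_tss() are hypothetical, the statistics are unweighted, and each site is treated as a single binary presence/absence observation, which is a simplifying assumption.

```r
# hypothetical helper (not part of surveyvoi): assign fold ids separately
# within presences and absences, so that every fold contains both classes
# whenever each class has at least n_folds observations
assign_folds <- function(observed, n_folds) {
  folds <- integer(length(observed))
  for (cls in unique(observed)) {
    idx <- which(observed == cls)
    # shuffle the indices, then deal them out across the folds in turn
    idx <- idx[sample.int(length(idx))]
    folds[idx] <- rep_len(seq_len(n_folds), length(idx))
  }
  folds
}

# hypothetical helper (not part of surveyvoi): unweighted sensitivity,
# specificity, and TSS (sensitivity + specificity - 1) from binary
# observations and binary predictions
calc_tss <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  tp <- sum(observed == 1 & predicted == 1)
  fn <- sum(observed == 1 & predicted == 0)
  tn <- sum(observed == 0 & predicted == 0)
  fp <- sum(observed == 0 & predicted == 1)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  c(tss = sensitivity + specificity - 1,
    sensitivity = sensitivity, specificity = specificity)
}

# made-up observations and predictions for eight sites
set.seed(500)
observed  <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 1, 0, 0, 0, 1, 0, 1)

folds <- assign_folds(observed, n_folds = 2)
print(table(folds, observed))

stats <- calc_tss(observed, predicted)
print(stats)
```

Shuffling within each class before dealing out fold ids keeps the folds random while still guaranteeing that presences and absences are spread across them.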

Value

A list object containing:

models

list of list objects containing the models.

predictions

tibble::tibble() object containing predictions for each feature.

performance

tibble::tibble() object containing the performance of the best models for each feature. It contains the following columns:

feature

name of the feature.

max_mpsrf

maximum multi-variate potential scale reduction factor (MPSRF) value for the models. An MPSRF value less than 1.05 means that all coefficients in a given model have converged, and so a value less than 1.05 in this column means that all the models fit for a given feature have successfully converged.

train_tss_mean

mean TSS statistic for models calculated using training data in cross-validation.

train_tss_std

standard deviation in TSS statistics for models calculated using training data in cross-validation.

train_sensitivity_mean

mean sensitivity statistic for models calculated using training data in cross-validation.

train_sensitivity_std

standard deviation in sensitivity statistics for models calculated using training data in cross-validation.

train_specificity_mean

mean specificity statistic for models calculated using training data in cross-validation.

train_specificity_std

standard deviation in specificity statistics for models calculated using training data in cross-validation.

test_tss_mean

mean TSS statistic for models calculated using test data in cross-validation.

test_tss_std

standard deviation in TSS statistics for models calculated using test data in cross-validation.

test_sensitivity_mean

mean sensitivity statistic for models calculated using test data in cross-validation.

test_sensitivity_std

standard deviation in sensitivity statistics for models calculated using test data in cross-validation.

test_specificity_mean

mean specificity statistic for models calculated using test data in cross-validation.

test_specificity_std

standard deviation in specificity statistics for models calculated using test data in cross-validation.
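
As a minimal sketch of how the performance table might be inspected, the code below uses a placeholder data frame with made-up values standing in for results$performance. Only a subset of the columns listed above is included, and the tss_gap column is an illustrative addition rather than something the function returns.

```r
# placeholder performance table with made-up values for illustration;
# in practice this would be results$performance
performance <- data.frame(
  feature = c("f1", "f2"),
  max_mpsrf = c(1.02, 1.31),
  train_tss_mean = c(0.80, 0.75),
  test_tss_mean = c(0.55, 0.20),
  stringsAsFactors = FALSE
)

# features for which all cross-validation models converged
# (maximum MPSRF below the 1.05 threshold described above)
converged <- performance$feature[performance$max_mpsrf < 1.05]
print(converged)

# difference between training and test TSS, as a rough indicator of
# over-fitting (larger gaps suggest poorer generalization)
performance$tss_gap <- performance$train_tss_mean - performance$test_tss_mean
print(performance)
```

Features that fail the convergence check may benefit from more MCMC iterations (e.g. larger values for jags_n_samples or jags_n_burnin).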

Dependencies

This function requires the JAGS software to be installed. For information on installing the JAGS software, please consult the documentation for the rjags package.
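
A rough way to check for a JAGS installation before fitting models is sketched below. It only looks for an executable named "jags" on the system PATH, which is an assumption about how JAGS was installed; on some platforms JAGS can be installed without being on the PATH, so a FALSE result is not conclusive.

```r
# rough check: is an executable named "jags" on the system PATH?
# (a FALSE result is not conclusive, because JAGS may be installed in a
# location that is not on the PATH)
jags_on_path <- unname(nzchar(Sys.which("jags")))

if (!jags_on_path) {
  message(
    "JAGS was not found on the PATH; ",
    "runjags::testjags() can provide a more thorough report"
  )
}
```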

References

Park T & Casella G (2008) The Bayesian lasso. Journal of the American Statistical Association, 103: 681–686.

Royle JA & Link WA (2006) Generalized site occupancy models allowing for false positive and false negative errors. Ecology, 87: 835–841.

Examples

## Not run: 
# set seeds for reproducibility
set.seed(123)

# simulate data for 30 sites, 2 features, and 3 environmental variables
site_data <- simulate_site_data(n_sites = 30, n_features = 2, prop = 0.1)
feature_data <- simulate_feature_data(n_features = 2, prop = 1)

# print JAGS model code
cat(readLines(system.file("jags", "model.jags", package = "surveyvoi")),
    sep = "\n")

# fit models
# note that we use a small number of MCMC iterations so that the example
# finishes quickly; you probably want to use the defaults for real work
results <- fit_hglm_occupancy_models(
   site_data, feature_data,
   c("f1", "f2"), c("n1", "n2"), c("e1", "e2", "e3"),
   "survey_sensitivity", "survey_specificity",
   n_folds = rep(5, 2),
   jags_n_samples = rep(250, 2), jags_n_burnin = rep(250, 2),
   jags_n_thin = rep(1, 2), jags_n_adapt = rep(100, 2),
   n_threads = 1)

# print model predictions
print(results$predictions)

# print model performance
print(results$performance, width = Inf)

## End(Not run)

[Package surveyvoi version 1.0.6 Index]