fit_xgb_occupancy_models {surveyvoi}    R Documentation
Fit boosted regression tree models to predict occupancy
Description
Estimate the probability of occupancy for a set of features in a set of
planning units. Models are fitted using gradient boosted trees (via
xgboost::xgb.train()).
Usage
fit_xgb_occupancy_models(
site_data,
feature_data,
site_detection_columns,
site_n_surveys_columns,
site_env_vars_columns,
feature_survey_sensitivity_column,
feature_survey_specificity_column,
xgb_tuning_parameters,
xgb_early_stopping_rounds = rep(20, length(site_detection_columns)),
xgb_n_rounds = rep(100, length(site_detection_columns)),
n_folds = rep(5, length(site_detection_columns)),
n_threads = 1,
seed = 500,
verbose = FALSE
)
Arguments
site_data
sf::sf() object containing the site data.

feature_data
data.frame object containing the feature data.

site_detection_columns
character names of numeric columns in site_data containing the detection
data for each feature (values between 0 and 1, indicating the proportion
of surveys at each site in which the feature was detected). Each column
should correspond to a different feature.

site_n_surveys_columns
character names of numeric columns in site_data containing the total
number of surveys conducted for each feature at each site. Each column
should correspond to a different feature.

site_env_vars_columns
character names of columns in site_data containing the environmental
variables used to predict occupancy.

feature_survey_sensitivity_column
character name of the numeric column in feature_data containing the
sensitivity (true positive rate) of the survey methodology for each
feature.

feature_survey_specificity_column
character name of the numeric column in feature_data containing the
specificity (true negative rate) of the survey methodology for each
feature.

xgb_tuning_parameters
list of candidate values for the model tuning parameters (see
xgboost::xgb.train() for details on available parameters). A full grid
of parameter combinations is generated from these candidates during
tuning (see the sketch after this list).

xgb_early_stopping_rounds
numeric vector specifying the number of early stopping rounds for each
feature. Defaults to rep(20, length(site_detection_columns)).

xgb_n_rounds
numeric vector specifying the number of training rounds for each
feature. Defaults to rep(100, length(site_detection_columns)).

n_folds
numeric vector specifying the number of cross-validation folds for each
feature. Defaults to rep(5, length(site_detection_columns)).

n_threads
integer number of threads to use for processing. Defaults to 1.

seed
integer seed for the random number generator, used to ensure
reproducibility. Defaults to 500.

verbose
logical value indicating whether progress information should be printed.
Defaults to FALSE.
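To make the xgb_tuning_parameters argument concrete, here is a minimal
sketch of how a list of candidate values expands into the full set of
parameter combinations explored during tuning. It reuses the parameters
list from the Examples section below; the use of expand.grid() is
illustrative only, not the package's internal code.

# expand candidate tuning values into a full parameter grid
parameters <- list(
  eta = seq(0.1, 0.5, length.out = 3),
  lambda = 10 ^ seq(-1.0, 0.0, length.out = 3),
  objective = "binary:logistic"
)
grid <- expand.grid(parameters, stringsAsFactors = FALSE)
nrow(grid) # 3 eta values x 3 lambda values x 1 objective = 9 combinations
head(grid)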
Details
This function (i) prepares the data for model fitting, (ii) calibrates
the tuning parameters for model fitting (see xgboost::xgb.train()
for details on tuning parameters), (iii) generates predictions using
the best tuning parameters found, and (iv) assesses the performance of the
best supported models. These analyses are performed separately for each
feature. For a given feature:
The data are prepared for model fitting by partitioning them using
k-fold cross-validation (set via the n_folds argument). The training and
evaluation folds are constructed such that each fold contains at least
one presence and one absence observation (see the sketch below).
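As a minimal sketch of this kind of stratified fold construction (the
make_stratified_folds() helper is hypothetical, not the package's
internal code, and it assumes each class has at least n_folds
observations):

# assign fold ids so that presences (y == 1) and absences (y == 0)
# are both spread across all folds
make_stratified_folds <- function(y, n_folds) {
  folds <- integer(length(y))
  for (cls in c(0, 1)) {
    idx <- sample(which(y == cls)) # shuffle indices within the class
    folds[idx] <- rep_len(seq_len(n_folds), length(idx))
  }
  folds
}
# example: 30 sites with unbalanced detections
y <- rbinom(30, 1, 0.2)
table(fold = make_stratified_folds(y, n_folds = 5), y)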
A grid search method is used to tune the model parameters. The candidate
values for each parameter (specified via the xgb_tuning_parameters
argument) are used to generate a full set of parameter combinations, and
these combinations are subsequently used for tuning the models. To
account for unbalanced datasets, the scale_pos_weight parameter for
xgboost::xgboost() is calculated as the mean value across the training
folds (i.e., the number of absences divided by the number of presences
per feature; see the sketch below). For a given parameter combination,
models are fit using k-fold cross-validation (via xgboost::xgb.cv()),
using the previously mentioned training and evaluation folds, and the
True Skill Statistic (TSS) calculated using the data held out from each
fold is used to quantify performance (i.e., the "test_tss_mean" column
in the output). These models are also fitted using the
xgb_early_stopping_rounds parameter to reduce the time spent tuning.
After exploring the full set of parameter combinations, the best
combination is identified, and the associated parameter values and
models are stored for later use.

The cross-validation models associated with the best parameter
combination are used to predict the average probability that the feature
occupies each site. These predictions include sites that have been
surveyed before, as well as sites that have not.
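The scale_pos_weight calculation from the tuning step above can be
sketched as follows (illustrative only; y and make_stratified_folds()
continue from the previous sketch):

# for each training fold (observations outside fold k), divide the
# number of absences by the number of presences, then average across folds
folds <- make_stratified_folds(y, n_folds = 5)
scale_pos_weight <- mean(vapply(seq_len(5), function(k) {
  y_train <- y[folds != k]
  sum(y_train == 0) / sum(y_train == 1)
}, numeric(1)))
scale_pos_weight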
The performance of the cross-validation models is evaluated.
Specifically, the TSS, sensitivity, and specificity statistics are
calculated using the models' training and evaluation folds (see the
sketch below).
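For reference, the TSS, sensitivity, and specificity statistics reported
in the performance output can be computed from a confusion matrix as
follows (a minimal sketch; the performance_stats() helper is
hypothetical, not part of the package):

performance_stats <- function(observed, predicted) {
  tp <- sum(predicted == 1 & observed == 1) # true positives
  fn <- sum(predicted == 0 & observed == 1) # false negatives
  tn <- sum(predicted == 0 & observed == 0) # true negatives
  fp <- sum(predicted == 1 & observed == 0) # false positives
  sens <- tp / (tp + fn) # sensitivity (true positive rate)
  spec <- tn / (tn + fp) # specificity (true negative rate)
  c(sensitivity = sens, specificity = spec, tss = sens + spec - 1)
}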
Value
A list object containing:

- parameters: a list of list objects containing the best tuning
  parameters for each feature.

- predictions: a tibble::tibble() object containing predictions for each
  feature.

- performance: a tibble::tibble() object containing the performance of
  the best models for each feature. It contains the following columns:

  - feature: name of the feature.
  - train_tss_mean: mean TSS statistic for models calculated using
    training data in cross-validation.
  - train_tss_std: standard deviation in TSS statistics for models
    calculated using training data in cross-validation.
  - train_sensitivity_mean: mean sensitivity statistic for models
    calculated using training data in cross-validation.
  - train_sensitivity_std: standard deviation in sensitivity statistics
    for models calculated using training data in cross-validation.
  - train_specificity_mean: mean specificity statistic for models
    calculated using training data in cross-validation.
  - train_specificity_std: standard deviation in specificity statistics
    for models calculated using training data in cross-validation.
  - test_tss_mean: mean TSS statistic for models calculated using test
    data in cross-validation.
  - test_tss_std: standard deviation in TSS statistics for models
    calculated using test data in cross-validation.
  - test_sensitivity_mean: mean sensitivity statistic for models
    calculated using test data in cross-validation.
  - test_sensitivity_std: standard deviation in sensitivity statistics
    for models calculated using test data in cross-validation.
  - test_specificity_mean: mean specificity statistic for models
    calculated using test data in cross-validation.
  - test_specificity_std: standard deviation in specificity statistics
    for models calculated using test data in cross-validation.
Examples
## Not run:
# set seed for reproducibility
set.seed(123)
# simulate data for 30 sites, 2 features, and 3 environmental variables
site_data <- simulate_site_data(
n_sites = 30, n_features = 2, n_env_vars = 3, prop = 0.1)
feature_data <- simulate_feature_data(n_features = 2, prop = 1)
# create list of possible tuning parameters for modeling
parameters <- list(eta = seq(0.1, 0.5, length.out = 3),
lambda = 10 ^ seq(-1.0, 0.0, length.out = 3),
objective = "binary:logistic")
# fit models
# note that we use a small set of candidate tuning parameter values here
# so that the example finishes quickly; real analyses should explore a
# much larger grid of combinations
results <- fit_xgb_occupancy_models(
site_data, feature_data,
c("f1", "f2"), c("n1", "n2"), c("e1", "e2", "e3"),
"survey_sensitivity", "survey_specificity",
n_folds = rep(5, 2), xgb_early_stopping_rounds = rep(100, 2),
xgb_tuning_parameters = parameters, n_threads = 1)
# print best found model parameters
print(results$parameters)
# print model predictions
print(results$predictions)
# print model performance
print(results$performance, width = Inf)
## End(Not run)