trainByCrossValid {enmSdmX}    R Documentation

Calibrate a distribution/niche model using cross-validation

Description

This function is an extension of any of the trainXYZ functions for calibrating species distribution and ecological niche models. It uses a trainXYZ function to calibrate and evaluate a suite of models via cross-validation. The models are evaluated against withheld data to determine the optimal settings for a "final" model calibrated with all available data. The function returns a set of models and/or a table with statistics on each model. The statistics represent various measures of model accuracy, calculated separately against training and test sites.

Usage

trainByCrossValid(
  data,
  resp = names(data)[1],
  preds = names(data)[2:ncol(data)],
  folds = dismo::kfold(data),
  trainFx = enmSdmX::trainGLM,
  ...,
  weightEvalTrain = TRUE,
  weightEvalTest = TRUE,
  na.rm = FALSE,
  outputModels = TRUE,
  verbose = 0
)

Arguments

data

Data frame or matrix. Response variable and environmental predictors (and no other fields) for presences and non-presence sites.

resp

Character or integer. Name or column index of response variable. Default is to use the first column in data.

preds

Character vector or integer vector. Names of columns or column indices of predictors. Default is to use the second and subsequent columns in data as predictors.

folds

Either a numeric vector, or a matrix or data frame, specifying which rows in data belong to which folds:

  • If a vector, there must be one value per row in data. If the vector contains K unique values, then K models will be trained. Each model uses all of the data except for rows that match a particular value in the folds vector. For example, if folds = c(1, 1, 1, 2, 2, 2, 3, 3, 3), then three models will be trained: one with all rows matching 2s and 3s, one with all rows matching 1s and 3s, and one with all rows matching 1s and 2s. Each model will be evaluated against its training data and against its withheld data. Use NA to exclude rows from all training and testing. The default is to construct 5 folds of roughly equal size.

  • If a matrix or data frame, there must be one row per row in data. Each column corresponds to a different model to be trained. A given column should contain only two unique values, plus possibly NAs. Of the two values, the lesser identifies the calibration (training) data and the greater identifies the evaluation (test) data. Rows with NAs are ignored and used in neither training nor testing. For example, a column could contain 1s, 2s, and NAs: rows with 1s are used as training data, rows with 2s as test data, and rows with NAs are dropped. The NA flag is useful for creating spatially structured cross-validation folds in which training and test sites are separated (spatially) by censored (ignored) data.
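Both forms of the folds argument can be sketched as below. This is a minimal, hypothetical illustration (six data rows, two models in the matrix form); the actual fold assignments would depend on your data:

```r
# vector form: assign each row of 'data' to one of 3 folds at random
# (requires the dismo package, as in the default)
foldsVec <- dismo::kfold(data, k = 3)

# matrix form: one column per model;
# 1 = training row, 2 = test row, NA = row ignored for that model
foldsMat <- cbind(
    c(1, 1, 1, NA, 2, 2),  # model 1: rows 1-3 train, rows 5-6 test, row 4 censored
    c(2, 2, NA, 1, 1, 1)   # model 2: rows 4-6 train, rows 1-2 test, row 3 censored
)
```

The matrix form is what you would construct by hand for spatially structured cross-validation, with NAs marking a buffer of censored sites between training and test regions.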

trainFx

Function, the name of the trainXYZ function to use. Currently the supported functions/algorithms are trainBRT, trainGAM, trainGLM, trainMaxEnt, trainMaxNet, trainNS, and trainRF.

...

Arguments to pass to the "trainXYZ" function.

weightEvalTrain

Logical, if TRUE (default) and an argument named w is specified in ..., then evaluation statistics that support weighting will use the weights specified by w for the "train" version of evaluation statistics. If FALSE, sites will not be weighted. Note that this setting applies only to the calculation of evaluation statistics, not to model calibration; if w is supplied, it will still be used for model calibration.

weightEvalTest

Logical, if TRUE (default) and an argument named w is specified in ..., then evaluation statistics that support weighting will use the weights specified by w for the "test" version of evaluation statistics. If FALSE, sites will not be weighted. Note that this setting applies only to the calculation of evaluation statistics; if w is supplied, it will still be used for model calibration.

na.rm

Logical, if TRUE then remove NA predictions before calculating evaluation statistics. If FALSE (default), propagate NAs (meaning that if predictions contain NAs, the evaluation statistics will most likely also be NA).

outputModels

If TRUE, then return all models (in addition to tables reporting tuning parameters and evaluation metrics). WARNING: Depending on the type of model and the amount of data, returning all models may produce objects that are very large in memory.

verbose

Numeric. If 0, show no progress updates. If > 0, show minimal progress updates for this function only. If > 1, show detailed progress for this function. If > 2, show detailed progress for this function plus detailed progress for the trainXYZ function.

Details

In some cases models do not converge (e.g., boosted regression trees and generalized additive models sometimes suffer from this issue). In this case the model is skipped, but a data frame with the k-fold and the model number within the fold is recorded in the $meta element of the output. If all models converged, this data frame will be empty.
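After a run, the $meta element can be inspected for any convergence failures. A minimal sketch, assuming out is a (hypothetical) list returned by trainByCrossValid(); the exact columns of $meta depend on the models run:

```r
# 'out' is a hypothetical result from trainByCrossValid();
# per the Details above, skipped (non-converged) models are noted in out$meta
str(out$meta)  # inspect fold/model records for any convergence failures
```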

Value

A list object with several named elements, including $meta (see Details) and, if outputModels is TRUE, the models themselves.

References

Fielding, A.H. and J.F. Bell. 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environmental Conservation 24:38-49. doi:10.1017/S0376892997000088

Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., and Bretagnolle, V. 2014. Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Global Ecology and Biogeography 23:811-820. doi:10.1111/geb.12161

Radosavljevic, A. and Anderson, R.P. 2014. Making better Maxent models of species distributions: complexity, overfitting and evaluation. Journal of Biogeography 41:629-643. doi:10.1111/jbi.12227

See Also

summaryByCrossValid, trainBRT, trainGAM, trainGLM, trainMaxEnt, trainMaxNet, trainNS, trainRF

Examples

# The examples below show a very basic modeling workflow. They have been
# designed to run fast, not to produce accurate, defensible models.
# The general idea is to calibrate a series of models and evaluate them
# against a withheld set of data. One can then use the statistics from the
# top models to better select a "final" model.

## Not run: 
# Running the entire set of commands can take a few minutes. This can
# be sped up by increasing the number of cores used. The examples below use
# one core, but you can change that argument according to your machine's
# capabilities.

library(sf)
library(terra)
set.seed(123)

### setup data
##############

# environmental rasters
rastFile <- system.file('extdata/madClim.tif', package='enmSdmX')
madClim <- rast(rastFile)

# coordinate reference system
wgs84 <- getCRS('WGS84')

# lemur occurrence data
data(lemurs)
occs <- lemurs[lemurs$species == 'Eulemur fulvus', ]
occs <- vect(occs, geom=c('longitude', 'latitude'), crs=wgs84)

occs <- elimCellDuplicates(occs, madClim)

occEnv <- extract(madClim, occs, ID = FALSE)
occEnv <- occEnv[complete.cases(occEnv), ]
	
# create background sites (using just 1000 to speed things up!)
bgEnv <- terra::spatSample(madClim, 3000)
bgEnv <- bgEnv[complete.cases(bgEnv), ]
bgEnv <- bgEnv[sample(nrow(bgEnv), 1000), ]

# collate occurrences and background sites
presBg <- data.frame(
   presBg = c(
      rep(1, nrow(occEnv)),
      rep(0, nrow(bgEnv))
   )
)

env <- rbind(occEnv, bgEnv)
env <- cbind(presBg, env)

predictors <- c('bio1', 'bio12')

# using "vector" form of "folds" argument
folds <- dismo::kfold(env, 3) # just 3 folds (for speed)

### calibrate models
####################

cores <- 1 # increase this to go faster, if your computer handles it

## MaxEnt
mxx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainMaxEnt,
	regMult = 1:2, # too few values for valid model, but fast!
	verbose = 1,
	cores = cores
)

# summarize MaxEnt feature sets and regularization across folds
summaryByCrossValid(mxx)

## MaxNet
mnx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainMaxNet,
	regMult = 1:2, # too few values for valid model, but fast!
	verbose = 1,
	cores = cores
)

# summarize MaxNet feature sets and regularization across folds
summaryByCrossValid(mnx)

## generalized linear models
glx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainGLM,
	verbose = 1,
	cores = cores
)

# summarize GLM terms in best models
summaryByCrossValid(glx)

## generalized additive models
gax <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainGAM,
	verbose = 1,
	cores = cores
)

# summarize GAM terms in best models
summaryByCrossValid(gax)

## natural splines
nsx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainNS,
	df = 1:2,
	verbose = 1,
	cores = cores
)

# summarize NS terms in best models
summaryByCrossValid(nsx)

## boosted regression trees
brtx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainBRT,
	learningRate = c(0.001, 0.0001), # too few values for reliable model(?)
	treeComplexity = c(2, 4), # too few values for reliable model, but fast
	minTrees = 1000,
	maxTrees = 1500, # too small for reliable model(?), but fast
	tryBy = 'treeComplexity',
	anyway = TRUE, # return models that did not converge
	verbose = 1,
	cores = cores
)

# summarize BRT parameters across best models
summaryByCrossValid(brtx)

## random forests
rfx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainRF,
	verbose = 1,
	cores = cores
)

# summarize RF parameters in best models
summaryByCrossValid(rfx)


## End(Not run)

[Package enmSdmX version 1.1.2 Index]