R: Summarize distribution/niche model cross-validation object

summaryByCrossValid {enmSdmX}

R Documentation

Summarize distribution/niche model cross-validation object

Description

This function summarizes models calibrated using the trainByCrossValid function. It returns aspects of the best models across k-folds (the particular aspects depend on the kind of models used).

Usage

summaryByCrossValid(
  x,
  metric = "cbiTest",
  decreasing = TRUE,
  interceptOnly = TRUE
)

Arguments

`x`	The output from the `trainByCrossValid` function (which is a list). Note that the object must include a sublist named `tuning`.
`metric`	Metric by which to select the best model in each k-fold. This can be any of the columns that appear in the data frames in `x$tuning` (or any columns added manually), but typically is one of the following plus either `Train`, `Test`, or `Delta` (e.g., `'logLossTrain'`, `'logLossTest'`, or `'logLossDelta'`):
`'logLoss'`: Log loss.
`'cbi'`: Continuous Boyce Index (CBI). Calculated with `evalContBoyce`.
`'auc'`: Area under the receiver-operator characteristic curve (AUC). Calculated with `evalAUC`.
`'tss'`: Maximum value of the True Skill Statistic. Calculated with `evalTSS`.
`'msss'`: Sensitivity and specificity calculated at the threshold that maximizes sensitivity (true presence prediction rate) plus specificity (true absence prediction rate).
`'mdss'`: Sensitivity (se) and specificity (sp) calculated at the threshold that minimizes the difference between sensitivity and specificity.
`'minTrainPres'`: Sensitivity and specificity at the greatest threshold at which all training presences are classified as "present".
`'trainSe95'` and/or `'trainSe90'`: Sensitivity at the threshold that ensures either 95
`decreasing`	Logical. If `TRUE` (default), models in each k-fold are sorted by the value of `metric` in decreasing order, so the highest value is treated as "best" and the lowest as "worst". If `FALSE`, models are sorted in increasing order, so the lowest value of `metric` is treated as "best".
`interceptOnly`	Logical. If `TRUE` (default) and the top models in each case were intercept-only models, return an empty data frame (with a warning). If `FALSE`, return results using the first model in each fold that was not an intercept-only model. This is only used if the training function was a generalized linear model (GLM), natural splines model (NS), or generalized additive model (GAM).
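The way `metric` and `decreasing` interact can be sketched in base R. This is an illustration only, not the package's implementation: the `tuning` list below is a mock stand-in for `x$tuning`, and the `model` column and the `bestByFold` helper are hypothetical names introduced for this sketch.

```r
# Mock stand-in for x$tuning: one data frame of candidate models per fold.
# The 'model' column and all values are made up for illustration.
tuning <- list(
  data.frame(
    model = c('y ~ 1', 'y ~ bio1', 'y ~ bio1 + bio12'),
    cbiTest = c(0.10, 0.62, 0.48),
    stringsAsFactors = FALSE
  ),
  data.frame(
    model = c('y ~ 1', 'y ~ bio1', 'y ~ bio1 + bio12'),
    cbiTest = c(0.71, 0.35, 0.55),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: sort each fold by the metric and keep the top row.
bestByFold <- function(tuning, metric = 'cbiTest', decreasing = TRUE) {
  best <- lapply(tuning, function(fold) {
    ranked <- fold[order(fold[[metric]], decreasing = decreasing), , drop = FALSE]
    ranked[1, , drop = FALSE]
  })
  do.call(rbind, best)
}

bestHigh <- bestByFold(tuning)                     # highest cbiTest per fold
bestLow <- bestByFold(tuning, decreasing = FALSE)  # lowest cbiTest per fold
print(bestHigh)
print(bestLow)
```

With `decreasing = TRUE` the row with the largest value of the chosen column wins in each fold; with `decreasing = FALSE` the row with the smallest value wins.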

Value

Data frame with statistics on the best set of models across k-folds. Depending on the model algorithm, this could be:

BRTs (boosted regression trees): Learning rate, tree complexity, and bag fraction.
GLMs (generalized linear models): Frequency of use of each term in the best models.
Maxent: Frequency with which each specific combination of feature classes was used in the best models, plus the mean master regularization multiplier for each feature set.
NSs (natural splines): Data frame, one row per fold and one column per predictor, with values representing the maximum degrees of freedom used for each variable in the best model of each fold.
RFs (random forests): Data frame, one row per fold, with values representing the optimal value of numTrees and mtry (see ranger).
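As a rough illustration of what "frequency of use of each term in the best models" means for GLMs, the base R sketch below counts how often each term appears across folds. The formulas are invented for this example, and the approach is an assumption about the idea, not the function's actual code or output format.

```r
# Hypothetical formulas of the best model in each of three folds.
bestFormulas <- c(
  'presBg ~ bio1 + bio12',
  'presBg ~ bio1 + I(bio1^2)',
  'presBg ~ bio1 + bio12'
)

# Extract the term labels from each formula.
termsByFold <- lapply(bestFormulas, function(f) {
  attr(terms(as.formula(f)), 'term.labels')
})

# Count how many of the best models used each term, most frequent first.
termFreq <- sort(table(unlist(termsByFold)), decreasing = TRUE)
print(termFreq)
```

A term that appears in the best model of every fold is a stronger candidate for a final model than one that appears in only a single fold.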

Examples

# The examples below show a very basic modeling workflow. They have been
# designed to run fast, not to produce accurate, defensible models.
# The general idea is to calibrate a series of models and evaluate them
# against a withheld set of data. One can then use summaries
# of the top models to better select a "final" model.

## Not run: 
# Running the entire set of commands can take a few minutes. This can
# be sped up by increasing the number of cores used. The examples below use
# one core, but you can change that argument according to your machine's
# capabilities.

library(enmSdmX)
library(sf)
library(terra)
set.seed(123)

### setup data
##############

# environmental rasters
rastFile <- system.file('extdata/madClim.tif', package='enmSdmX')
madClim <- rast(rastFile)

# coordinate reference system
wgs84 <- getCRS('WGS84')

# lemur occurrence data
data(lemurs)
occs <- lemurs[lemurs$species == 'Eulemur fulvus', ]
occs <- vect(occs, geom=c('longitude', 'latitude'), crs=wgs84)

occs <- elimCellDuplicates(occs, madClim)

occEnv <- extract(madClim, occs, ID = FALSE)
occEnv <- occEnv[complete.cases(occEnv), ]
	
# create background sites (using just 1000 to speed things up!)
bgEnv <- terra::spatSample(madClim, 3000)
bgEnv <- bgEnv[complete.cases(bgEnv), ]
bgEnv <- bgEnv[sample(nrow(bgEnv), 1000), ]

# collate occurrences and background sites
presBg <- data.frame(
   presBg = c(
      rep(1, nrow(occEnv)),
      rep(0, nrow(bgEnv))
   )
)

env <- rbind(occEnv, bgEnv)
env <- cbind(presBg, env)

predictors <- c('bio1', 'bio12')

# using "vector" form of "folds" argument
folds <- dismo::kfold(env, 3) # just 3 folds (for speed)

### calibrate models
####################

cores <- 1 # increase this to go faster, if your computer handles it

## MaxEnt
mxx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainMaxEnt,
	regMult = 1:2, # too few values for valid model, but fast!
	verbose = 1,
	cores = cores
)

# summarize MaxEnt feature sets and regularization across folds
summaryByCrossValid(mxx)

## MaxNet
mnx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainMaxNet,
	regMult = 1:2, # too few values for valid model, but fast!
	verbose = 1,
	cores = cores
)

# summarize MaxNet feature sets and regularization across folds
summaryByCrossValid(mnx)

## generalized linear models
glx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainGLM,
	verbose = 1,
	cores = cores
)

# summarize GLM terms in best models
summaryByCrossValid(glx)

## generalized additive models
gax <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainGAM,
	verbose = 1,
	cores = cores
)

# summarize GAM terms in best models
summaryByCrossValid(gax)

## natural splines
nsx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainNS,
	df = 1:2,
	verbose = 1,
	cores = cores
)

# summarize NS terms in best models
summaryByCrossValid(nsx)

## boosted regression trees
brtx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainBRT,
	learningRate = c(0.001, 0.0001), # too few values for reliable model(?)
	treeComplexity = c(2, 4), # too few values for reliable model, but fast
	minTrees = 1000,
	maxTrees = 1500, # too small for reliable model(?), but fast
	tryBy = 'treeComplexity',
	anyway = TRUE, # return models that did not converge
	verbose = 1,
	cores = cores
)

# summarize BRT parameters across best models
summaryByCrossValid(brtx)

## random forests
rfx <- trainByCrossValid(
	data = env,
	resp = 'presBg',
	preds = c('bio1', 'bio12'),
	folds = folds,
	trainFx = trainRF,
	verbose = 1,
	cores = cores
)

# summarize RF parameters in best models
summaryByCrossValid(rfx)


## End(Not run)