trainBRT {enmSdmX}		R Documentation

Calibrate a boosted regression tree (generalized boosting machine) model

Description

This function calibrates a boosted regression tree (or gradient boosting machine) model, and is a wrapper for gbm. The function uses a grid search to assess the best combination of learning rate, tree depth, and bag fraction based on cross-validated deviance. If a particular combination of parameters leads to an unconverged model, the function tries again using slightly different parameters. Its output is any or all of: a table with the cross-validated deviance of each evaluated model, all evaluated models, and/or the single model with the lowest deviance.
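
For example, the call below (a minimal sketch, assuming env, 'presBg', and predictors are defined as in the Examples section) searches a grid of 2 learning rates * 3 tree complexities * 2 bag fractions = 12 parameter combinations:

brt <- trainBRT(
  data = env,
  resp = 'presBg',
  preds = predictors,
  learningRate = c(0.0001, 0.001),
  treeComplexity = c(1, 3, 5),
  bagFraction = c(0.5, 0.7)
)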

Usage

trainBRT(
  data,
  resp = names(data)[1],
  preds = names(data)[2:ncol(data)],
  learningRate = c(1e-04, 0.001, 0.01),
  treeComplexity = c(5, 3, 1),
  bagFraction = 0.6,
  minTrees = 1000,
  maxTrees = 8000,
  tries = 5,
  tryBy = c("learningRate", "treeComplexity", "maxTrees", "stepSize"),
  w = TRUE,
  anyway = FALSE,
  family = "bernoulli",
  out = "model",
  cores = 1,
  verbose = FALSE,
  ...
)

Arguments

data

Data frame.

resp

Response variable. This is either the name of the column in data or an integer indicating the column in data that has the response variable. The default is to use the first column in data as the response.

preds

Character vector or integer vector. Names of columns or column indices of the predictors. The default is to use the second and subsequent columns in data.
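
For example, the two calls below are equivalent ways of specifying the response and predictors (a sketch, assuming the response is in column 1 and the predictors are in columns 2 and 3 of env, as in the Examples section):

# by column name
brt1 <- trainBRT(data = env, resp = 'presBg', preds = c('bio1', 'bio12'))
# by column index
brt2 <- trainBRT(data = env, resp = 1, preds = 2:3)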

learningRate

Numeric. Learning rate at which model learns from successive trees (Elith et al. 2008 recommend 0.0001 to 0.1).

treeComplexity

Positive integer. Tree complexity: depth of branches in a single tree (1 to 16).

bagFraction

Numeric in the range [0, 1]. Bag fraction: proportion of data used for training in cross-validation (Elith et al. 2008 recommend 0.5 to 0.7).

minTrees

Positive integer. Minimum number of trees to be scored as a "usable" model (Elith et al. 2008 recommend at least 1000). Default is 1000.

maxTrees

Positive integer. Maximum number of trees in model set.

tries

Integer > 0. Number of times to try to train a model with a particular set of tuning parameters. The function stops training the first time a model converges (usually on the first attempt). Non-convergence seems to be related to the number of trees tried in each step, so if non-convergence occurs, the function automatically increases the number of trees per step and tries again, until tries is reached.

tryBy

Character vector. A vector containing one or more of 'learningRate', 'treeComplexity', 'maxTrees', and/or 'stepSize'. If a given combination of learningRate, treeComplexity, maxTrees, stepSize, and bagFraction does not allow model convergence, then the function tries again, but with alterations to any of the arguments named in tryBy:

  • learningRate: Decrease the learning rate by a factor of 10.

  • treeComplexity: Randomly increase/decrease tree complexity by 1 (minimum of 1).

  • maxTrees: Increase the number of trees by 20%.

  • stepSize: Increase the step size (argument n.trees in gbm.step()) by 50%.

If tryBy is NULL, then the function attempts to train the model with the same parameters up to tries times.
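
For example, the call below (a sketch, using env and predictors as in the Examples section) allows up to 3 attempts per parameter combination, altering only the learning rate between attempts:

brt <- trainBRT(
  data = env,
  resp = 'presBg',
  preds = predictors,
  tries = 3,
  tryBy = 'learningRate' # on each retry, decrease learning rate by a factor of 10
)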

w

Weights. Any of the following (see the sketch after this list):

  • TRUE: Causes the total weight of presences to equal the total weight of absences (if family='bernoulli').

  • FALSE: Each datum is assigned a weight of 1.

  • A numeric vector of weights, one per row in data.

  • The name of the column in data that contains site weights.
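
For example (a sketch, using env and predictors as in the Examples section; the specific down-weighting of background sites is purely illustrative):

# equalize the total weight of presences and absences (the default)
brt <- trainBRT(data = env, resp = 'presBg', preds = predictors, w = TRUE)

# user-supplied weights, one per row of data
wts <- ifelse(env$presBg == 1, 1, 0.5)
brt <- trainBRT(data = env, resp = 'presBg', preds = predictors, w = wts)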

anyway

Logical. If FALSE (default), it is possible for no models to be returned if none converge and/or none has at least minTrees trees. If TRUE, all models are returned, but with a warning.

family

Character. Name of error family.

out

Character vector. One or more values:

  • 'model': Model with the lowest deviance.

  • 'models': All models evaluated, sorted from lowest to highest deviance.

  • 'tuning': Data frame with tuning parameters, one row per model, sorted by deviance.

cores

Integer >= 1. Number of cores to use when calculating multiple models. Default is 1. If you have issues when cores > 1, please see the troubleshooting_parallel_operations guide.

verbose

Logical. If TRUE, display progress.

...

Additional arguments (not used).

Value

The object that is returned depends on the value of the out argument. It can be a model object, a data frame, a list of models, or a list of two or more of these.
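
For example, when more than one value is supplied to out, the returned object is a list (a sketch, assuming the list elements are named after the values of out):

fits <- trainBRT(
  data = env,
  resp = 'presBg',
  preds = predictors,
  out = c('model', 'tuning')
)
fits$tuning # data frame of tuning parameters, sorted by CV deviance
fits$model  # the gbm model with the lowest deviance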

References

Elith, J., J.R. Leathwick, & T. Hastie. 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77:802-813. doi:10.1111/j.1365-2656.2008.01390.x

See Also

gbm

Examples


# NB: The examples below show a very basic modeling workflow. They have been
# designed to run fast, not to produce accurate, defensible models. They can
# take a few minutes to run.

library(enmSdmX)
library(mgcv)
library(sf)
library(terra)
set.seed(123)

### setup data
##############

# environmental rasters
rastFile <- system.file('extdata/madClim.tif', package='enmSdmX')
madClim <- rast(rastFile)

# coordinate reference system
wgs84 <- getCRS('WGS84')

# lemur occurrence data
data(lemurs)
occs <- lemurs[lemurs$species == 'Eulemur fulvus', ]
occs <- vect(occs, geom=c('longitude', 'latitude'), crs=wgs84)

occs <- elimCellDuplicates(occs, madClim)

occEnv <- extract(madClim, occs, ID = FALSE)
occEnv <- occEnv[complete.cases(occEnv), ]
	
# create 10000 background sites (or as many as the raster can support)
bgEnv <- terra::spatSample(madClim, 20000)
bgEnv <- bgEnv[complete.cases(bgEnv), ]
bgEnv <- bgEnv[1:min(10000, nrow(bgEnv)), ]

# collate occurrences and background sites
presBg <- data.frame(
  presBg = c(
    rep(1, nrow(occEnv)),
    rep(0, nrow(bgEnv))
  )
)

env <- rbind(occEnv, bgEnv)
env <- cbind(presBg, env)

predictors <- c('bio1', 'bio12')

### calibrate models
####################

# Note that all of the trainXYZ functions can be made to go faster using the
# "cores" argument (set to 1 by default). The examples below will not go
# much faster using more cores because they are simplified, but you can try!
cores <- 1

# MaxEnt
mx <- trainMaxEnt(
	data = env,
	resp = 'presBg',
	preds = predictors,
	regMult = 1, # too few values for reliable model, but fast
	verbose = TRUE,
	cores = cores
)

# MaxNet
mn <- trainMaxNet(
	data = env,
	resp = 'presBg',
	preds = predictors,
	regMult = 1, # too few values for reliable model, but fast
	verbose = TRUE,
	cores = cores
)

# generalized linear model (GLM)
gl <- trainGLM(
	data = env,
	resp = 'presBg',
	preds = predictors,
	scale = TRUE, # automatic scaling of predictors
	verbose = TRUE,
	cores = cores
)

# generalized additive model (GAM)
ga <- trainGAM(
	data = env,
	resp = 'presBg',
	preds = predictors,
	verbose = TRUE,
	cores = cores
)

# natural splines
ns <- trainNS(
	data = env,
	resp = 'presBg',
	preds = predictors,
	scale = TRUE, # automatic scaling of predictors
	df = 1:2, # too few values for reliable model(?)
	verbose = TRUE,
	cores = cores
)

# boosted regression trees
envSub <- env[1:1049, ] # subsetting data to run faster
brt <- trainBRT(
	data = envSub,
	resp = 'presBg',
	preds = predictors,
	learningRate = 0.001, # too few values for reliable model(?)
	treeComplexity = c(2, 3), # too few values for reliable model, but fast
	minTrees = 1200, # minimum trees for reliable model(?), but fast
	maxTrees = 1200, # too small for reliable model(?), but fast
	tryBy = 'treeComplexity',
	anyway = TRUE, # return models that did not converge
	verbose = TRUE,
	cores = cores
)

# random forests
rf <- trainRF(
	data = env,
	resp = 'presBg',
	preds = predictors,
	numTrees = c(100, 500), # using at least 500 recommended, but fast!
	verbose = TRUE,
	cores = cores
)

### make maps of models
#######################

# NB We do not have to scale rasters before predicting GLMs and NSs because we
# used the `scale = TRUE` argument in trainGLM() and trainNS().

mxMap <- predictEnmSdm(mx, madClim)
mnMap <- predictEnmSdm(mn, madClim) 
glMap <- predictEnmSdm(gl, madClim)
gaMap <- predictEnmSdm(ga, madClim)
nsMap <- predictEnmSdm(ns, madClim)
brtMap <- predictEnmSdm(brt, madClim)
rfMap <- predictEnmSdm(rf, madClim)

maps <- c(
	mxMap,
	mnMap,
	glMap,
	gaMap,
	nsMap,
	brtMap,
	rfMap
)

names(maps) <- c('MaxEnt', 'MaxNet', 'GLM', 'GAM', 'NSs', 'BRTs', 'RFs')
fun <- function() plot(occs, col='black', pch=3, add=TRUE)
plot(maps, fun = fun, nc = 4)

### compare model responses to BIO12 (mean annual precipitation)
################################################################

# make a data frame holding all other variables at mean across occurrences,
# varying only BIO12
occEnvMeans <- colMeans(occEnv, na.rm=TRUE)
occEnvMeans <- rbind(occEnvMeans)
occEnvMeans <- as.data.frame(occEnvMeans)
climFrame <- occEnvMeans[rep(1, 100), ]
rownames(climFrame) <- NULL

minBio12 <- min(env$bio12)
maxBio12 <- max(env$bio12)
climFrame$bio12 <- seq(minBio12, maxBio12, length.out=100)

predMx <- predictEnmSdm(mx, climFrame)
predMn <- predictEnmSdm(mn, climFrame)
predGl <- predictEnmSdm(gl, climFrame)
predGa <- predictEnmSdm(ga, climFrame)
predNat <- predictEnmSdm(ns, climFrame)
predBrt <- predictEnmSdm(brt, climFrame)
predRf <- predictEnmSdm(rf, climFrame)


plot(climFrame$bio12, predMx,
	xlab='BIO12', ylab='Prediction', type='l', ylim=c(0, 1))

lines(climFrame$bio12, predMn, lty='solid', col='red')
lines(climFrame$bio12, predGl, lty='dotted', col='blue')
lines(climFrame$bio12, predGa, lty='dashed', col='green')
lines(climFrame$bio12, predNat, lty=4, col='purple')
lines(climFrame$bio12, predBrt, lty=5, col='orange')
lines(climFrame$bio12, predRf, lty=6, col='cyan')

legend(
   'topleft',
   inset = 0.01,
   legend = c(
	'MaxEnt',
	'MaxNet',
	'GLM',
	'GAM',
	'NS',
	'BRT',
	'RF'
   ),
   lty = c(1, 1:6),
   col = c(
	'black',
	'red',
	'blue',
	'green',
	'purple',
	'orange',
	'cyan'
   ),
   bg = 'white'
)


