trainBRT {enmSdmX}    R Documentation
Calibrate a boosted regression tree (generalized boosting machine) model
Description
This function calibrates a boosted regression tree (also called a gradient boosting machine) model, and is a wrapper for gbm. The function uses a grid search to assess the best combination of learning rate, tree depth, and bag fraction based on cross-validated deviance. If a particular combination of parameters leads to an unconverged model, the function tries again using slightly different parameters. Its output is any or all of: a table with the deviance of each evaluated model; all evaluated models; and/or the single model with the lowest deviance.
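For example, a basic grid search over several learning rates and tree depths might look like the following sketch. The data frame env, its binary presBg column, and the predictor names are hypothetical placeholders (the Examples section below builds a real version of each):

# Hypothetical sketch of a grid search: trainBRT() fits one model per
# combination of learningRate, treeComplexity, and bagFraction, then
# (by default) returns the model with the lowest cross-validated deviance.
best <- trainBRT(
    data = env,                            # placeholder data frame: response + predictors
    resp = 'presBg',                       # 1 = presence, 0 = background
    preds = c('bio1', 'bio12'),            # predictor columns
    learningRate = c(0.0001, 0.001, 0.01), # grid values to try
    treeComplexity = c(1, 3, 5),           # grid values to try
    bagFraction = 0.6
)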
Usage
trainBRT(
  data,
  resp = names(data)[1],
  preds = names(data)[2:ncol(data)],
  learningRate = c(1e-04, 0.001, 0.01),
  treeComplexity = c(5, 3, 1),
  bagFraction = 0.6,
  minTrees = 1000,
  maxTrees = 8000,
  tries = 5,
  tryBy = c("learningRate", "treeComplexity", "maxTrees", "stepSize"),
  w = TRUE,
  anyway = FALSE,
  family = "bernoulli",
  out = "model",
  cores = 1,
  verbose = FALSE,
  ...
)
Arguments
data: Data frame.

resp: Response variable. This is either the name of the column in data that has the response variable, or an integer indicating that column's position. The default is to use the first column in data.

preds: Character vector or integer vector. Names of columns or column indices of the predictors. The default is to use the second and subsequent columns in data.

learningRate: Numeric. Learning rate at which the model learns from successive trees (Elith et al. 2008 recommend 0.0001 to 0.1).

treeComplexity: Positive integer. Tree complexity: depth of branches in a single tree (1 to 16).

bagFraction: Numeric in the range [0, 1]. Bag fraction: proportion of data used for training in cross-validation (Elith et al. 2008 recommend 0.5 to 0.7).

minTrees: Positive integer. Minimum number of trees for a model to be scored as "usable" (Elith et al. 2008 recommend at least 1000). Default is 1000.

maxTrees: Positive integer. Maximum number of trees in the model set.

tries: Integer > 0. Number of times to try to train a model with a particular set of tuning parameters. The function stops training the first time a model converges (usually on the first attempt). Non-convergence seems to be related to the number of trees tried in each step, so if non-convergence occurs the function automatically increases the number of trees per step until tries attempts have been made.

tryBy: Character vector containing one or more of 'learningRate', 'treeComplexity', 'maxTrees', and/or 'stepSize'. If a given combination of tuning parameters does not yield a converged model, the function tries again after altering the arguments named here: lowering the learning rate, nudging the tree complexity, increasing the maximum number of trees, and/or increasing the number of trees added per step. See the sketch following this section.

w: Weights. Any of: TRUE (the default), in which case the total weight of presences is made equal to the total weight of absences; FALSE, in which case each datum receives a weight of 1; a numeric vector of weights, one per row in data; or the name of a column in data that contains site weights.

anyway: Logical. If FALSE (default), it is possible for no models to be returned if none converge and/or none have at least minTrees trees. If TRUE, all models are returned, but with a warning.

family: Character. Name of the error family.

out: Character vector. One or more of: 'model' (the single model with the lowest cross-validated deviance), 'models' (all evaluated models, sorted from lowest to highest deviance), and/or 'tuning' (a data frame with one row per model giving its tuning parameters and deviance, sorted by deviance).

cores: Integer >= 1. Number of cores to use when calculating multiple models. Default is 1. If you have issues when cores > 1, see the package's guide to troubleshooting parallel operations.

verbose: Logical. If TRUE, display progress.

...: Additional arguments (not used).
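As a sketch of how the convergence-retry and weighting arguments work together (the values here are illustrative, not recommendations, and env is again a placeholder data frame):

# Hypothetical sketch: allow up to 3 attempts per parameter combination,
# altering the learning rate and step size on each retry, and weight
# presences so their total weight equals that of the background sites.
fit <- trainBRT(
    data = env,
    resp = 'presBg',
    preds = c('bio1', 'bio12'),
    tries = 3,
    tryBy = c('learningRate', 'stepSize'),
    w = TRUE,
    anyway = TRUE # return models even if none converge (with a warning)
)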
Value
The object that is returned depends on the value of the out
argument. It can be a model object, a data frame, a list of models, or a list of two or more of these.
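For instance, when more than one output type is requested, the result is a list whose elements are named after the values passed to out. A sketch, again with placeholder data:

# Hypothetical sketch: request all three outputs and unpack them.
fits <- trainBRT(
    data = env,
    resp = 'presBg',
    preds = c('bio1', 'bio12'),
    out = c('model', 'models', 'tuning')
)
bestModel <- fits$model  # single model with lowest CV deviance
allModels <- fits$models # every model evaluated in the grid search
tuning <- fits$tuning    # data frame of tuning parameters and deviance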
References
Elith, J., J.R. Leathwick, & T. Hastie. 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77:802-813. doi:10.1111/j.1365-2656.2008.01390.x
See Also
gbm
Examples
# NB: The examples below show a very basic modeling workflow. They have been
# designed to work fast, not produce accurate, defensible models. They can
# take a few minutes to run.
library(enmSdmX)
library(mgcv)
library(sf)
library(terra)
set.seed(123)
### setup data
##############
# environmental rasters
rastFile <- system.file('extdata/madClim.tif', package='enmSdmX')
madClim <- rast(rastFile)
# coordinate reference system
wgs84 <- getCRS('WGS84')
# lemur occurrence data
data(lemurs)
occs <- lemurs[lemurs$species == 'Eulemur fulvus', ]
occs <- vect(occs, geom=c('longitude', 'latitude'), crs=wgs84)
occs <- elimCellDuplicates(occs, madClim)
occEnv <- extract(madClim, occs, ID = FALSE)
occEnv <- occEnv[complete.cases(occEnv), ]
# create 10000 background sites (or as many as raster can support)
bgEnv <- terra::spatSample(madClim, 20000)
bgEnv <- bgEnv[complete.cases(bgEnv), ]
bgEnv <- bgEnv[1:min(10000, nrow(bgEnv)), ]
# collate occurrences and background sites
presBg <- data.frame(
presBg = c(
rep(1, nrow(occEnv)),
rep(0, nrow(bgEnv))
)
)
env <- rbind(occEnv, bgEnv)
env <- cbind(presBg, env)
predictors <- c('bio1', 'bio12')
### calibrate models
####################
# Note that all of the trainXYZ functions can be made to go faster using the
# "cores" argument (set to 1 by default). The examples below will not
# go much faster using more cores because they are simplified, but
# you can try!
cores <- 1
# MaxEnt
mx <- trainMaxEnt(
data = env,
resp = 'presBg',
preds = predictors,
regMult = 1, # too few values for reliable model, but fast
verbose = TRUE,
cores = cores
)
# MaxNet
mn <- trainMaxNet(
data = env,
resp = 'presBg',
preds = predictors,
regMult = 1, # too few values for reliable model, but fast
verbose = TRUE,
cores = cores
)
# generalized linear model (GLM)
gl <- trainGLM(
data = env,
resp = 'presBg',
preds = predictors,
scale = TRUE, # automatic scaling of predictors
verbose = TRUE,
cores = cores
)
# generalized additive model (GAM)
ga <- trainGAM(
data = env,
resp = 'presBg',
preds = predictors,
verbose = TRUE,
cores = cores
)
# natural splines
ns <- trainNS(
data = env,
resp = 'presBg',
preds = predictors,
scale = TRUE, # automatic scaling of predictors
df = 1:2, # too few values for reliable model(?)
verbose = TRUE,
cores = cores
)
# boosted regression trees
envSub <- env[1:1049, ] # subsetting data to run faster
brt <- trainBRT(
data = envSub,
resp = 'presBg',
preds = predictors,
learningRate = 0.001, # too few values for reliable model(?)
treeComplexity = c(2, 3), # too few values for reliable model, but fast
minTrees = 1200, # minimum trees for reliable model(?), but fast
maxTrees = 1200, # too small for reliable model(?), but fast
tryBy = 'treeComplexity',
anyway = TRUE, # return models that did not converge
verbose = TRUE,
cores = cores
)
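# Optional check (a sketch, not part of the original workflow): when
# trainBRT() returns a single model, it is a gbm object, so gbm's
# summary() method reports each predictor's relative influence.
if (inherits(brt, 'gbm')) {
    print(summary(brt, plotit = FALSE))
}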
# random forests
rf <- trainRF(
data = env,
resp = 'presBg',
preds = predictors,
numTrees = c(100, 500), # using at least 500 recommended, but fast!
verbose = TRUE,
cores = cores
)
### make maps of models
#######################
# NB: We do not have to scale rasters before predicting GLMs and NSs because we
# used the `scale = TRUE` argument in trainGLM() and trainNS().
mxMap <- predictEnmSdm(mx, madClim)
mnMap <- predictEnmSdm(mn, madClim)
glMap <- predictEnmSdm(gl, madClim)
gaMap <- predictEnmSdm(ga, madClim)
nsMap <- predictEnmSdm(ns, madClim)
brtMap <- predictEnmSdm(brt, madClim)
rfMap <- predictEnmSdm(rf, madClim)
maps <- c(
mxMap,
mnMap,
glMap,
gaMap,
nsMap,
brtMap,
rfMap
)
names(maps) <- c('MaxEnt', 'MaxNet', 'GLM', 'GAM', 'NSs', 'BRTs', 'RFs')
fun <- function() plot(occs, col='black', pch=3, add=TRUE)
plot(maps, fun = fun, nc = 4)
### compare model responses to BIO12 (mean annual precipitation)
################################################################
# make a data frame holding all other variables at mean across occurrences,
# varying only BIO12
occEnvMeans <- colMeans(occEnv, na.rm=TRUE)
occEnvMeans <- rbind(occEnvMeans)
occEnvMeans <- as.data.frame(occEnvMeans)
climFrame <- occEnvMeans[rep(1, 100), ]
rownames(climFrame) <- NULL
minBio12 <- min(env$bio12)
maxBio12 <- max(env$bio12)
climFrame$bio12 <- seq(minBio12, maxBio12, length.out=100)
predMx <- predictEnmSdm(mx, climFrame)
predMn <- predictEnmSdm(mn, climFrame)
predGl <- predictEnmSdm(gl, climFrame)
predGa <- predictEnmSdm(ga, climFrame)
predNat <- predictEnmSdm(ns, climFrame)
predBrt <- predictEnmSdm(brt, climFrame)
predRf <- predictEnmSdm(rf, climFrame)
plot(climFrame$bio12, predMx,
xlab='BIO12', ylab='Prediction', type='l', ylim=c(0, 1))
lines(climFrame$bio12, predMn, lty='solid', col='red')
lines(climFrame$bio12, predGl, lty='dotted', col='blue')
lines(climFrame$bio12, predGa, lty='dashed', col='green')
lines(climFrame$bio12, predNat, lty=4, col='purple')
lines(climFrame$bio12, predBrt, lty=5, col='orange')
lines(climFrame$bio12, predRf, lty=6, col='cyan')
legend(
'topleft',
inset = 0.01,
legend = c(
'MaxEnt',
'MaxNet',
'GLM',
'GAM',
'NS',
'BRT',
'RF'
),
lty = c(1, 1, 3, 2, 4, 5, 6), # match line types used above ('dotted' = 3, 'dashed' = 2)
col = c(
'black',
'red',
'blue',
'green',
'purple',
'orange',
'cyan'
),
bg = 'white'
)