R: Fits spatial random forest models

rf_spatial {spatialRF}

R Documentation

Fits spatial random forest models

Description

Fits spatial random forest models using different methods to generate, rank, and select spatial predictors acting as proxies of spatial processes not considered by the non-spatial predictors. The end goal is providing the model with information about the spatial structure of the data to minimize the spatial correlation (Moran's I) of the model residuals and generate honest variable importance scores.

Usage

rf_spatial(
  model = NULL,
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  xy = NULL,
  ranger.arguments = NULL,
  scaled.importance = TRUE,
  method = c("mem.moran.sequential", "mem.effect.sequential", "mem.effect.recursive",
    "hengl", "hengl.moran.sequential", "hengl.effect.sequential",
    "hengl.effect.recursive", "pca.moran.sequential", "pca.effect.sequential",
    "pca.effect.recursive"),
  max.spatial.predictors = NULL,
  weight.r.squared = NULL,
  weight.penalization.n.predictors = NULL,
  seed = 1,
  verbose = TRUE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

`model`	A model fitted with `rf()`. If used, the arguments `data`, `dependent.variable.name`, `predictor.variable.names`, `distance.matrix`, `distance.thresholds`, `ranger.arguments`, and `scaled.importance` are taken directly from the model definition. Default: NULL
`data`	Data frame with a response variable and a set of predictors. Default: `NULL`
`dependent.variable.name`	Character string with the name of the response variable. Must be in the column names of `data`. If the dependent variable is binary with values 1 and 0, the argument `case.weights` of `ranger` is populated by the function `case_weights()`. Default: `NULL`
`predictor.variable.names`	Character vector with the names of the predictive variables. Every element of this vector must be in the column names of `data`. Default: `NULL`
`distance.matrix`	Squared matrix with the distances among the records in `data`. The number of rows of `distance.matrix` and `data` must be the same. If not provided, the computation of the Moran's I of the residuals is omitted. Default: `NULL`
`distance.thresholds`	Numeric vector with distances in the same units as `distance.matrix` Distances below each distance threshold are set to 0 on separated copies of the distance matrix to compute Moran's I at different neighborhood distances. If `NULL`, it defaults to `seq(0, max(distance.matrix)/2, length.out = 4)` (defined by `default_distance_thresholds()`). Default: `NULL`
`xy`	(optional) Data frame or matrix with two columns containing coordinates and named "x" and "y". It is not used by this function, but it is stored in the slot `ranger.arguments$xy` of the model, so it can be used by `rf_evaluate()` and `rf_tuning()`. Default: `NULL`
`ranger.arguments`	Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', that is set to 'permutation' rather than 'none'. Please, consult the help file of ranger if you are not familiar with the arguments of this function.
`scaled.importance`	Logical. If `TRUE`, and 'importance = "permutation', the function scales 'data' with scale and fits a new model to compute scaled variable importance scores. Default: `TRUE`
`method`	Character, method to build, rank, and select spatial predictors. One of: "hengl" "hengl.moran.sequential" (experimental) "hengl.effect.sequential" (experimental) "hengl.effect.recursive" (experimental) "pca.moran.sequential" (experimental) "pca.effect.sequential" (experimental) "pca.effect.recursive" (experimental) "mem.moran.sequential" "mem.effect.sequential" "mem.effect.recursive"
`max.spatial.predictors`	Integer, maximum number of spatial predictors to generate. Useful when memory problems arise due to a large number of spatial predictors, Default: `NULL`
`weight.r.squared`	Numeric between 0 and 1, weight of R-squared in the selection of spatial components. See Details, Default: `NULL`
`weight.penalization.n.predictors`	Numeric between 0 and 1, weight of the penalization for adding an increasing number of spatial predictors during selection. Default: `NULL`
`seed`	Integer, random seed to facilitate reproducibility. Default: `1`.
`verbose`	Logical. If TRUE, messages and plots generated during the execution of the function are displayed, Default: `TRUE`
`n.cores`	Integer, number of cores to use for parallel execution. Creates a socket cluster with `parallel::makeCluster()`, runs operations in parallel with `foreach` and `⁠%dopar%⁠`, and stops the cluster with `parallel::clusterStop()` when the job is done. Default: `parallel::detectCores() - 1`
`cluster`	A cluster definition generated with `parallel::makeCluster()`. If provided, overrides `n.cores`. When `cluster = NULL` (default value), and `model` is provided, the cluster in `model`, if any, is used instead. If this cluster is `NULL`, then the function uses `n.cores` instead. The function does not stop a provided cluster, so it should be stopped with `parallel::stopCluster()` afterwards. The cluster definition is stored in the output list under the name "cluster" so it can be passed to other functions via the `model` argument, or using the `⁠%>%⁠` pipe. Default: `NULL`

Details

The function uses three different methods to generate spatial predictors ("hengl", "pca", and "mem"), two methods to rank them in order to define in what order they are introduced in the model ("effect" and "moran), and two methods to select the spatial predictors that minimize the spatial correlation of the model residuals ("sequential" and "recursive"). All method names but "hengl" (that uses the complete distance matrix as predictors in the spatial model) are named by combining a method to generate the spatial predictors, a method to rank them, and a method to select them, separated by a point. Examples are "mem.moran.sequential" or "mem.effect.recursive". All combinations are not possible, since the ranking method "moran" cannot be used with the selection method "recursive" (because the logics behind them are very different, see below). Methods to generate spatial predictors:

"hengl": named after the method RFsp presented in the paper "Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables", by Hengl et al. (2018), where the authors propose to use the distance matrix among records as predictors in spatial random forest models (RFsp method). In this function, all methods starting with "hengl" use either the complete distance matrix, or select columns of the distance matrix as spatial predictors.
"mem": Generates Moran's Eigenvector Maps, that is, the eigenvectors of the double-centered weights of the distance matrix. The method is described in "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014).
"pca": Computes spatial predictors from the principal component analysis of a weighted distance matrix (see weights_from_distance_matrix()). This is an experimental method, use with caution.

Methods to rank spatial predictors (see rank_spatial_predictors()):

"moran": Computes the Moran's I of each spatial predictor, selects the ones with positive values, and ranks them from higher to lower Moran's I.
"effect": If a given non-spatial random forest model is defined as y = p1 + ... + pn, being p1 + ... + pn the set of predictors, for every spatial predictor generated (spX) a spatial model y = p1 + ... + pn + spX is fitted, and the Moran's I of its residuals is computed. The spatial predictors are then ranked by how much they help to reduce spatial autocorrelation between the non-spatial and the spatial model.

Methods to select spatial predictors:

"sequential" (see select_spatial_predictors_sequential()): The spatial predictors are added one by one in the order they were ranked, and once all spatial predictors are introduced, the best first n predictors are selected. This method is similar to the one employed in the MEM methodology (Moran's Eigenvector Maps) described in the paper "Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM)", by Dray et al. (2006), and "Statistical methods for temporal and space–time analysis of community composition data", by Legendre and Gauthier (2014). This method generally introduces tens of predictors into the model, but usually offers good results.
"recursive" (see select_spatial_predictors_recursive()): This method tries to find the smallest combination of spatial predictors that reduce the spatial correlation of the model's residuals the most. The algorithm goes as follows: 1. The first ranked spatial predictor is introduced into the model; 2. the remaining predictors are ranked again using the "effect" method, using the model in 1. as reference. The first spatial predictor in the resulting ranking is then introduced into the model, and the steps 1. and 2. are repeated until spatial predictors stop having an effect in reducing the Moran's I of the model residuals. This method takes longer to compute, but generates smaller sets of spatial predictors. This is an experimental method, use with caution.

Once ranking procedure is completed, an algorithm is used to select the minimal subset of spatial predictors that reduce the most the Moran's I of the residuals: for each new spatial predictor introduced in the model, the Moran's I of the residuals, it's p-value, a binary version of the p-value (0 if < 0.05 and 1 if >= 0.05), the R-squared of the model, and a penalization linear with the number of spatial predictors introduced (computed as ⁠(1 / total spatial predictors) * introduced spatial predictors⁠) are rescaled between 0 and 1. Then, the optimization criteria is computed as ⁠max(1 - Moran's I, p-value binary) + (weight.r.squared * R-squared) - (weight.penalization.n.predictors * penalization)⁠. The predictors from the first one to the one with the highest optimization criteria are then selected as the best ones in reducing the spatial correlation of the model residuals, and used along with data to fit the final spatial model.

Value

A ranger model with several new slots:

ranger.arguments: Values of the arguments used to fit the ranger model.
importance: A list containing the vector of variable importance as originally returned by ranger (scaled or not depending on the value of 'scaled.importance'), a data frame with the predictors ordered by their importance, and a ggplot showing the importance values.
performance: With the out-of-bag R squared, pseudo R squared, RMSE and NRMSE of the model.
residuals: residuals, normality test of the residuals computed with residuals_test(), and spatial autocorrelation of the residuals computed with moran_multithreshold().
spatial: A list with four slots:
- method: Character, method used to generate, rank, and select spatial predictors.
- names: Character vector with the names of the selected spatial predictors. Not returned if the method is "hengl".
- optimization: Criteria used to select the spatial predictors. Not returned if the method is "hengl".
- plot: Plot of the criteria used to select the spatial predictors. Not returned if the method is "hengl".

Examples

if(interactive()){

 #loading example data
 data(distance_matrix)
 data(plant_richness_df)

 #names of the response and predictors
 dependent.variable.name <- "richness_species_vascular"
 predictor.variable.names <- colnames(plant_richness_df)[5:21]

 #hengl
 model <- rf_spatial(
   data = plant_richness_df,
   dependent.variable.name = dependent.variable.name,
   predictor.variable.names = predictor.variable.names,
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   method = "hengl",
   n.cores = 1
 )

 #mem.moran.sequential
 model <- rf_spatial(
   data = plant_richness_df,
   dependent.variable.name = dependent.variable.name,
   predictor.variable.names = predictor.variable.names,
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   method = "mem.moran.sequential",
   n.cores = 1
 )

 #fitting an rf_spatial model from an rf model
 rf.model <- rf(
   data = plant_richness_df,
   dependent.variable.name = "richness_species_vascular",
   predictor.variable.names = colnames(plant_richness_df)[5:21],
   distance.matrix = distance_matrix,
   distance.thresholds = 0,
   n.cores = 1,
   verbose = FALSE
 )
 rf.model$spatial.correlation.residuals$plot

 #spatial version of the rf model
 rf.spatial <- rf_spatial(model = rf.model)
 rf.spatial$spatial.correlation.residuals$plot


}

[Package spatialRF version 1.1.4 Index]