R: Sequential introduction of spatial predictors into a model

select_spatial_predictors_sequential {spatialRF}

R Documentation

Sequential introduction of spatial predictors into a model

Description

Selects spatial predictors by adding them sequentially to a model while monitoring the Moran's I of the model residuals and the model's R-squared. Once all the available spatial predictors have been added, the function identifies the first n predictors that minimize the spatial correlation of the residuals and maximize R-squared. It returns the names of the selected spatial predictors and a data frame with the selection criteria.

Usage

select_spatial_predictors_sequential(
  data = NULL,
  dependent.variable.name = NULL,
  predictor.variable.names = NULL,
  distance.matrix = NULL,
  distance.thresholds = NULL,
  ranger.arguments = NULL,
  spatial.predictors.df = NULL,
  spatial.predictors.ranking = NULL,
  weight.r.squared = 0.75,
  weight.penalization.n.predictors = 0.25,
  verbose = FALSE,
  n.cores = parallel::detectCores() - 1,
  cluster = NULL
)

Arguments

`data`	Data frame with a response variable and a set of predictors. Default: `NULL`
`dependent.variable.name`	Character string with the name of the response variable. Must be in the column names of `data`. Default: `NULL`
`predictor.variable.names`	Character vector with the names of the predictive variables. Every element of this vector must be in the column names of `data`. Default: `NULL`
`distance.matrix`	Square matrix with the distances among the records in `data`. The number of rows of `distance.matrix` and `data` must be the same. If not provided, the computation of the Moran's I of the residuals is omitted. Default: `NULL`
`distance.thresholds`	Numeric vector with neighborhood distances. All distances in the distance matrix below each value in `distance.thresholds` are set to 0 for the computation of Moran's I. If `NULL`, it defaults to `seq(0, max(distance.matrix), length.out = 4)`. Default: `NULL`
`ranger.arguments`	Named list with ranger arguments (other arguments of this function can also go here). All ranger arguments are set to their default values except for 'importance', which is set to 'permutation' rather than 'none'. Please consult the help file of ranger if you are not familiar with the arguments of this function.
`spatial.predictors.df`	Data frame of spatial predictors.
`spatial.predictors.ranking`	Ranking of the spatial predictors returned by `rank_spatial_predictors()`.
`weight.r.squared`	Numeric between 0 and 1, weight of R-squared in the optimization index. Default: `0.75`
`weight.penalization.n.predictors`	Numeric between 0 and 1, weight of the penalization for the number of spatial predictors added in the optimization index. Default: `0.25`
`verbose`	Logical, if `TRUE`, messages and plots generated during the execution of the function are displayed. Default: `FALSE`
`n.cores`	Integer, number of cores to use. Default: `parallel::detectCores() - 1`
`cluster`	A cluster definition generated by `parallel::makeCluster()`. Default: `NULL`
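
As a small illustration of the `distance.thresholds` argument described above, the following base R sketch computes the documented default for a toy distance matrix and applies one threshold by hand. The four points and the name `thresholded` are made up for the example, and the thresholding line only mirrors the behavior stated in the argument description; it is not taken from the package source.

```r
# Toy distance matrix built from four hypothetical points on a line.
distance.matrix <- as.matrix(dist(c(0, 1, 3, 6)))

# Documented default when distance.thresholds = NULL.
distance.thresholds <- seq(0, max(distance.matrix), length.out = 4)
distance.thresholds

# Manual thresholding at the second value: distances below it are set to 0,
# as stated in the description of distance.thresholds.
thresholded <- distance.matrix
thresholded[thresholded < distance.thresholds[2]] <- 0
thresholded
```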

Details

The algorithm works as follows: if the function rank_spatial_predictors returns 10 spatial predictors (sp1 to sp10, ordered from best to worst), select_spatial_predictors_sequential fits the models y ~ predictors + sp1, y ~ predictors + sp1 + sp2, and so on, until all spatial predictors are used in y ~ predictors + sp1 ... sp10. The model with the lowest Moran's I of the residuals and the highest R-squared (computed on the out-of-bag data) is selected, and its spatial predictors are returned.
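
The sequential procedure described above can be sketched in base R. This is a toy illustration under several assumptions, not the package's implementation: `lm()` stands in for the random forest, adjusted R-squared stands in for the out-of-bag R-squared, the spatial predictors are eigenvectors of a double-centered inverse-distance matrix assumed to be already ranked, and the optimization index is a hypothetical combination of the two criteria and the two weights. The helpers `moran_i()` and `rescale()` are written only for this example.

```r
# Toy sketch: lm() replaces the random forest; the index is hypothetical.
set.seed(1)
n <- 50
xy <- cbind(runif(n), runif(n))        # random coordinates
w <- 1 / as.matrix(dist(xy))           # inverse-distance weights
diag(w) <- 0

# Moran's I of a vector x for a weights matrix w
moran_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Rescale a vector to the range [0, 1]
rescale <- function(v) (v - min(v)) / (max(v) - min(v))

# Simulated response: one predictor plus a spatial trend
dat <- data.frame(x1 = rnorm(n))
dat$y <- dat$x1 + 2 * xy[, 1] + rnorm(n, sd = 0.3)

# Candidate spatial predictors, assumed already ranked (sp1 = best)
h <- diag(n) - matrix(1 / n, n, n)
sp <- eigen(h %*% w %*% h, symmetric = TRUE)$vectors[, 1:10]
colnames(sp) <- paste0("sp", 1:10)
dat <- cbind(dat, sp)

# Fit y ~ x1 + sp1, y ~ x1 + sp1 + sp2, ... and record both criteria
optimization <- data.frame(
  spatial.predictor.index = 1:10,
  moran.i = NA_real_,
  r.squared = NA_real_
)
for (i in 1:10) {
  f <- reformulate(c("x1", colnames(sp)[seq_len(i)]), response = "y")
  m <- lm(f, data = dat)
  optimization$moran.i[i] <- moran_i(residuals(m), w)
  optimization$r.squared[i] <- summary(m)$adj.r.squared
}

# Hypothetical index: reward low |Moran's I| and high R-squared,
# penalize the number of spatial predictors added
weight.r.squared <- 0.75
weight.penalization.n.predictors <- 0.25
optimization$index <- rescale(1 - abs(optimization$moran.i)) +
  weight.r.squared * rescale(optimization$r.squared) -
  weight.penalization.n.predictors *
    rescale(optimization$spatial.predictor.index)

# Keep the first n predictors, where n maximizes the index
best.n <- which.max(optimization$index)
best.spatial.predictors <- colnames(sp)[seq_len(best.n)]
```

Because the candidates are added in ranking order, the selection is always a prefix of the ranking (sp1 to spn), which is why only the position of the best model needs to be stored.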

Value

A list with two slots: optimization, a data frame with the index of the spatial predictor added on each iteration, the spatial correlation of the model residuals, and the R-squared of the model; and best.spatial.predictors, a character vector with the names of the spatial predictors that minimize the Moran's I of the residuals and maximize the R-squared of the model.

Examples

if(interactive()){

#loading example data
data(distance_matrix)
data(plant_richness_df)

#common arguments
dependent.variable.name <- "richness_species_vascular"
predictor.variable.names <- colnames(plant_richness_df)[5:21]

#non-spatial model
model <- rf(
  data = plant_richness_df,
  dependent.variable.name = dependent.variable.name,
  predictor.variable.names = predictor.variable.names,
  distance.matrix = distance_matrix,
  distance.thresholds = 0,
  n.cores = 1
)

#preparing spatial predictors
spatial.predictors <- mem_multithreshold(
  distance.matrix = distance_matrix,
  distance.thresholds = 0
)
#ranking spatial predictors by their Moran's I (faster option)
spatial.predictors.ranking <- rank_spatial_predictors(
  ranking.method = "moran",
  spatial.predictors.df = spatial.predictors,
  reference.moran.i = model$spatial.correlation.residuals$max.moran,
  distance.matrix = distance_matrix,
  distance.thresholds = 0,
  n.cores = 1
)

#selecting the best subset of predictors
selection <- select_spatial_predictors_sequential(
  data = plant_richness_df,
  dependent.variable.name = dependent.variable.name,
  predictor.variable.names = predictor.variable.names,
  distance.matrix = distance_matrix,
  distance.thresholds = 0,
  spatial.predictors.df = spatial.predictors,
  spatial.predictors.ranking = spatial.predictors.ranking,
  n.cores = 1
)

selection$optimization
selection$best.spatial.predictors
plot_optimization(selection$optimization)

}

[Package spatialRF version 1.1.4 Index]