R: Predict spatial variables using machine learning

spatPredict {SAiVE}

R Documentation

Predict spatial variables using machine learning

Description

Function to facilitate the prediction of spatial variables using machine learning, including the selection of a particular model and/or model parameters from several user-defined options. Both classification and regression are supported, though please ensure that the models passed to the methods parameter are suitable for the type of problem.

Note that you may be prompted to install supplementary packages, depending on the model types chosen and whether or not these have been run before; this function may therefore not be 'set and forget'.

It is possible to specify multiple machine learning methods (the methods parameter) as well as method-specific parameters (the trainControl parameter) if you wish to test multiple options and select the best one. To facilitate method selection, refer to function modelMatch(). If you are unsure of the best model to use, you can use the fastCompare parameter to quickly compare models and select the best one based on accuracy. If you wish to use a single model and/or trainControl object, you can pass a single string to methods and a single trainControl object to trainControl.

Warning options are changed for this function only to show all warnings as they occur and reset back to their original state upon function completion (a test is done first to ensure it can be reset). This is to ensure that any warnings when running models are shown in sequence with the messages indicating the progress of the function, especially when running multiple models and/or trainControl options.
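The save-and-restore pattern described above can be sketched in base R as follows. This is only an illustration of the general technique, not the package's actual code; the helper name `with_immediate_warnings` is hypothetical.

```r
# Hypothetical helper (not part of SAiVE): evaluate code with warnings printed
# as they occur, then restore the caller's original warning setting.
with_immediate_warnings <- function(expr) {
  old <- options(warn = 1)           # warn = 1 prints each warning immediately
  on.exit(options(old), add = TRUE)  # restore on exit, even if an error occurs
  force(expr)                        # the lazily passed code is evaluated here
}

res <- with_immediate_warnings({
  message("Fitting model 1 of 2")
  warning("example warning raised while fitting model 1")
  "done"
})
```

With `warn = 1` the warning appears directly after the progress message rather than being deferred until the top-level call returns.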

Usage

spatPredict(
  features,
  outcome,
  poly_sample = 1000,
  trainControl,
  methods,
  fastCompare = TRUE,
  fastFraction = NULL,
  thinFeatures = TRUE,
  predict = FALSE,
  n.cores = NULL,
  save_path = NULL
)

Arguments

`features`	Independent variables. Must be either a NAMED list of terra SpatRasters or a multi-layer (stacked) SpatRaster (e.g. c(rast1, rast2)). All layers must have the same cell size, alignment, extent, and crs. These rasters should include the training extent (that covered by the SpatVector in `outcome`) as well as the desired extrapolation extent.
`outcome`	Dependent variable, as a terra SpatVector of points or polygons with a single attribute table column (of class integer, numeric or factor). The class of this column dictates whether the problem is approached as a classification or regression problem; see the section 'Classification or regression'. If specifying polygons, stratified random sampling will be done with `poly_sample` number of points per unique polygon value.
`poly_sample`	If passing a polygon SpatVector to `outcome`, the number of points to generate from the polygons for each unique polygon value.
`trainControl`	Parameters used to control training of the machine learning model, created with `caret::trainControl()`. Passed to the `trControl` parameter of `caret::train()`. If specifying multiple methods in `methods` you can use a single `trainControl` which will apply to all `methods`, or pass multiple variations to this argument as a list with names matching the names of `methods` (one element for each model specified in methods).
`methods`	A string specifying one or more classification/regression method(s) to use. Passed to the `method` parameter of `caret::train()`. If specifying more than one method, they will all be passed to `caret::resamples()` to compare method performance. Then, if `predict = TRUE`, the method with the highest overall accuracy will be selected to predict the raster surface across the extent of `features`. A different `trainControl` parameter can be used for each method in `methods`.
`fastCompare`	If specifying multiple methods in `methods` or one method with multiple `trainControl` objects, should the points in `outcome` be sub-sampled for the comparison step? The selected method will be trained on the full `outcome` data set after selection. This only applies if `methods` is length > 3, with behavior further modified by fastFraction.
`fastFraction`	The fraction of points to use for the method comparison step (final training and testing is always done on the full data set) if `fastCompare` is TRUE and multiple methods are specified. The default, NULL, ranges from 1 for 5000 or fewer points to 0.1 for 50 000 or more points. You can also set this to any value between 0 and 1 to override this behavior.
`thinFeatures`	Should random forest selection using `VSURF::VSURF()` be used in an attempt to remove irrelevant variables?
`predict`	TRUE will apply the trained model to the full extent of `features` and return a raster saved to `save_path`.
`n.cores`	The maximum number of cores to use. Leave NULL to use all cores minus 1.
`save_path`	The path (folder) to which you wish to save the predicted raster. Not used unless `predict = TRUE`.
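
As a rough illustration of the default `fastFraction` behaviour described above, the sketch below maps a number of points to a sub-sampling fraction. Only the two end points (1 at 5000 or fewer points, 0.1 at 50 000 or more) come from this documentation; the linear interpolation in between and the helper name `default_fast_fraction` are assumptions made for the example.

```r
# Hypothetical helper; the linear ramp between the documented end points is an
# assumption, not necessarily what spatPredict() does internally.
default_fast_fraction <- function(n_points) {
  # rule = 2 holds the end values constant outside the 5000 to 50000 range
  stats::approx(x = c(5000, 50000), y = c(1, 0.1),
                xout = n_points, rule = 2)$y
}

default_fast_fraction(c(1000, 5000, 27500, 50000, 200000))
```

Passing an explicit value such as `fastFraction = 0.25` bypasses any such default.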

Details

This function partly operates as a convenient means of passing various parameters to the caret::train() function, enabling the user to rapidly trial different model types and parameter sets. In addition, pre-processing of data can optionally be done using VSURF::VSURF() (parameter thinFeatures), which can decrease the time to run models by removing superfluous predictor variables.

Value

If passing only one method to the methods argument: the outcome of the VSURF variable selection process (if thinFeatures is TRUE), the training and testing data.frames, the fitted model, model performance statistics, and the final predicted raster (if predict = TRUE).

If passing multiple methods to the methods argument: the outcome of the VSURF variable selection process (if thinFeatures is TRUE), the training and testing data.frames, character vectors listing the methods that failed and the methods that generated a warning, along with what those errors and warnings were, the model performance comparison, the selected method, the trained model performance statistics, and the final predicted raster (if predict = TRUE).

In either case, the predicted raster is written to disk if save_path is specified.

Model testing, comparison, and reported metrics

After extracting raster values at n points from the features rasters, the point values are divided spatially into training and testing sets in a 70/30 split. This is accomplished by creating a grid (1000*1000) of polygons over the extent of the points and randomly assigning polygons to training or testing sets. Points within these polygons are then assigned to the corresponding set, ensuring that the training and testing sets are spatially independent.
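
The grid-based split described above can be sketched in base R. This is a simplified illustration rather than the package's implementation: the coordinates are simulated, the grid is 10 x 10 instead of 1000 x 1000 for readability, and the 70/30 proportion is applied to grid cells (an assumption about how the split is realised).

```r
set.seed(42)  # reproducible illustration

# Simulated point coordinates standing in for the points derived from `outcome`
pts <- data.frame(x = runif(2000, 0, 100), y = runif(2000, 0, 100))

# Overlay a regular grid and record which cell each point falls in
n_side <- 10
breaks <- seq(0, 100, length.out = n_side + 1)
col_id <- cut(pts$x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
row_id <- cut(pts$y, breaks = breaks, labels = FALSE, include.lowest = TRUE)
pts$cell <- (row_id - 1) * n_side + col_id

# Randomly assign about 70% of the occupied cells to training, the rest to testing
cells <- unique(pts$cell)
train_cells <- sample(cells, size = round(0.7 * length(cells)))

# Every point inherits the set of the cell that contains it
pts$set <- ifelse(pts$cell %in% train_cells, "train", "test")

table(pts$set)
```

Because whole cells are assigned to one set, no grid cell contributes points to both training and testing.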

Method for selecting the best model:

When specifying multiple model types in methods, each model type and trainControl pair (if trainControl is a list of length equal to methods) is run using caret::train(). To speed things up you can use fastCompare = TRUE. Models are then compared on their 'accuracy' metric as output by caret::resamples() when run on the testing partition, and the highest-performing model is selected. If fastCompare is TRUE, this model is then run on the complete data set provided in outcome. Model statistics are returned upon function completion, which allows the user to select their own 'best performing' model based on other criteria if desired.
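
The selection step described above amounts to choosing the method with the highest accuracy. A minimal base-R sketch follows; the accuracy values are invented purely for illustration and the object names are hypothetical (in practice the per-resample values would come from caret::resamples()).

```r
# Invented accuracy values for illustration only
# (rows = resamples, columns = candidate methods)
acc <- cbind(ranger  = c(0.81, 0.79, 0.83, 0.80),
             Rborist = c(0.78, 0.82, 0.77, 0.79))

mean_acc <- colMeans(acc)                   # average accuracy per method
best_method <- names(which.max(mean_acc))   # method with the highest mean

mean_acc
best_method
```

The same table can be inspected with other criteria in mind, for example preferring the method with the smallest spread across resamples.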

Balancing classes in outcome (dependent) variable

Models can be biased if they are given significantly more points in one outcome class than in others, and best practice is to even out the number of points in each class. If extracting point values from a vector or raster object and passing a points vector object to this function, a simple way to do that is to use the "strata" parameter of terra::spatSample(). If working directly from points, caret::downSample() and caret::upSample() can be used. See this link for more information. Note that if passing a polygons object to this function, stratified random sampling will automatically be performed.
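
To make the idea of down-sampling concrete, here is a minimal base-R sketch that reduces every class to the size of the smallest one. The data are simulated and the class labels are placeholders; for real workflows caret::downSample() provides this functionality.

```r
set.seed(1)  # reproducible illustration

# Simulated, imbalanced point data (class labels are placeholders)
dat <- data.frame(
  slope = rnorm(130),
  Type  = factor(rep(c("A", "B", "C"), times = c(100, 20, 10)))
)
table(dat$Type)  # 100, 20 and 10 points per class

# Keep a random subset of each class, sized to match the smallest class
n_min <- min(table(dat$Type))
keep <- unlist(lapply(
  split(seq_len(nrow(dat)), dat$Type),
  function(idx) idx[sample.int(length(idx), n_min)]
))
balanced <- dat[keep, ]

table(balanced$Type)
```

Down-sampling discards data from the larger classes; up-sampling (repeating points from the smaller classes) is the alternative when points are scarce.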

Classification or regression

Whether this function treats your inputs as a classification or regression problem depends on the class of the outcome variable. An outcome of class factor will be treated as a classification problem, while all other classes (such as integer or numeric) will be treated as regression problems.
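
The rule above can be expressed as a short check. The helper name `problem_type` is hypothetical and only mirrors the documented behaviour; it shows why integer-coded categories need converting with as.factor() before being passed as the outcome.

```r
# Hypothetical helper mirroring the documented rule: factor means classification,
# anything else means regression.
problem_type <- function(x) {
  if (is.factor(x)) "classification" else "regression"
}

codes <- c(1L, 2L, 2L, 3L)       # categories stored as integer codes

problem_type(codes)              # treated as regression
problem_type(as.factor(codes))   # treated as classification
```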

Author(s)

Ghislain de Laplante (gdela069@uottawa.ca or ghislain.delaplante@yukon.ca)

Examples


# These examples can take a while to run!

# Install packages underpinning examples
rlang::check_installed("ranger", reason = "required to run example.")
rlang::check_installed("Rborist", reason = "required to run example.")

# Single model, single trainControl

trainControl <- caret::trainControl(
                method = "repeatedcv",
                number = 2, # 2-fold Cross-validation
                repeats = 2, # repeated 2 times
                verboseIter = FALSE,
                returnResamp = "final",
                savePredictions = "all",
                allowParallel = TRUE)

outcome <- permafrost_polygons
outcome$Type <- as.factor(outcome$Type)

result <- spatPredict(features = c(aspect, solrad, slope),
  outcome = outcome,
  poly_sample = 100,
  trainControl = trainControl,
  methods = "ranger",
  n.cores = 2,
  predict = TRUE)

terra::plot(result$prediction)


# Multiple models, multiple trainControl

trainControl <- list("ranger" = caret::trainControl(
                                  method = "repeatedcv",
                                  number = 2,
                                  repeats = 2,
                                  verboseIter = FALSE,
                                  returnResamp = "final",
                                  savePredictions = "all",
                                  allowParallel = TRUE),
                     "Rborist" = caret::trainControl(
                                   method = "boot",
                                   number = 2,
                                   repeats = 2,
                                   verboseIter = FALSE,
                                   returnResamp = "final",
                                   savePredictions = "all",
                                   allowParallel = TRUE)
                                   )

result <- spatPredict(features = c(aspect, solrad, slope),
  outcome = outcome,
  poly_sample = 100,
  trainControl = trainControl,
  methods = c("ranger", "Rborist"),
  n.cores = 2,
  predict = TRUE)

terra::plot(result$prediction)

[Package SAiVE version 1.0.6 Index]