spatPredict {SAiVE}    R Documentation

Predict spatial variables using machine learning

Description

[Stable]

Function to facilitate the prediction of spatial variables using machine learning, including the selection of a particular model and/or model parameters from several user-defined options. Both classification and regression are supported, but please ensure that the models passed to the methods parameter are suitable for the type of problem.

Note that you may be prompted to install supplementary packages, depending on the model types chosen and whether or not these have been run before; this function may not be 'set and forget'.

It is possible to specify multiple machine learning methods (the methods parameter) as well as method-specific parameters (the trainControl parameter) if you wish to test multiple options and select the best one. To facilitate method selection, refer to function modelMatch(). If you are unsure of the best model to use, you can use the fastCompare parameter to quickly compare models and select the best one based on accuracy. If you wish to use a single model and/or trainControl object, you can pass a single string to methods and a single trainControl object to trainControl.

Warning options are changed for this function only to show all warnings as they occur and reset back to their original state upon function completion (a test is done first to ensure it can be reset). This is to ensure that any warnings when running models are shown in sequence with the messages indicating the progress of the function, especially when running multiple models and/or trainControl options.
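
The warning handling described above can be sketched in base R as follows. This is an illustrative pattern only, not the package's actual code; with_immediate_warnings() is a hypothetical helper name.

```r
# Hypothetical helper: show warnings as they occur, then restore the
# caller's original setting when the function exits.
with_immediate_warnings <- function(expr) {
  old <- options(warn = 1)           # warn = 1 prints each warning immediately
  on.exit(options(old), add = TRUE)  # restore the previous value, even on error
  force(expr)                        # expr is evaluated lazily, after warn is set
}

before <- getOption("warn")
inside <- with_immediate_warnings(getOption("warn"))
after  <- getOption("warn")
```

Here inside holds the temporary setting, while before and after are identical, showing that the option is restored once the call completes.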

Usage

spatPredict(
  features,
  outcome,
  poly_sample = 1000,
  trainControl,
  methods,
  fastCompare = TRUE,
  fastFraction = NULL,
  thinFeatures = TRUE,
  predict = FALSE,
  n.cores = NULL,
  save_path = NULL
)

Arguments

features

Independent variables. Must be either a NAMED list of terra SpatRasters or a multi-layer (stacked) SpatRaster (e.g. c(rast1, rast2)). All layers must have the same cell size, alignment, extent, and crs. These rasters should cover the training extent (the area covered by the SpatVector in outcome) as well as the desired extrapolation extent.
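
As a minimal sketch of the two accepted input forms, the snippet below builds two small made-up rasters (the layer names "slope" and "aspect" are placeholders for this illustration) and checks that their geometry matches before combining them. It assumes the terra package is installed.

```r
library(terra)

# Two small made-up rasters sharing the same geometry
r1 <- rast(nrows = 10, ncols = 10, xmin = 0, xmax = 10, ymin = 0, ymax = 10)
values(r1) <- runif(ncell(r1))
r2 <- rast(r1)  # copies the geometry of r1 without its values
values(r2) <- runif(ncell(r2))

# Throws an error if extent, dimensions or crs differ
compareGeom(r1, r2)

# Option 1: multi-layer (stacked) SpatRaster with named layers
stacked <- c(r1, r2)
names(stacked) <- c("slope", "aspect")

# Option 2: named list of single-layer SpatRasters
as_list <- list(slope = r1, aspect = r2)
```

Either stacked or as_list could then be passed to features.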

outcome

Dependent variable, as a terra SpatVector of points or polygons with a single attribute table column (of class integer, numeric or factor). The class of this column dictates whether the problem is approached as a classification or regression problem; see Details. If specifying polygons, stratified random sampling will be done with poly_sample number of points per unique polygon value.

poly_sample

If passing a polygon SpatVector to outcome, the number of points to generate from the polygons for each unique polygon value.

trainControl

Parameters used to control training of the machine learning model, created with caret::trainControl(). Passed to the trControl parameter of caret::train(). If specifying multiple methods in methods you can use a single trainControl which will apply to all methods, or pass multiple variations to this argument as a list with names matching the names of methods (one element for each model specified in methods).

methods

A string specifying one or more classification/regression method(s) to use. Passed to the method parameter of caret::train(). If specifying more than one method, they will all be passed to caret::resamples() to compare method performance. Then, if predict = TRUE, the method with the highest overall accuracy will be selected to predict the raster surface across the extent of features. A different trainControl parameter can be used for each method in methods.

fastCompare

If specifying multiple methods in methods or one method with multiple trainControl objects, should the points in outcome be sub-sampled for the comparison step? The selected method will be trained on the full outcome data set after selection. This only applies if methods is length > 3, with behavior further modified by fastFraction.

fastFraction

The fraction of points to use for the method comparison step (final training and testing is always done on the full data set) if fastCompare is TRUE and multiple methods are specified. The default, NULL, ranges from 1 for 5000 or fewer points to 0.1 for 50 000 or more points. You can also set this to any value between 0 and 1 to override this behavior.
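
The default described above fixes only the two end points (1 at 5000 or fewer points, 0.1 at 50 000 or more). The sketch below ASSUMES a linear transition in between, which this documentation does not state; default_fraction() is a hypothetical helper used only to illustrate the idea.

```r
# Hypothetical helper; the linear interpolation is an assumption
default_fraction <- function(n_points) {
  # approx() interpolates linearly; rule = 2 holds the end values
  # for inputs outside the 5000 to 50000 range
  stats::approx(x = c(5000, 50000), y = c(1, 0.1),
                xout = n_points, rule = 2)$y
}

fractions <- default_fraction(c(1000, 5000, 27500, 50000, 100000))
```

To avoid relying on the default, pass an explicit value such as fastFraction = 0.25.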

thinFeatures

Should random forest selection using VSURF::VSURF() be used in an attempt to remove irrelevant variables?

predict

TRUE will apply the trained model to the full extent of features and return a raster saved to save_path.

n.cores

The maximum number of cores to use. Leave NULL to use all cores minus 1.

save_path

The path (folder) to which you wish to save the predicted raster. Not used unless predict = TRUE.

Details

This function partly operates as a convenient means of passing various parameters to the caret::train() function, enabling the user to rapidly trial different model types and parameter sets. In addition, pre-processing of data can optionally be done using VSURF::VSURF() (parameter thinFeatures), which can decrease the time to run models by removing superfluous variables.

Value

If passing only one method to the methods argument: the outcome of the VSURF variable selection process (if thinFeatures is TRUE), the training and testing data.frames, the fitted model, model performance statistics, and the final predicted raster (if predict = TRUE).

If passing multiple methods to the methods argument: the outcome of the VSURF variable selection process (if thinFeatures is TRUE), the training and testing data.frames, character vectors listing the methods that failed and the methods that generated a warning along with the corresponding errors and warnings, the model performance comparison, the selected method, the trained model performance statistics, and the final predicted raster (if predict = TRUE).

In either case, the predicted raster is written to disk if save_path is specified.

Model testing, comparison, and reported metrics

After extracting raster values at n points from the features rasters, the point values are split spatially into training and testing sets along a 70/30 split. This is accomplished by creating a grid (1000*1000) of polygons over the extent of the points and randomly assigning polygons to training or testing sets. Points within these polygons are then assigned to the corresponding set, so that training and testing points never share a grid polygon and the two sets are spatially separated.
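
The grid-based split described above can be illustrated with a base R sketch. The coordinates are made up, a coarse 20 x 20 grid stands in for the 1000*1000 grid to keep the example small, and assigning 70% of the occupied cells to training is an assumption about how the 70/30 split is reached; the package's own implementation may differ.

```r
set.seed(42)

# Made-up point coordinates standing in for the sampled points
pts <- data.frame(x = runif(2000, 0, 100), y = runif(2000, 0, 100))

n_side <- 20  # assumption: coarse grid for illustration only
breaks <- seq(0, 100, length.out = n_side + 1)

# Index of the grid cell that contains each point
col_id <- cut(pts$x, breaks = breaks, labels = FALSE, include.lowest = TRUE)
row_id <- cut(pts$y, breaks = breaks, labels = FALSE, include.lowest = TRUE)
pts$cell <- (row_id - 1) * n_side + col_id

# Randomly assign about 70% of the occupied cells to the training set
cells <- unique(pts$cell)
train_cells <- sample(cells, size = round(0.7 * length(cells)))

# Every point inherits the set of the cell it falls in
pts$set <- ifelse(pts$cell %in% train_cells, "train", "test")
```

Because whole cells are assigned rather than individual points, the share of points in each set is only approximately 70/30.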

Method for selecting the best model:

When specifying multiple model types in methods, each model type and trainControl pair (if trainControl is a list of length equal to methods) is run using caret::train(). To speed things up you can use fastCompare = TRUE. Models are then compared on their 'accuracy' metric as output by caret::resamples() when run on the testing partition, and the highest-performing model is selected. If fastCompare is TRUE, this model is then run on the complete data set provided in outcome. Model statistics are returned upon function completion, which allows the user to select their own 'best performing' model based on other criteria if desired.

Balancing classes in outcome (dependent) variable

Models can be biased if they are given significantly more points in one outcome class than in others, and best practice is to even out the number of points in each class. If extracting point values from a vector or raster object and passing a points vector object to this function, a simple way to do that is to use the "strata" parameter of terra::spatSample(). If working directly from points, caret::downSample() and caret::upSample() can be used. See this link for more information. Note that if passing a polygons object to this function, stratified random sampling will automatically be performed.
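
As a small illustration of down-sampling with caret::downSample(), the snippet below balances a made-up, imbalanced table of point values. The column names and class labels are invented for this example, and the result is a plain data.frame; the balanced points would still need to be converted to a SpatVector before being passed to outcome.

```r
set.seed(1)

# Made-up, imbalanced table: 90 points of class "a", 10 of class "b"
dat <- data.frame(
  slope  = runif(100),
  aspect = runif(100),
  class  = factor(rep(c("a", "b"), times = c(90, 10)))
)

# Randomly drop rows until every class is as small as the rarest one
balanced <- caret::downSample(
  x = dat[, c("slope", "aspect")],
  y = dat$class,
  yname = "class"
)

table(balanced$class)
```

caret::upSample() takes the same arguments but instead resamples the smaller classes, with replacement, up to the size of the largest class.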

Classification or regression

Whether this function treats your inputs as a classification or regression problem depends on the class of the outcome variable. An outcome of class factor will be treated as a classification problem, while all other classes (such as integer or numeric) will be treated as regression problems.
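
The rule above means that integer class codes must be converted to a factor before they are treated as categories. The following sketch illustrates this; problem_type() is a hypothetical helper that mirrors the rule and is not part of the package.

```r
# Hypothetical helper mirroring the rule described above
problem_type <- function(x) {
  if (is.factor(x)) "classification" else "regression"
}

codes <- c(1L, 2L, 2L, 3L)                    # integer class codes

as_integer <- problem_type(codes)             # "regression"
as_factor  <- problem_type(as.factor(codes))  # "classification"
```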

Author(s)

Ghislain de Laplante (gdela069@uottawa.ca or ghislain.delaplante@yukon.ca)

Examples


# These examples can take a while to run!

# Install packages underpinning examples
rlang::check_installed("ranger", reason = "required to run example.")
rlang::check_installed("Rborist", reason = "required to run example.")

# Single model, single trainControl

trainControl <- caret::trainControl(
                method = "repeatedcv",
                number = 2, # 2-fold Cross-validation
                repeats = 2, # repeated 2 times
                verboseIter = FALSE,
                returnResamp = "final",
                savePredictions = "all",
                allowParallel = TRUE)

outcome <- permafrost_polygons
outcome$Type <- as.factor(outcome$Type)

result <- spatPredict(features = c(aspect, solrad, slope),
  outcome = outcome,
  poly_sample = 100,
  trainControl = trainControl,
  methods = "ranger",
  n.cores = 2,
  predict = TRUE)

terra::plot(result$prediction)


# Multiple models, multiple trainControl

trainControl <- list("ranger" = caret::trainControl(
                                  method = "repeatedcv",
                                  number = 2,
                                  repeats = 2,
                                  verboseIter = FALSE,
                                  returnResamp = "final",
                                  savePredictions = "all",
                                  allowParallel = TRUE),
                     "Rborist" = caret::trainControl(
                                   method = "boot",
                                   number = 2,
                                   repeats = 2,
                                   verboseIter = FALSE,
                                   returnResamp = "final",
                                   savePredictions = "all",
                                   allowParallel = TRUE)
                                   )

result <- spatPredict(features = c(aspect, solrad, slope),
  outcome = outcome,
  poly_sample = 100,
  trainControl = trainControl,
  methods = c("ranger", "Rborist"),
  n.cores = 2,
  predict = TRUE)

terra::plot(result$prediction)


[Package SAiVE version 1.0.6 Index]