aoa {CAST} | R Documentation |

This function estimates the Dissimilarity Index (DI) and the derived Area of Applicability (AOA) of spatial prediction models. The DI is based on the distance, in the predictor variable space, between new data (e.g. a SpatRaster of the spatial predictors used in the model) and the data used for model training. Predictors can be weighted by the internal variable importance of the machine learning algorithm used for model training. The AOA is derived by applying a threshold to the DI; the threshold is the (outlier-removed) maximum DI of the cross-validated training data.

```
aoa(
  newdata,
  model = NA,
  trainDI = NA,
  train = NULL,
  weight = NA,
  variables = "all",
  CVtest = NULL,
  CVtrain = NULL,
  method = "L2",
  useWeight = TRUE
)
```

`newdata` |
A SpatRaster, stars object or data.frame containing the data the model was meant to make predictions for. |

`model` |
A train object created with caret, from which the weights (based on variable importance) and the cross-validation folds are extracted. See the examples for the case where no model is available or where the model was trained via e.g. mlr3. |

`trainDI` |
A trainDI object. Optional if |

`train` |
A data.frame containing the data used for model training. Optional. Only required if no model is given. |

`weight` |
A data.frame containing weights for each variable. Optional. Only required if no model is given. |

`variables` |
Character vector of predictor variables. If "all", all variables of the model are used or, if no model is given, all variables of the train dataset. |

`CVtest` |
List or vector. Either a list in which each element contains the data points used for testing during one cross-validation iteration (i.e. the held-back data), or a vector that contains the fold ID of each training point. Only required if no model is given. |

`CVtrain` |
List. Each element contains the data points used for training during one cross-validation iteration.
Only required if no model is given and CVtrain is not the complement of CVtest (by default, a data point that is not used for testing in a fold is used for training).
Relevant if some data points are excluded, e.g. when using |

`method` |
Character. Method used for distance calculation. Currently Euclidean distance ("L2") and Mahalanobis distance ("MD") are implemented, but only "L2" is tested. Note that "MD" takes considerably longer. |

`useWeight` |
Logical. Only if a model is given. Weight variables according to importance in the model? |
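
The two accepted forms of `CVtest` and the default relation between `CVtest` and `CVtrain` can be illustrated in base R. This is a minimal sketch with a made-up fold vector; it assumes that the list elements are row indices of the training data (as in caret's `indexOut`), and the object names are hypothetical.

```r
# Hypothetical fold assignment for 10 training points in 3 folds
fold_id <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1)

# Vector form -> list form: row indices held back in each fold
CVtest_list <- split(seq_along(fold_id), fold_id)

# Default complement: every point not tested in a fold is used for training
CVtrain_list <- lapply(CVtest_list, function(test) {
  setdiff(seq_along(fold_id), test)
})

str(CVtest_list)
str(CVtrain_list)
```

An explicit `CVtrain` only differs from this complement when some points are left out of both training and testing in a fold.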

The Dissimilarity Index (DI) and the corresponding Area of Applicability (AOA) are calculated. If variables are factors, dummy variables are created prior to weighting and distance calculation.
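
As an illustration of what creating dummy variables from a factor means, the following base R sketch expands a factor into one indicator column per level. It uses `model.matrix` on made-up data and is not necessarily how `aoa` performs this step internally; the column `landcover` is hypothetical.

```r
# Hypothetical predictors: one numeric variable and one factor
df <- data.frame(
  DEM = c(790, 810, 805),
  landcover = factor(c("forest", "crop", "forest"))
)

# Dropping the intercept (- 1) yields one indicator column per factor level
X <- model.matrix(~ DEM + landcover - 1, data = df)
colnames(X)
```

After this expansion every column is numeric, so weights and distances can be computed on all predictors alike.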

Interpretation of results: If a location is very similar to the properties of the training data it will have a low distance in the predictor variable space (DI towards 0) while locations that are very different in their properties will have a high DI. See Meyer and Pebesma (2021) for the full documentation of the methodology.
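
The idea can be sketched in base R on made-up data. This is a conceptual illustration, not the CAST source code: it assumes Euclidean distances on standardized, weighted predictors, a DI defined as the distance to the nearest training point divided by the mean pairwise training distance, and the upper boxplot whisker as one reading of "outlier-removed maximum". All data and object names are hypothetical.

```r
set.seed(1)

# Hypothetical training predictors and two new locations
train   <- data.frame(DEM = rnorm(20, 800, 50), TWI = rnorm(20, 8, 2))
newdata <- data.frame(DEM = c(805, 1200), TWI = c(8.1, 20))
weight  <- c(DEM = 2, TWI = 1)   # hypothetical variable importances
fold    <- rep(1:5, each = 4)    # hypothetical CV fold of each training point

# Standardize with the training mean and sd, then apply the weights
mu <- colMeans(train)
s  <- apply(train, 2, sd)
prep <- function(x) sweep(scale(x, center = mu, scale = s), 2, weight, "*")
tr <- prep(train)
nw <- prep(newdata)

# Mean of all pairwise distances between training points
d_train <- as.matrix(dist(tr))
d_mean  <- mean(d_train[upper.tri(d_train)])

# Training DI: nearest training point that lies in a different CV fold
di_train <- sapply(seq_len(nrow(tr)), function(i) {
  min(d_train[i, fold != fold[i]]) / d_mean
})

# DI of new data: nearest training point overall
di_new <- apply(nw, 1, function(p) {
  min(sqrt(colSums((t(tr) - p)^2))) / d_mean
})

# Threshold (assumption): upper whisker of the cross-validated training DI
threshold <- grDevices::boxplot.stats(di_train)$stats[5]
aoa_new   <- as.integer(di_new <= threshold)

di_new
aoa_new
```

The first new location lies close to the training data in both predictors, the second far away from it, so the second receives the larger DI.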

An object of class `aoa` containing:

`parameters` |
object of class trainDI. see |

`DI` |
SpatRaster, stars object or data frame. Dissimilarity index of newdata |

`AOA` |
SpatRaster, stars object or data frame. Area of Applicability of newdata. AOA has values 0 (outside AOA) and 1 (inside AOA) |
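
Because `AOA` is coded as 0 and 1, it can serve directly as a mask for predictions. The sketch below shows the idea with plain vectors, as would apply when `newdata` is a data.frame; the prediction and AOA values are made up for illustration.

```r
# Hypothetical predictions and AOA values for five locations
pred       <- c(0.31, 0.28, 0.45, 0.39, 0.22)
aoa_values <- c(1, 1, 0, 1, 0)

# Keep predictions inside the AOA, set the rest to NA
pred_masked <- ifelse(aoa_values == 1, pred, NA)

# Share of locations that fall inside the AOA
share_inside <- mean(aoa_values == 1)

pred_masked
share_inside
```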

If classification models are used, the variable importance can currently only be retrieved automatically if the model was trained via train(predictors, response) and not via the formula interface. This limitation is planned to be fixed.

Hanna Meyer

Meyer, H., Pebesma, E. (2021): Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution 12: 1620-1633. doi:10.1111/2041-210X.13650

```
## Not run:
library(sf)
library(terra)
library(caret)
library(viridis)
library(latticeExtra)
# prepare sample data:
dat <- get(load(system.file("extdata","Cookfarm.RData",package="CAST")))
dat <- aggregate(dat[,c("VW","Easting","Northing")],by=list(as.character(dat$SOURCEID)),mean)
pts <- st_as_sf(dat,coords=c("Easting","Northing"))
pts$ID <- 1:nrow(pts)
set.seed(100)
pts <- pts[1:30,]
studyArea <- rast(system.file("extdata","predictors_2012-03-25.grd",package="CAST"))[[1:8]]
trainDat <- extract(studyArea,pts,na.rm=FALSE)
trainDat <- merge(trainDat,pts,by.x="ID",by.y="ID")
# visualize data spatially:
plot(studyArea)
plot(studyArea$DEM)
plot(pts[,1],add=TRUE,col="black")
# train a model:
set.seed(100)
variables <- c("DEM","NDRE.Sd","TWI")
model <- train(trainDat[,which(names(trainDat)%in%variables)],
               trainDat$VW, method="rf", importance=TRUE, tuneLength=1,
               trControl=trainControl(method="cv",number=5,savePredictions=TRUE))
print(model) #note that this is a quite poor prediction model
prediction <- predict(studyArea,model,na.rm=TRUE)
plot(varImp(model,scale=FALSE))
#...then calculate the AOA of the trained model for the study area:
AOA <- aoa(studyArea,model)
plot(AOA)
####
#The AOA can also be calculated without a trained model.
#All variables are weighted equally in this case:
####
AOA <- aoa(studyArea,train=trainDat,variables=variables)
####
# The AOA can also be used for models trained via mlr3 (parameters have to be assigned manually):
####
library(mlr3)
library(mlr3learners)
library(mlr3spatial)
library(mlr3spatiotempcv)
library(mlr3extralearners)
# initiate and train model:
train_df <- trainDat[, c("DEM","NDRE.Sd","TWI", "VW")]
backend <- as_data_backend(train_df)
task <- as_task_regr(backend, target = "VW")
lrn <- lrn("regr.randomForest", importance = "mse")
lrn$train(task)
# cross-validation folds
rsmp_cv <- rsmp("cv", folds = 5L)$instantiate(task)
## predict:
prediction <- predict(studyArea,lrn$model,na.rm=TRUE)
### Estimate AOA
AOA <- aoa(studyArea,
           train = as.data.frame(task$data()),
           variables = task$feature_names,
           weight = data.frame(t(lrn$importance())),
           CVtest = rsmp_cv$instance[order(row_id)]$fold)
## End(Not run)
```

[Package *CAST* version 0.8.1 Index]