cv.rfsi {meteo}    R Documentation

Nested k-fold cross-validation for Random Forest Spatial Interpolation (RFSI)

Description

Function for nested k-fold cross-validation of Random Forest Spatial Interpolation (RFSI) (Sekulić et al. 2020). It is based on the rfsi, pred.rfsi, and tune.rfsi functions. Currently, only spatial (leave-location-out) cross-validation is implemented. Temporal and spatio-temporal cross-validation will be implemented in the future.

Usage

cv.rfsi(formula,
        data,
        data.staid.x.y.z = NULL,
        use.idw = FALSE,
        s.crs = NA,
        p.crs = NA,
        tgrid,
        tgrid.n=10,
        tune.type = "LLO",
        k = 5,
        seed=42,
        out.folds,
        in.folds,
        acc.metric,
        output.format = "data.frame",
        cpus = detectCores()-1,
        progress = 1,
        soil3d = FALSE,
        no.obs = 'increase',
        ...)

Arguments

formula

formula; Formula specifying the target variable and covariates (without nearest observations and distances to them). If z~1, an RFSI model using only nearest observations and distances to them as covariates will be cross-validated.

data

sf-class, sftime-class, SpatVector-class or data.frame; Contains the target variable (observations) and covariates used for making an RFSI model. If it is a data.frame object, it should have the following columns: station ID (staid), longitude (x), latitude (y), 3rd component - time, depth, ... (z) of the observation, observation value (obs) and covariates (cov1, cov2, ...). If covariates are missing, an RFSI model using only nearest observations and distances to them as covariates (formula=z~1) will be cross-validated.

data.staid.x.y.z

numeric or character vector; Positions or names of the station ID (staid), longitude (x), latitude (y) and 3rd component (z) columns in the data.frame object (e.g. c(1,2,3,4)). If data is an sf-class, sftime-class, or SpatVector-class object, data.staid.x.y.z is used to point to the staid and z positions. Set the z position to NA (e.g. c(1,2,3,NA)) or omit it (e.g. c(1,2,3)) for spatial interpolation. Default is NULL.

use.idw

boolean; IDW prediction as covariate - whether IDW predictions from the n.obs nearest observations will be calculated and tuned (see function near.obs). Default is FALSE.

s.crs

st_crs or crs; Source CRS of data. If data already contains a CRS, s.crs will be overwritten. Default is NA.

p.crs

st_crs or crs; Projection CRS for data reprojection. If NA, s.crs will be used for distance calculation. Note that observations should be in a projected CRS for finding nearest observations based on Euclidean distances (see function near.obs). Default is NA.

tgrid

data.frame; Possible tuning parameters for nested folds. The column names are the same as the tuning parameter names. Possible tuning parameters are: n.obs, num.trees, mtry, min.node.size, sample.fraction, splitrule, idw.p, and depth.range.

tgrid.n

numeric; Number of randomly chosen tgrid combinations for nested tuning of RFSI. If larger than the number of combinations in tgrid, it will be set to length(tgrid).

tune.type

character; Type of nested cross-validation: leave-location-out ("LLO"), leave-time-out ("LTO") - TO DO, and leave-location-time-out ("LLTO") - TO DO. Default is "LLO".

k

numeric; Number of random outer and inner folds (i.e. for cross-validation and nested tuning) that will be created with CreateSpacetimeFolds function. Default is 5.

seed

numeric; Random seed that will be used to generate outer and inner folds with CreateSpacetimeFolds function. Default is 42.

out.folds

numeric or character vector or value; Indicates the outer folds column (if a value) or the outer fold of each row of data (if a vector) used for cross-validation. If missing, outer folds will be created with CreateSpacetimeFolds function.

in.folds

numeric or character vector or value; Indicates the inner folds column (if a value) or the inner fold of each row of data (if a vector) used for nested tuning. If missing, inner folds will be created with CreateSpacetimeFolds function.

acc.metric

character; Accuracy metric that will be used as the criterion for choosing an optimal RFSI model in nested tuning. Possible values for regression: "ME", "MAE", "NMAE", "RMSE" (default), "NRMSE", "R2", "CCC". Possible values for classification: "Accuracy", "Kappa" (default), "AccuracyLower", "AccuracyUpper", "AccuracyNull", "AccuracyPValue", "McnemarPValue".

output.format

character; Format of the output, data.frame (default), sf-class, sftime-class, or SpatVector-class.

cpus

numeric; Number of processing units. Default is detectCores()-1.

progress

numeric; Whether and how progress is shown. 0 is no progress bar, 1 is outer folds results, 2 is + inner folds results, 3 is + prediction progress bar. Default is 1.

soil3d

logical; If TRUE, 3D soil modelling is performed and the near.obs.soil function is used for finding the n nearest observations and distances to them. In this case, the z position of data.staid.x.y.z points to the depth column. Default is FALSE.

no.obs

character; Possible values are increase (default) and exactly. If set to increase, when there are fewer than n.obs observations in depth.range for a specific location, the depth.range is increased (multiplied by 2, 3, ...) until the number of observations is larger than or equal to n.obs. If set to exactly, the function will raise an error when it comes to the first location with fewer than n.obs observations in the specified depth.range (see function near.obs.soil).

...

Further arguments passed to ranger.

Value

A data.frame, sf-class, sftime-class, or SpatVector-class object (depends on output.format argument), with columns:

obs

Observations.

pred

Predictions from cross-validation.

folds

Folds used for cross-validation.

Author(s)

Aleksandar Sekulic asekulic@grf.bg.ac.rs

References

Sekulić, A., Kilibarda, M., Heuvelink, G. B., Nikolić, M. & Bajat, B. Random Forest Spatial Interpolation. Remote Sens. 12, 1687, https://doi.org/10.3390/rs12101687 (2020).

See Also

near.obs, rfsi, pred.rfsi, tune.rfsi

Examples

library(CAST)
library(doParallel)
library(ranger)
library(sp)
library(sf)
library(terra)
library(meteo)

# prepare data
demo(meuse, echo=FALSE)
meuse <- meuse[complete.cases(meuse@data),]
data = st_as_sf(meuse, coords = c("x", "y"), crs = 28992, agr = "constant")
fm.RFSI <- as.formula("zinc ~ dist + soil + ffreq")

# making tgrid
n.obs <- 1:6
min.node.size <- 2:10
sample.fraction <- seq(1, 0.632, -0.05) # 0.632 without / 1 with replacement
splitrule <- "variance"
ntree <- 250 # 500
mtry <- 3:(2+2*max(n.obs))
tgrid = expand.grid(min.node.size=min.node.size, num.trees=ntree,
                    mtry=mtry, n.obs=n.obs, sample.fraction=sample.fraction)
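
# Side note on tgrid.n, as a sketch under assumptions: tgrid_demo is a
# hypothetical small grid, and the sampling below only illustrates drawing
# tgrid.n rows at random; it is not necessarily how cv.rfsi does it internally.
tgrid_demo <- expand.grid(mtry = 3:4, n.obs = 1:3)
nrow(tgrid_demo)   # number of candidate combinations (rows)
length(tgrid_demo) # number of tuning parameters (columns), not combinations
set.seed(42)
tgrid_picked <- tgrid_demo[sample(nrow(tgrid_demo), 2), ] # e.g. tgrid.n = 2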

# Cross-validation of RFSI
rfsi_cv <- cv.rfsi(formula=fm.RFSI, # without nearest obs
                   data = data,
                   tgrid = tgrid, # combinations for tuning
                   tgrid.n = 2, # number of randomly selected combinations from tgrid for tuning
                   tune.type = "LLO", # Leave-Location-Out CV
                   k = 5, # number of folds
                   seed = 42,
                   acc.metric = "RMSE", # R2, CCC, MAE
                   output.format = "sf", # "data.frame", # "SpatVector",
                   cpus=2, # detectCores()-1,
                   progress=1,
                   importance = "impurity") # ranger parameter

summary(rfsi_cv)
rfsi_cv$dif <- rfsi_cv$obs - rfsi_cv$pred
plot(rfsi_cv["dif"])
plot(rfsi_cv["obs"])
acc.metric.fun(rfsi_cv$obs, rfsi_cv$pred, "R2")
acc.metric.fun(rfsi_cv$obs, rfsi_cv$pred, "RMSE")
acc.metric.fun(rfsi_cv$obs, rfsi_cv$pred, "MAE")
acc.metric.fun(rfsi_cv$obs, rfsi_cv$pred, "CCC")
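
# Per-fold RMSE, as a minimal sketch. rmse_by_fold is a hypothetical helper
# (not part of meteo); it assumes the obs, pred and folds columns listed
# under Value and uses base R only. The toy vectors below are made up.
rmse_by_fold <- function(obs, pred, folds) {
  sq_err <- (obs - pred)^2           # squared error per observation
  sqrt(tapply(sq_err, folds, mean))  # root of the mean squared error per fold
}
rmse_toy <- rmse_by_fold(obs = c(1, 2, 3, 4), pred = c(1, 3, 3, 6),
                         folds = c(1, 1, 2, 2))
# applied to the result above:
# rmse_by_fold(rfsi_cv$obs, rfsi_cv$pred, rfsi_cv$folds)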


[Package meteo version 2.0-3 Index]