recalibrate {recalibratiNN}    R Documentation

Generates Recalibrated Samples of the Predictive Distribution

Description

This function provides recalibration techniques for regression models that assume a Gaussian predictive distribution, i.e., models fitted with the Mean Squared Error (MSE) as the loss function. Based on the work of Torres R. et al. (2024), it supports both local and global recalibration approaches and produces samples from the recalibrated predictive distribution. A detailed algorithm can also be found in Musso C. (2023).

Usage

recalibrate(
  yhat_new,
  pit_values,
  mse,
  space_cal = NULL,
  space_new = NULL,
  type = c("local", "global"),
  p_neighbours = 0.1,
  epsilon = 0
)

Arguments

yhat_new

Numeric vector with predicted response values for the new (or test) set.

pit_values

Numeric vector of Global Probability Integral Transform (PIT) values calculated on the calibration set. We recommend using the PIT_global function.
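
For example (a minimal sketch; y_cal, y_hat_cal and MSE_cal stand for your calibration responses, predictions and MSE):

pit <- PIT_global(ycal = y_cal, yhat = y_hat_cal, mse = MSE_cal)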

mse

Mean Squared Error calculated from the calibration/validation set.
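
In the Gaussian setting assumed here, this is the mean squared residual on the calibration set, e.g.:

MSE_cal <- mean((y_hat_cal - y_cal)^2)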

space_cal

Numeric matrix or data frame representing the covariates/features of the calibration/validation set, or any intermediate representation (like an intermediate layer of a neural network).

space_new

Similar to space_cal, but for a new set of covariates/features, ensuring they are in the same space as those in space_cal for effective local recalibration.

type

Character string to choose between 'local' and 'global' recalibration.

p_neighbours

Proportion (0,1] of the calibration dataset to be considered for determining the number of neighbors in the KNN method. Default is set to 0.1. With p_neighbours=1, calibration is global but weighted by distance.
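
For instance, a rough sketch of the implied neighbor count (the exact rounding rule used internally is an assumption):

n_cal <- 200                 # number of calibration points
k <- ceiling(0.1 * n_cal)    # default p_neighbours = 0.1 -> 20 neighbors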

epsilon

Numeric value for approximation in the K-nearest neighbors (KNN) method. Default is 0, indicating exact distances.

Details

The calibration technique implemented here draws inspiration from Approximate Bayesian Computation and the Inverse Transform Theorem, allowing recalibration either locally or globally. The global method employs a uniform kernel, while the local method employs an Epanechnikov kernel.
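
For intuition, the Epanechnikov kernel downweights more distant neighbors; a minimal sketch of such a weight function (not the package's exact internals, and the normalization of distances to [0, 1] is an assumption):

epanechnikov_wt <- function(d) {
  # K(u) = 0.75 * (1 - u^2) for |u| <= 1, zero otherwise
  ifelse(abs(d) <= 1, 0.75 * (1 - d^2), 0)
}
epanechnikov_wt(c(0, 0.5, 1))  # 0.7500 0.5625 0.0000 -> nearer neighbors get larger weights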

Note that the least squares method yields a probabilistic interpretation only if the modeled output follows a normal distribution; this assumption underlies the implementation of this function.
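
Concretely, the predictive distribution at a point is then taken to be N(yhat, mse), so the global PIT values amount to Gaussian CDF evaluations; a sketch of the idea (PIT_global is assumed to compute the equivalent of):

pit <- pnorm(y_cal, mean = y_hat_cal, sd = sqrt(MSE_cal))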

The local recalibration method is expected to improve the predictive performance of the model, especially when the model fails to capture the heteroscedasticity of the data. However, there is a trade-off between the refinement of localization and the Monte Carlo error, which can be controlled through the number of neighbors: a more localized recalibration captures local changes better, but the Monte Carlo error increases because fewer neighbors are used.

When p_neighbours=1, recalibration is performed using the entire calibration dataset but with distance-weighted contributions.
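
For example, different degrees of localization can be compared (a sketch reusing the objects constructed in the Examples section below; arguments follow the order shown in Usage):

rec_fine   <- recalibrate(y_hat_new, pit, MSE_cal, x_cal, x_new,
                          type = "local", p_neighbours = 0.05)
rec_smooth <- recalibrate(y_hat_new, pit, MSE_cal, x_cal, x_new,
                          type = "local", p_neighbours = 0.5)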

Value

A list containing the calibrated predicted mean and variance, along with samples from the recalibrated predictive distribution and their respective weights calculated using an Epanechnikov kernel over the distances obtained from KNN.

References

Torres R, Nott DJ, Sisson SA, Rodrigues T, Reis JG, Rodrigues GS (2024). “Model-Free Local Recalibration of Neural Networks.” arXiv preprint arXiv:2403.05756. doi:10.48550/arXiv.2403.05756.

Musso C (2023). “Recalibration of Gaussian Neural Network Regression Models: The RecalibratiNN Package.” Undergraduate Thesis (Bachelor in Statistics), University of Brasília. Available at: https://bdm.unb.br/handle/10483/38504.

Examples


set.seed(42)  # for reproducible simulated data
n <- 1000
split <- 0.8

# Auxiliary functions defining the true mean and standard deviation
mu <- function(x1) {
  10 + 5 * x1^2
}

sigma_v <- function(x1) {
  30 * x1
}

# Generating heteroscedastic data.
x <- runif(n, 1, 10)
y <- rnorm(n, mu(x), sigma_v(x))

# Train set
x_train <- x[1:(n*split)]
y_train <- y[1:(n*split)]

# Calibration/Validation set.
x_cal <- x[(n*split+1):n]
y_cal <- y[(n*split+1):n]

# New observations or the test set.
x_new <- runif(n/5, 1, 10)

# Fitting a simple linear regression, which will not capture the heteroscedasticity
model <- lm(y_train ~ x_train)

# Predictions and MSE on the calibration set
y_hat_cal <- predict(model, newdata = data.frame(x_train = x_cal))
MSE_cal <- mean((y_hat_cal - y_cal)^2)

# Predictions for the new set
y_hat_new <- predict(model, newdata = data.frame(x_train = x_new))

# Global PIT values on the calibration set
pit <- PIT_global(ycal = y_cal, yhat = y_hat_cal, mse = MSE_cal)

rec <- recalibrate(
  space_cal = x_cal,
  space_new = x_new,
  yhat_new = y_hat_new,
  pit_values = pit,
  mse = MSE_cal,
  type = "local"
)


[Package recalibratiNN version 0.3.0]