get_all_performance_boot {Indicator}    R Documentation

Evaluate different missing-value imputation methods with bootstrap

Description

The get_all_performance_boot function evaluates different methods of imputing missing values in a dataset. The evaluation is performed using bootstrapping to make the results more robust.

Usage

get_all_performance_boot(data, to_impute, regressors, nb = 1)

Arguments

data

data frame with rows = observations and columns = quantitative variables

to_impute

string, name of the variable that contains the missing values (NA) to impute

regressors

character vector with the names of the variables used by the 1st and 4th imputation methods (linear regression and hot deck)

nb

number of bootstrap samples (default: 1)

Details

The function calculates performance metrics, such as:

- R^2 = \left[ \frac{1}{N} \frac{\sum_{i=1}^N (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2,

- RMSE = \left( \frac{1}{N} \sum_{i=1}^N (P_i - O_i)^2 \right)^{1/2}

and

- MAE = \frac{1}{N} \sum_{i=1}^N |P_i - O_i|

for each imputation method.
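As a minimal sketch of how these three metrics follow from the formulas, the code below computes them in base R. The vectors O and P and the helper pop_sd are hypothetical and only for illustration; they are not part of the package. The standard deviations are assumed to be population standard deviations (dividing by N), in which case R^2 equals the squared Pearson correlation, cor(P, O)^2.

```r
# Hypothetical observed (O) and imputed (P) values, for illustration only
O <- c(41, 36, 12, 18, 28)
P <- c(38, 30, 15, 20, 25)

# Population standard deviation (divides by N, matching the 1/N in the formulas)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# R^2: squared, normalised cross-product of the centred values
r2 <- (mean((P - mean(P)) * (O - mean(O))) / (pop_sd(P) * pop_sd(O)))^2

# RMSE: square root of the mean squared difference
rmse <- sqrt(mean((P - O)^2))

# MAE: mean absolute difference
mae <- mean(abs(P - O))

c(R2 = r2, RMSE = rmse, MAE = mae)
```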

Supported Imputation Methods:

1. Linear Regression Imputation (lm_imputation): it uses a linear regression model to predict and impute missing values

2. Median Imputation (median_imputation): it replaces missing values with the median of observed values

3. Mean Imputation (mean_imputation): it replaces missing values with the mean of observed values

4. Hot Deck Imputation (hot_deck_imputation): it replaces missing values with similar observed values

5. Expectation-Maximization Imputation (EM_imputation): it uses the Expectation-Maximization algorithm to estimate and impute missing values

The function evaluates each of these imputation methods using bootstrapping and calculates the performance metrics for each method.
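To make the first three methods concrete, here is a rough base R sketch of mean, median and linear regression imputation on the airquality data. It is written independently of the package and does not reproduce the package's own lm_imputation, median_imputation or mean_imputation functions; the object names are assumptions. Hot deck and EM imputation are left out because they need more machinery than a short sketch allows.

```r
data("airquality")
x    <- airquality$Ozone
miss <- is.na(x)

# Mean and median imputation: fill every gap with one constant
x_mean   <- replace(x, miss, mean(x, na.rm = TRUE))
x_median <- replace(x, miss, median(x, na.rm = TRUE))

# Linear regression imputation: fit on the rows where Ozone is observed
# (lm drops rows with NA by default), then predict the missing rows
fit  <- lm(Ozone ~ Wind + Temp, data = airquality)
x_lm <- replace(x, miss, predict(fit, newdata = airquality[miss, ]))
```

Wind and Temp are used as regressors here because they have no missing values in airquality, so a prediction is available for every row where Ozone is missing.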

Value

A data frame of performance measures with rows = imputation methods and columns = performance metrics, averaged over the bootstrap samples.
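The idea of averaging a metric over bootstrap samples can be sketched as follows. The resampling scheme below is an assumption made for illustration (resample complete rows, hide 20% of the known values, impute them with the mean, compare with the hidden truth); the package's internal procedure may differ.

```r
data("airquality")
set.seed(1)  # for reproducibility
complete <- airquality[!is.na(airquality$Ozone), ]
nb <- 100

rmse_boot <- numeric(nb)
for (b in seq_len(nb)) {
  # Resample rows with replacement
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  # Hide 20% of the known Ozone values (an arbitrary, assumed share)
  hide  <- sample(nrow(boot), size = round(0.2 * nrow(boot)))
  truth <- boot$Ozone[hide]
  # Impute the hidden values with the mean of the remaining ones
  imputed <- rep(mean(boot$Ozone[-hide]), length(hide))
  rmse_boot[b] <- sqrt(mean((imputed - truth)^2))
}

# Performance averaged over the bootstrap samples
mean(rmse_boot)
```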

Note

For the Median Imputation and Mean Imputation methods, it is not possible to calculate the R^2 value. Both methods replace every missing value with the same constant, so the standard deviation of the imputed data, \sigma_{P}, is zero and the denominator of the following R^2 formula vanishes:

R^2 = \left[ \frac{1}{N} \frac{\sum_{i=1}^N (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2

where:

- N is the number of imputations,

- O_i is the i-th observed data point,

- P_i is the i-th imputed data point,

- \bar{O} is the average of the observed data,

- \bar{P} is the average of the imputed data,

- \sigma_{P} is the standard deviation of the imputed data,

- \sigma_{O} is the standard deviation of the observed data.
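The zero standard deviation can be checked directly in base R. The vectors below are hypothetical, and P stands in for a mean-style imputation in which every imputed value is the same constant.

```r
# Hypothetical observed values and a constant (mean-style) imputation
O <- c(41, 36, 12, 18, 28)
P <- rep(mean(O), length(O))

# All imputed values are identical, so their standard deviation is zero
sd(P)

# cor() warns that the standard deviation is zero and returns NA,
# so the squared correlation is NA as well
r2 <- suppressWarnings(cor(P, O))^2

# RMSE and MAE do not involve sigma_P, so they remain defined
rmse <- sqrt(mean((P - O)^2))
mae  <- mean(abs(P - O))
```

The warning raised by cor() in this situation is one plausible reason for wrapping a call in suppressWarnings.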

References

OECD/European Union/EC-JRC (2008), Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, Paris, <https://doi.org/10.1787/9789264043466-en>

Examples

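data("airquality")
# Use Wind and Temp (columns 3 and 4) as regressors
regressors <- colnames(airquality[, c(3, 4)])
# Evaluate the imputation methods for Ozone over 100 bootstrap samples
suppressWarnings(get_all_performance_boot(data = airquality, to_impute = "Ozone", regressors = regressors, nb = 100))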


[Package Indicator version 0.1.2 Index]