R: Function to evaluate different nan imputation methods with...

get_all_performance_boot {Indicator}

R Documentation

Function to evaluate different nan imputation methods with bootstrap

Description

The get_all_performance_boot function evaluates different methods of imputing missing values in a dataset. The evaluation uses bootstrapping to make the results more robust.

Usage

get_all_performance_boot(data, to_impute, regressors, nb = 1)

Arguments

`data`	data frame with rows = observations and columns = quantitative variables
`to_impute`	string, name of the variable containing the missing values (NA) to impute
`regressors`	character vector with the names of the variables used by the 1st and 4th imputation methods (linear regression and hot deck)
`nb`	number of bootstrap samples (default 1)

Details

The function calculates performance metrics, such as:

- R^2 = \left[ \frac{1}{N} \frac{\sum_{i=1}^{N} (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2,

- RMSE = \left( \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)^2 \right)^{1/2}

and

- MAE = \frac{1}{N} \sum_{i=1}^{N} |P_i - O_i|

for each imputation method.
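As an illustration of the three metrics above, here is a minimal base R sketch. The helper names `perf_metrics` and `sd_pop` are hypothetical and are not part of the package. The sketch assumes that \sigma_{P} and \sigma_{O} are population (1/N) standard deviations, which makes R^2 equal to the squared Pearson correlation; the package may compute them differently.

```r
# Hypothetical helper: metrics for imputed values P against observed values O.
perf_metrics <- function(P, O) {
  # Assumption: population (1/N) standard deviation, matching the 1/N sum in R^2
  sd_pop <- function(x) sqrt(mean((x - mean(x))^2))
  r2   <- (mean((P - mean(P)) * (O - mean(O))) / (sd_pop(P) * sd_pop(O)))^2
  rmse <- sqrt(mean((P - O)^2))
  mae  <- mean(abs(P - O))
  c(R2 = r2, RMSE = rmse, MAE = mae)
}

# Made-up values, for illustration only
O <- c(1, 2, 3, 4)
P <- c(1.5, 2.5, 2.5, 4.5)
perf_metrics(P, O)  # RMSE and MAE are both 0.5 here
```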

Supported Imputation Methods:

1. Linear Regression Imputation (lm_imputation): it uses a linear regression model to predict and impute missing values

2. Median Imputation (median_imputation): it replaces missing values with the median of observed values

3. Mean Imputation (mean_imputation): it replaces missing values with the mean of observed values

4. Hot Deck Imputation (hot_deck_imputation): it replaces missing values with similar observed values

5. Expectation-Maximization Imputation (EM_imputation): it uses the Expectation-Maximization algorithm to estimate and impute missing values

Each method is evaluated on the bootstrap samples, and the performance metrics are calculated for each method.
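To make the simpler methods concrete, the following base R sketch applies mean, median and linear regression imputation to the Ozone column of airquality. It does not call the package's lm_imputation, median_imputation or mean_imputation functions, whose internals are not shown here. The choice of Wind and Temp as regressors is an assumption made for this example.

```r
data("airquality")
x <- airquality$Ozone
miss <- is.na(x)  # positions of the missing values

# Mean and median imputation: every missing value gets the same constant
mean_imp <- x
mean_imp[miss] <- mean(x, na.rm = TRUE)
median_imp <- x
median_imp[miss] <- median(x, na.rm = TRUE)

# Linear regression imputation: lm() fits on the complete rows only,
# then predict() fills in the rows where Ozone is missing
fit <- lm(Ozone ~ Wind + Temp, data = airquality)
lm_imp <- x
lm_imp[miss] <- predict(fit, newdata = airquality[miss, ])
```

Wind and Temp contain no missing values in airquality, so every missing Ozone value receives a prediction.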

Value

It returns a data frame of performance measures, with one row per imputation method and one column per performance metric, averaged over the bootstrap samples.
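The package's resampling scheme is not described on this page. The sketch below shows one plausible way such a table could be produced, under the assumption that known values are masked within each bootstrap sample, imputed, and compared with the truth. The function `one_boot`, the mask size of 20 and the restriction to two methods are all choices made for this example.

```r
set.seed(1)
data("airquality")
complete <- airquality[!is.na(airquality$Ozone), ]  # rows where the truth is known
nb <- 20                                            # number of bootstrap samples

# Hypothetical helper: one bootstrap sample, returning a methods x metrics matrix
one_boot <- function() {
  boot <- complete[sample(nrow(complete), replace = TRUE), ]
  hide <- sample(nrow(boot), size = 20)             # values masked as if missing
  O <- boot$Ozone[hide]
  fit <- lm(Ozone ~ Wind + Temp, data = boot[-hide, ])
  P_lm <- predict(fit, newdata = boot[hide, ])
  P_mean <- rep(mean(boot$Ozone[-hide]), length(O))
  rbind(
    lm_imputation   = c(RMSE = sqrt(mean((P_lm - O)^2)),   MAE = mean(abs(P_lm - O))),
    mean_imputation = c(RMSE = sqrt(mean((P_mean - O)^2)), MAE = mean(abs(P_mean - O)))
  )
}

# Average the per-sample metric tables element-wise
results <- Reduce(`+`, replicate(nb, one_boot(), simplify = FALSE)) / nb
as.data.frame(results)
```

The result has rows = methods and columns = metrics, which is the same layout as the returned data frame described above.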

Note

For the methods Median Imputation and Mean Imputation, it is not possible to calculate the R^2 value. Both methods replace every missing value with the same constant, so the standard deviation of the imputed data, \sigma_{P}, is zero and the denominator of the R^2 formula vanishes:

R^2 = \left[ \frac{1}{N} \frac{\sum_{i=1}^{N} (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2

where:

- N is the number of imputations,

- O_i is the i-th observed data point,

- P_i is the i-th imputed data point,

- \bar{O} is the average of the observed data,

- \bar{P} is the average of the imputed data,

- \sigma_{P} is the standard deviation of the imputed data,

- \sigma_{O} is the standard deviation of the observed data.
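The zero standard deviation can be checked directly in base R. The observed values below are made up for illustration.

```r
O <- c(12, 18, 23, 41)        # hypothetical observed values
P <- rep(mean(O), length(O))  # mean imputation: the same constant everywhere
sd(P)                         # 0
cor(P, O)^2                   # NA, with a warning that the standard deviation is zero
```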

References

OECD/European Union/EC-JRC (2008), Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, Paris, <https://doi.org/10.1787/9789264043466-en>

Examples

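data("airquality")
# Use Wind and Temp (columns 3 and 4) as regressors
regressors <- colnames(airquality[, c(3, 4)])
# Evaluate all imputation methods for Ozone over 100 bootstrap samples
suppressWarnings(get_all_performance_boot(data = airquality, to_impute = "Ozone", regressors = regressors, nb = 100))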

[Package Indicator version 0.1.2 Index]