get_all_performance_boot {Indicator} | R Documentation |
Function to evaluate different NA imputation methods with bootstrap
Description
The get_all_performance_boot function evaluates different methods of imputing missing values in a dataset. The evaluation is performed using bootstrapping to ensure the robustness of the results.
Usage
get_all_performance_boot(data, to_impute, regressors, nb = 1)
Arguments
data |
dataframe with rows = observations and columns = quantitative variables |
to_impute |
string, name of the variable containing the missing values (NAs) to impute |
regressors |
vector of strings with the names of the variables used by the 1st (linear regression) and 4th (hot deck) imputation methods |
nb |
number of bootstrap samples |
Details
The function calculates several performance metrics, including R^2 (see Note), for each imputation method.
Supported Imputation Methods:
1. Linear Regression Imputation (lm_imputation): it uses a linear regression model to predict and impute missing values
2. Median Imputation (median_imputation): it replaces missing values with the median of observed values
3. Mean Imputation (mean_imputation): it replaces missing values with the mean of observed values
4. Hot Deck Imputation (hot_deck_imputation): it replaces missing values with similar observed values
5. Expectation-Maximization Imputation (EM_imputation): it uses the Expectation-Maximization algorithm to estimate and impute missing values
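To make the simpler methods in this list concrete, here is a minimal base R sketch of mean, median, and linear regression imputation. The helpers impute_constant and impute_lm are hypothetical names introduced for illustration; they are not the package's mean_imputation, median_imputation, or lm_imputation functions, whose internals may differ.

```r
# Illustrative sketch only: impute_constant() and impute_lm() are hypothetical
# helpers, not functions from the Indicator package.
data("airquality")

# Replace NAs in a numeric vector with a constant computed from observed values
impute_constant <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Replace NAs in `target` with predictions from a linear model on `regressors`
impute_lm <- function(data, target, regressors) {
  fit <- lm(reformulate(regressors, response = target), data = data)
  is_missing <- is.na(data[[target]])
  data[[target]][is_missing] <- predict(fit, newdata = data[is_missing, , drop = FALSE])
  data[[target]]
}

ozone_mean   <- impute_constant(airquality$Ozone, mean)
ozone_median <- impute_constant(airquality$Ozone, median)
ozone_lm     <- impute_lm(airquality, "Ozone", c("Wind", "Temp"))
```

The regression variant assumes the regressors themselves have no missing values in the rows to impute, which holds for Wind and Temp in airquality.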
In summary, the function evaluates different methods of imputing missing values using bootstrapping and calculates performance metrics for each method.
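The bootstrap evaluation can be pictured with the sketch below. It rests on assumptions that this page does not confirm: each replicate resamples the complete rows with replacement, hides a random 20% of the known values, imputes them, and scores the result with RMSE, which is chosen here purely as an illustrative metric. The names bootstrap_rmse, mean_impute, and median_impute are hypothetical, and the package's internal procedure may differ.

```r
# Hypothetical evaluation scheme; the package's internal procedure may differ.
data("airquality")
set.seed(1)

bootstrap_rmse <- function(data, target, impute_fun, nb = 10, frac_missing = 0.2) {
  # Keep only rows where the target is known, so the truth is available
  complete <- data[!is.na(data[[target]]), , drop = FALSE]
  rmse <- numeric(nb)
  for (b in seq_len(nb)) {
    # Bootstrap sample: draw rows with replacement
    boot <- complete[sample(nrow(complete), replace = TRUE), , drop = FALSE]
    # Hide a random subset of known values, then impute them
    hidden <- sample(nrow(boot), size = ceiling(frac_missing * nrow(boot)))
    truth <- boot[[target]][hidden]
    boot[[target]][hidden] <- NA
    imputed <- impute_fun(boot[[target]])
    rmse[b] <- sqrt(mean((imputed[hidden] - truth)^2))
  }
  # Average the metric over all bootstrap replicates
  mean(rmse)
}

mean_impute   <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
median_impute <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }

# Rows = methods, columns = performance averaged over bootstraps
perf <- data.frame(
  RMSE = c(
    bootstrap_rmse(airquality, "Ozone", mean_impute),
    bootstrap_rmse(airquality, "Ozone", median_impute)
  ),
  row.names = c("mean_imputation", "median_imputation")
)
```

Averaging over replicates reduces the dependence of the score on any single random choice of hidden values.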
Value
A dataframe of performance measures with rows = imputation methods and columns = performance metrics averaged over the bootstrap samples.
Note
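For the Median Imputation and Mean Imputation methods, it is not possible to calculate the R^2 value. Both methods replace every missing value with the same constant, so the standard deviation of the imputed data is zero and the denominator of the following R^2 formula vanishes:
$$R^2 = \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{(P_i - \bar{P})(O_i - \bar{O})}{\sigma_P \, \sigma_O} \right]^2$$
where:
- $N$ is the number of imputations,
- $O_i$ is the i-th observed data point,
- $P_i$ is the i-th imputed data point,
- $\bar{O}$ is the average of the observed data,
- $\bar{P}$ is the average of the imputed data,
- $\sigma_P$ is the standard deviation of the imputed data,
- $\sigma_O$ is the standard deviation of the observed data.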
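The following sketch computes R^2 from the quantities defined in this Note and shows why a constant imputation yields no usable value. The helper r_squared is hypothetical, and it assumes population (1/N) standard deviations, which this page does not specify.

```r
# r_squared() is a hypothetical helper written from the definitions in this Note.
r_squared <- function(observed, imputed) {
  n <- length(observed)
  # Assumption: population (1/N) standard deviations, matching the 1/N factor
  sd_o <- sqrt(mean((observed - mean(observed))^2))
  sd_p <- sqrt(mean((imputed - mean(imputed))^2))
  covariance <- sum((imputed - mean(imputed)) * (observed - mean(observed))) / n
  (covariance / (sd_p * sd_o))^2
}

observed <- c(10, 20, 30, 40)

# Imputed values that vary: R^2 is well defined
r2_varying <- r_squared(observed, c(12, 18, 33, 41))

# Constant imputed values (as in mean or median imputation): 0 / 0 gives NaN
r2_constant <- r_squared(observed, rep(mean(observed), 4))
```

Under the 1/N assumption, the result for varying imputed values coincides with the squared Pearson correlation returned by cor().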
References
OECD/European Union/EC-JRC (2008), Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, Paris, <https://doi.org/10.1787/9789264043466-en>
Examples
# Use Wind and Temp (columns 3 and 4) as regressors to impute Ozone
data("airquality")
regressors <- colnames(airquality[, c(3, 4)])
suppressWarnings(get_all_performance_boot(data = airquality, to_impute = "Ozone", regressors = regressors, nb = 100))