get_all_performance {Indicator} | R Documentation
Function to evaluate different missing value (NA) imputation methods
Description
The get_all_performance function evaluates different methods of imputing missing values in a dataset.
Usage
get_all_performance(data, to_impute, regressors)
Arguments
data | data frame with rows = observations and columns = quantitative variables
to_impute | string, name of the variable containing the missing values (NA) to impute
regressors | character vector with the names of the variables used by the 1st and 4th imputation methods (linear regression and hot deck)
Details
The function calculates the following performance metrics for each imputation method (the symbols are defined in the Note section):
- R^2 = \left[ \frac{1}{N} \cdot \frac{\sum_{i=1}^N (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2,
- RMSE = \left( \frac{1}{N} \sum_{i=1}^N (P_i - O_i)^2 \right)^{1/2}, and
- MAE = \frac{1}{N} \sum_{i=1}^N |P_i - O_i|
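As a rough illustration of the three metrics above, the following base R sketch computes them for two small made-up vectors. The helper name performance_metrics and the sample values are assumptions for illustration only and are not part of the package. The population (1/N) form of the standard deviations is chosen here to match the 1/N factor in the R^2 formula.

```r
# Hypothetical helper, not part of the Indicator package.
# P = imputed values, O = observed (true) values, both numeric, same length.
performance_metrics <- function(P, O) {
  stopifnot(length(P) == length(O), length(P) > 0)
  # Population (1/N) standard deviations, matching the 1/N factor in R^2.
  sigma_P <- sqrt(mean((P - mean(P))^2))
  sigma_O <- sqrt(mean((O - mean(O))^2))
  # R^2 is undefined when either standard deviation is zero.
  r2 <- if (sigma_P > 0 && sigma_O > 0) {
    (mean((P - mean(P)) * (O - mean(O))) / (sigma_P * sigma_O))^2
  } else {
    NA_real_
  }
  rmse <- sqrt(mean((P - O)^2))  # root mean squared error
  mae <- mean(abs(P - O))        # mean absolute error
  c(R2 = r2, RMSE = rmse, MAE = mae)
}

# Made-up example values
O <- c(10, 20, 30, 40)
P <- c(12, 18, 33, 39)
m <- performance_metrics(P, O)
print(m)
```

With this definition, the R^2 value should agree with the squared Pearson correlation cor(P, O)^2.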
Supported Imputation Methods:
1. Linear Regression Imputation (lm_imputation): it uses a linear regression model to predict and impute missing values
2. Median Imputation (median_imputation): it replaces missing values with the median of observed values
3. Mean Imputation (mean_imputation): it replaces missing values with the mean of observed values
4. Hot Deck Imputation (hot_deck_imputation): it replaces missing values with similar observed values
5. Expectation-Maximization Imputation (EM_imputation): it uses the Expectation-Maximization algorithm to estimate and impute missing values
The function applies each of these methods and calculates the performance metrics for each one.
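To make the simpler methods concrete, the sketch below imputes the missing Ozone values of the airquality dataset with the mean, the median, and a linear regression on Wind and Temp, using base R only. It is not the package's internal code: the variable names are illustrative assumptions, and hot deck and EM imputation are left out because they typically rely on additional packages.

```r
data("airquality")
df <- airquality
miss <- is.na(df$Ozone)  # rows where Ozone is missing

# Mean imputation: one constant for every missing value
ozone_mean <- df$Ozone
ozone_mean[miss] <- mean(df$Ozone, na.rm = TRUE)

# Median imputation: same idea, using the median of the observed values
ozone_median <- df$Ozone
ozone_median[miss] <- median(df$Ozone, na.rm = TRUE)

# Linear regression imputation: fit on the complete rows
# (lm drops rows with NA by default), then predict the missing ones
fit <- lm(Ozone ~ Wind + Temp, data = df)
ozone_lm <- df$Ozone
ozone_lm[miss] <- predict(fit, newdata = df[miss, ])

# No missing values should remain after imputation
print(c(mean = anyNA(ozone_mean), median = anyNA(ozone_median), lm = anyNA(ozone_lm)))
```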
Value
A data frame of performance measures, with rows = imputation methods and columns = performance metrics
Note
For Median Imputation and Mean Imputation, the R^2 value cannot be calculated. Each of these methods replaces every missing value with the same constant, so the standard deviation of the imputed data, \sigma_{P}, is zero and the denominator of the R^2 formula vanishes:
R^2 = \left[ \frac{1}{N} \cdot \frac{\sum_{i=1}^N (P_i - \bar{P})(O_i - \bar{O})}{\sigma_{P} \sigma_{O}} \right]^2
where:
- N is the number of imputations,
- O_i is the i-th observed data point,
- P_i is the i-th imputed data point,
- \bar{O} is the average of the observed data,
- \bar{P} is the average of the imputed data,
- \sigma_{P} is the standard deviation of the imputed data,
- \sigma_{O} is the standard deviation of the observed data.
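The point of this note can be checked with a few lines of base R. The sketch below uses made-up values (an assumption for illustration, not output of the package): when every imputed value is the same constant, the standard deviation of the imputed data is zero and the correlation is undefined, while RMSE and MAE can still be computed.

```r
# Made-up observed values and a constant (mean-style) imputation
O <- c(10, 20, 30, 40)
P <- rep(mean(O), length(O))

# Standard deviation of the imputed values (1/N form): zero for a constant
sigma_P <- sqrt(mean((P - mean(P))^2))

# cor() returns NA (with a warning) when one standard deviation is zero
r <- suppressWarnings(cor(P, O))

# RMSE and MAE do not involve the standard deviations and remain defined
rmse <- sqrt(mean((P - O)^2))
mae <- mean(abs(P - O))

print(c(sigma_P = sigma_P, r = r, RMSE = rmse, MAE = mae))
```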
References
OECD/European Union/EC-JRC (2008), Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, Paris, <https://doi.org/10.1787/9789264043466-en>
Examples
data("airquality")
regressors <- colnames(airquality[, c(3, 4)])  # "Wind" and "Temp"
suppressWarnings(get_all_performance(data = airquality, to_impute = "Ozone", regressors = regressors))