impute_data {missCompare}    R Documentation

Missing data imputation with various methods

Description

impute_data imputes a dataframe that contains missing values using the selected algorithm(s).

Usage

impute_data(X, scale = TRUE, n.iter = 10, sel_method = c(1:16))

Arguments

X

Dataframe - the original data that contains missing values.

scale

Boolean with default TRUE. If TRUE, all numeric variables are centered and scaled to mean = 0 and standard deviation = 1. This is strongly suggested for all PCA-based methods and, for the sake of comparison (for instance when all methods are run), for the other methods too. Please note, however, that some methods (e.g. pcaMethods NLPCA, missForest, etc.) are equipped to handle non-linear data. In these cases scaling is up to the user. Factor variables will not be scaled.

n.iter

Number of iterations to perform with default 10. This will only affect the probabilistic methods that allow for a multiple imputation framework. The rest of the methods (if specified to run) will only generate 1 imputed dataframe.

sel_method

Numeric vector that specifies which methods to run. Default is all methods (1-16), but any combination, including a single method, is allowed.

1 random replacement
2 median imputation
3 mean imputation
4 missMDA Regularized
5 missMDA EM
6 pcaMethods PPCA
7 pcaMethods svdImpute
8 pcaMethods BPCA
9 pcaMethods NIPALS
10 pcaMethods NLPCA
11 mice mixed
12 mi Bayesian
13 Amelia II
14 missForest
15 Hmisc aregImpute
16 VIM kNN
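
As a small convenience sketch, the numbering above can be turned into a lookup so that sel_method is built from method names rather than bare numbers. The vector method_labels and the variable names below are hypothetical helpers, not part of missCompare; only the numbering itself comes from the list above.

```r
# Labels in the same order as the numbered list above (index = method number).
# `method_labels` is a hypothetical helper, not an object exported by missCompare.
method_labels <- c(
  "random replacement", "median imputation", "mean imputation",
  "missMDA Regularized", "missMDA EM", "pcaMethods PPCA",
  "pcaMethods svdImpute", "pcaMethods BPCA", "pcaMethods NIPALS",
  "pcaMethods NLPCA", "mice mixed", "mi Bayesian", "Amelia II",
  "missForest", "Hmisc aregImpute", "VIM kNN"
)

# Pick methods by name and translate them to the numeric codes
wanted <- c("median imputation", "mean imputation", "missForest")
sel <- match(wanted, method_labels)

# Guard against typos: match() returns NA for unknown labels
stopifnot(!anyNA(sel))
```

The resulting sel vector could then be passed as sel_method = sel.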

Details

This function assumes that the user has performed simulations using the impute_simulated function and arrived at some conclusions regarding which methods perform best on their datasets. This function offers a convenient way to impute datasets with a curated list of functions. Some of the functions allow for a multiple imputation framework (they operate with probabilistic models, hence there is uncertainty in the imputed values), so this function allows the user to generate multiple imputed datasets. The user can decide to impute their dataframe with a single selected method or with multiple methods.
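
To illustrate why only some methods benefit from several iterations, the following base R sketch contrasts a deterministic fill with a random one. Both helper functions (median_fill, random_fill) are simplified stand-ins written for this illustration and are not the implementations used by missCompare; which of the 16 methods actually produce multiple datasets is determined by the package, not by this sketch.

```r
set.seed(1)

# Toy vector with two missing values (hypothetical data)
x <- c(4, NA, 7, 9, NA, 12)
n_iter <- 3

# Deterministic stand-in: every run yields the same completed vector
median_fill <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Random stand-in: missing values are drawn from the observed ones,
# so repeated runs can differ and reflect uncertainty in the imputed values
random_fill <- function(v) {
  miss <- is.na(v)
  v[miss] <- sample(v[!miss], sum(miss), replace = TRUE)
  v
}

median_runs <- replicate(n_iter, median_fill(x), simplify = FALSE)
random_runs <- replicate(n_iter, random_fill(x), simplify = FALSE)

# All deterministic runs are identical, so one copy is enough
length(unique(median_runs)) == 1
```

Repeating a deterministic method adds no information, which is consistent with n.iter affecting only the probabilistic methods.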

Value

A nested list of imputed datasets. If only a subset of methods was selected, the list elements for the non-selected methods will be empty.

random_replacement

Imputed dataset using random replacement

mean_imputation

Imputed dataset using mean imputation

median_imputation

Imputed dataset using median imputation

missMDA_reg_imputation

Imputed dataset using the missMDA regularized imputation algorithm

missMDA_EM_imputation

Imputed dataset using the missMDA EM imputation algorithm

pcaMethods_PPCA_imputation

Imputed dataset using the pcaMethods PPCA imputation algorithm

pcaMethods_svdImpute_imputation

Imputed dataset using the pcaMethods svdImpute imputation algorithm

pcaMethods_BPCA_imputation

Imputed dataset using the pcaMethods BPCA imputation algorithm

pcaMethods_Nipals_imputation

Imputed dataset using the pcaMethods NIPALS imputation algorithm

pcaMethods_NLPCA_imputation

Imputed dataset using the pcaMethods NLPCA imputation algorithm

mice_mixed_imputation

Imputed dataset using the mice mixed imputation algorithm

mi_Bayesian_imputation

Imputed dataset using the mi Bayesian imputation algorithm

ameliaII_imputation

Imputed dataset using the Amelia II imputation algorithm

missForest_imputation

Imputed dataset using the missForest imputation algorithm

Hmisc_aregImpute_imputation

Imputed dataset using the Hmisc aregImpute imputation algorithm

VIM_kNN_imputation

Imputed dataset using the VIM kNN imputation algorithm
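
Because non-selected methods leave empty elements in the returned list, it can be handy to keep only the filled ones. The sketch below uses a hand-built mock list (res) in place of a real impute_data result; the element names come from the list above, but how emptiness is represented (NULL or an empty list) and how many dataframes each element holds are assumptions made for illustration.

```r
# Mock stand-in for an impute_data() result; not produced by the package.
res <- list(
  random_replacement    = list(),            # assumed form of "not selected"
  mean_imputation       = NULL,              # alternative assumed empty form
  median_imputation     = list(data.frame(x = c(1, 2, 2))),
  missForest_imputation = list(
    data.frame(x = c(1, 2, 2.1)),
    data.frame(x = c(1, 2, 1.9))
  )
)

# lengths() is 0 for both NULL and empty lists, so this drops all empty slots
filled <- res[lengths(res) > 0]
names(filled)

# Number of imputed datasets available per retained method
vapply(filled, length, integer(1))
```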

Examples

## running 10 iterations of all algorithms (that allow for multiple imputation) and
## one copy of those that do not allow for multiple imputations
# impute_data(df, scale = TRUE, n.iter = 10,
#            sel_method = c(1:16))
## running 20 iterations of missForest (e.g. this was the best performing algorithm
## in simulations) on a non-scaled dataframe
# impute_data(df, scale = FALSE, n.iter = 20,
#            sel_method = c(14))
## running 1 iteration of four selected non-probabilistic algorithms on a scaled dataframe
# impute_data(df, scale = TRUE, n.iter = 1,
#            sel_method = c(2:3, 5, 7))
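
The calls above differ in whether scale = TRUE or scale = FALSE is used. As a rough base R illustration of what centering and scaling numeric variables amounts to (a sketch on hypothetical toy data, not the package's internal code), numeric columns can be standardized while factor columns are left untouched:

```r
# Toy data frame with two numeric columns and one factor (hypothetical data)
toy <- data.frame(
  height = c(150, 160, NA, 180),
  weight = c(50, NA, 70, 80),
  group  = factor(c("a", "b", "a", "b"))
)

# Identify numeric columns; factors are skipped
num_cols <- vapply(toy, is.numeric, logical(1))

# scale() centers to mean 0 and scales to sd 1, ignoring NAs in the statistics
toy_scaled <- toy
toy_scaled[num_cols] <- lapply(
  toy[num_cols],
  function(col) as.numeric(scale(col))
)

# Missing values stay missing; they are only imputed later
colSums(is.na(toy_scaled))
```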


[Package missCompare version 1.0.3 Index]