overimpute {rMIDAS} | R Documentation |
Perform overimputation diagnostic test
Description
overimpute() spikes additional missingness into the input data and reports imputation accuracy at training intervals specified by the user. overimpute() works like train() – users must specify input data, binary and categorical columns (if data is not generated via convert()), model parameters for the neural network, and then overimputation parameters (see below for full details).
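In outline, a typical run looks as follows (a minimal sketch; raw_df and its column names are placeholders, and a full runnable version appears under Examples):

pre <- convert(raw_df, bin_cols = c("c", "e"), cat_cols = c("a", "f"))
ov  <- overimpute(pre, spikein = 0.3, training_epochs = 100, report_ival = 25)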
Usage
overimpute(
data,
binary_columns = NULL,
softmax_columns = NULL,
spikein = 0.3,
training_epochs,
report_ival = 35,
plot_vars = FALSE,
skip_plot = FALSE,
spike_seed = NULL,
save_path = "",
layer_structure = c(256, 256, 256),
learn_rate = 4e-04,
input_drop = 0.8,
seed = 123L,
train_batch = 16L,
latent_space_size = 4,
cont_adj = 1,
binary_adj = 1,
softmax_adj = 1,
dropout_level = 0.5,
vae_layer = FALSE,
vae_alpha = 1,
vae_sample_var = 1
)
Arguments
data |
A data.frame (or coercible) object, or an object of class midas_pre created via convert(). |
binary_columns |
A vector of column names, containing binary variables. NOTE: if data is a midas_pre object generated by convert(), this argument will be overwritten. |
softmax_columns |
A list of lists, each internal list corresponding to a single categorical variable and containing the names of its one-hot encoded columns (see the sketch following this argument list for the expected structure). NOTE: if data is a midas_pre object generated by convert(), this argument will be overwritten. |
spikein |
A numeric between 0 and 1; the proportion of observed values in the input dataset to be randomly removed. |
training_epochs |
An integer, specifying the number of overimputation training epochs. |
report_ival |
An integer, specifying the number of overimputation training epochs between calculations of loss. Shorter intervals provide a more granular view of model performance but slow down the overimputation process. |
plot_vars |
Boolean, specifies whether to plot the distribution of original versus overimputed values. This takes the form of a density plot for continuous variables and a barplot for categorical variables (showing proportions of each class). |
skip_plot |
Boolean, specifies whether to suppress the main graphical output. This may be desirable when users are conducting a series of overimputation exercises and are primarily interested in the console output. Note, when skip_plot = FALSE, users must manually close the resulting plot window before the function will terminate. |
spike_seed , seed |
An integer, to initialize the pseudo-random number generators. Separate seeds can be provided for the spiked-in missingness and the imputation model; otherwise spike_seed is set to the value of seed (default = 123L). |
save_path |
String, indicating path to the directory in which to save overimputation figures. Users should include a trailing "/" at the end of the path, e.g. save_path = "path/to/figures/". |
layer_structure |
A vector of integers, the number of nodes in each layer of the network (default = c(256, 256, 256)). |
learn_rate |
A number, the learning rate (default = 4e-04), which controls the size of the adjustment to the network's weights and biases on each training epoch. |
input_drop |
A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance. |
train_batch |
An integer, the number of observations in training mini-batches (default = 16). |
latent_space_size |
An integer, the number of normal dimensions used to parameterize the latent space. |
cont_adj |
A number, weights the importance of continuous variables in the loss function. |
binary_adj |
A number, weights the importance of binary variables in the loss function. |
softmax_adj |
A number, weights the importance of categorical variables in the loss function. |
dropout_level |
A number between 0 and 1, determines the proportion of nodes dropped to "thin" the network. |
vae_layer |
Boolean, specifies whether to include a variational autoencoder layer in the network |
vae_alpha |
A number, the strength of the prior imposed on the Kullback-Leibler divergence term in the variational autoencoder loss functions. |
vae_sample_var |
A number, the sampling variance of the normal distributions used to parameterize the latent space. |
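When data is not a midas_pre object, the categorical structure must be supplied manually through softmax_columns. A minimal sketch of the expected list-of-lists shape, using hypothetical one-hot encoded column names:

softmax_cols <- list(list("a_red", "a_yellow", "a_blue"),
                     list("f_male", "f_female", "f_trans", "f_other"))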
Details
Accuracy is measured as the RMSE of imputed values versus actual values for continuous variables and classification error for categorical variables (i.e., the fraction of correctly predicted classes subtracted from 1). Both metrics are reported in two forms:
their summed value over all Monte Carlo samples from the estimated missing-data posterior – "Aggregated RMSE" and "Aggregated softmax error";
their summed value divided by the number of such samples (i.e., the per-sample average) – "Individual RMSE" and "Individual softmax error".
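As an illustration of how the two forms relate, the following sketch computes both on simulated values (illustration only; rMIDAS computes these metrics internally):

set.seed(123)
true_vals <- rnorm(100)   # spiked-in (held-out) continuous values
draws <- replicate(5, true_vals + rnorm(100, sd = 0.3), simplify = FALSE)   # one vector of imputations per Monte Carlo sample
rmse <- function(pred, true) sqrt(mean((pred - true)^2))
aggregated_rmse <- sum(vapply(draws, rmse, numeric(1), true = true_vals))
individual_rmse <- aggregated_rmse / length(draws)   # summed value / number of samples
class_error <- function(pred, true) 1 - mean(pred == true)   # categorical analogue of RMSE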
In the final model, we recommend selecting the number of training epochs that minimizes the average value of these metrics in the overimputation exercise, weighted by the proportion (or substantive importance) of continuous and categorical variables. This "early stopping" rule reduces the risk of overtraining and thus, in effect, serves as an extra layer of regularization in the network.
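For instance, given the loss values reported at each interval, the rule can be applied as follows (a sketch; the recorded losses and the 2/6 continuous-variable weight are hypothetical):

losses <- data.frame(epoch       = c(5, 10, 15, 20),
                     rmse        = c(0.42, 0.35, 0.33, 0.34),
                     softmax_err = c(0.61, 0.55, 0.54, 0.56))
w_cont <- 2 / 6   # proportion (or substantive weight) of continuous variables
criterion <- w_cont * losses$rmse + (1 - w_cont) * losses$softmax_err
best_epochs <- losses$epoch[which.min(criterion)]   # epoch count minimizing the weighted average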
For more information, see Lall and Robinson (2023): doi:10.18637/jss.v107.i09.
Value
An object of class midas. The function also prints overimputation loss values to the console and, unless suppressed, generates overimputation plots.
References
Lall R, Robinson T (2023). “Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS.” Journal of Statistical Software, 107(9), 1–38. doi:10.18637/jss.v107.i09.
See Also
train for the main imputation function.
Examples
## Not run:
# Run where Python initialised and configured correctly
if (python_configured()) {
  raw_data <- data.table(a = sample(c("red", "yellow", "blue", NA), 1000, replace = TRUE),
                         b = 1:1000,
                         c = sample(c("YES", "NO", NA), 1000, replace = TRUE),
                         d = runif(1000, 1, 10),
                         e = sample(c("YES", "NO"), 1000, replace = TRUE),
                         f = sample(c("male", "female", "trans", "other", NA), 1000, replace = TRUE))
  # Names of binary and categorical variables
  test_bin <- c("c", "e")
  test_cat <- c("a", "f")
  # Pre-process data
  test_data <- convert(raw_data,
                       bin_cols = test_bin,
                       cat_cols = test_cat,
                       minmax_scale = TRUE)
  # Overimpute - without plots
  test_imp <- overimpute(test_data,
                         spikein = 0.3,
                         plot_vars = FALSE,
                         skip_plot = TRUE,
                         training_epochs = 10,
                         report_ival = 5)
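  # Variant: keep graphical output and save figures to disk
  # (sketch - "figures/" is a hypothetical directory that must exist;
  # save_path requires a trailing "/")
  test_imp_plots <- overimpute(test_data,
                               spikein = 0.3,
                               plot_vars = TRUE,
                               training_epochs = 10,
                               report_ival = 5,
                               save_path = "figures/")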
}
## End(Not run)