train {rMIDAS}    R Documentation
Train an imputation model using MIDAS
Description
Build and run a MIDAS neural network on the supplied missing data.
Usage
train(
  data,
  binary_columns = NULL,
  softmax_columns = NULL,
  training_epochs = 10L,
  layer_structure = c(256, 256, 256),
  learn_rate = 4e-04,
  input_drop = 0.8,
  seed = 123L,
  train_batch = 16L,
  latent_space_size = 4,
  cont_adj = 1,
  binary_adj = 1,
  softmax_adj = 1,
  dropout_level = 0.5,
  vae_layer = FALSE,
  vae_alpha = 1,
  vae_sample_var = 1
)
Arguments
data
A data.frame (or coercible) object, or a pre-processed object generated by rMIDAS::convert().
binary_columns
A vector of column names, containing binary variables. NOTE: if data is a pre-processed object from rMIDAS::convert(), this argument is superseded by the column information stored during pre-processing.
softmax_columns
A list of lists, each internal list corresponding to a single categorical variable and containing the names of its one-hot encoded variables. NOTE: if data is a pre-processed object from rMIDAS::convert(), this argument is superseded by the column information stored during pre-processing.
training_epochs
An integer, indicating the number of forward passes to conduct when running the model.
layer_structure
A vector of integers, the number of nodes in each layer of the network (default = c(256, 256, 256)).
learn_rate
A number, the learning rate (default = 4e-04).
input_drop
A number between 0 and 1. The probability of corruption for input columns in training mini-batches (default = 0.8). Higher values increase training time but reduce the risk of overfitting. In our experience, values between 0.7 and 0.95 deliver the best performance.
seed
An integer, the value to which Python's pseudo-random number generator is initialized. This enables users to ensure that data shuffling, weight and bias initialization, and missingness indicator vectors are reproducible.
train_batch
An integer, the number of observations in training mini-batches (default = 16).
latent_space_size
An integer, the number of normal dimensions used to parameterize the latent space.
cont_adj
A number, weights the importance of continuous variables in the loss function.
binary_adj
A number, weights the importance of binary variables in the loss function.
softmax_adj
A number, weights the importance of categorical variables in the loss function.
dropout_level
A number between 0 and 1, determines the proportion of nodes dropped to "thin" the network.
vae_layer
Boolean, specifies whether to include a variational autoencoder layer in the network.
vae_alpha
A number, the strength of the prior imposed on the Kullback-Leibler divergence term in the variational autoencoder loss function.
vae_sample_var
A number, the sampling variance of the normal distributions used to parameterize the latent space.
Details
For more information, see Lall and Robinson (2023): doi:10.18637/jss.v107.i09.
Value
Object of class midas, from which completed datasets can be drawn using rMIDAS::complete().
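For example (a minimal sketch, assuming trained_model is the object returned by train()):

# Draw 5 completed datasets from the trained model
imputed_list <- complete(trained_model, m = 5)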
References
Lall R, Robinson T (2023). “Efficient Multiple Imputation for Diverse Data in Python and R: MIDASpy and rMIDAS.” Journal of Statistical Software, 107(9), 1–38. doi:10.18637/jss.v107.i09.
Examples
# Generate raw data, with numeric, binary, and categorical variables
## Not run:
# Run where Python is available and configured correctly
if (python_configured()) {
  library(data.table)  # data.table() is used to build the example data
  set.seed(89)
  n_obs <- 10000
  raw_data <- data.table(a = sample(c("red", "yellow", "blue", NA), n_obs, replace = TRUE),
                         b = 1:n_obs,
                         c = sample(c("YES", "NO", NA), n_obs, replace = TRUE),
                         d = runif(n_obs, 1, 10),
                         e = sample(c("YES", "NO"), n_obs, replace = TRUE),
                         f = sample(c("male", "female", "trans", "other", NA), n_obs, replace = TRUE))
  # Names of binary and categorical variables
  test_bin <- c("c", "e")
  test_cat <- c("a", "f")
  # Pre-process data
  test_data <- convert(raw_data,
                       bin_cols = test_bin,
                       cat_cols = test_cat,
                       minmax_scale = TRUE)
  # Train the imputation model
  test_imp <- train(test_data)
  # Generate completed datasets
  complete_datasets <- complete(test_imp, m = 5, fast = FALSE)
  # Use Rubin's rules to combine m regression models
  midas_pool <- combine(formula = d ~ a + c + e + f,
                        complete_datasets)
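  # Illustrative variant (not part of the original example): re-train on the
  # same pre-processed data with the variational autoencoder layer enabled,
  # using the vae_* arguments documented above
  test_imp_vae <- train(test_data,
                        vae_layer = TRUE,
                        latent_space_size = 8)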
}
## End(Not run)