mdgc {mdgc}    R Documentation

Perform Model Estimation and Imputation

Description

A convenience function that performs model estimation and imputation in one call. A suitable learning rate is likely model specific, so the default should be tuned for the data at hand. See mdgc_fit.

See the README at https://github.com/boennecd/mdgc for examples.

Usage

mdgc(
  dat,
  lr = 0.001,
  maxit = 25L,
  batch_size = NULL,
  rel_eps = 0.001,
  method = c("svrg", "adam", "aug_Lagran"),
  seed = 1L,
  epsilon = 1e-08,
  beta_1 = 0.9,
  beta_2 = 0.999,
  n_threads = 1L,
  do_reorder = TRUE,
  abs_eps = -1,
  maxpts = 10000L,
  minvls = 100L,
  verbose = FALSE,
  irel_eps = rel_eps,
  imaxit = maxpts,
  iabs_eps = abs_eps,
  iminvls = 1000L,
  start_val = NULL,
  decay = 0.98,
  conv_crit = 1e-05,
  use_aprx = FALSE
)

Arguments

dat

data.frame with continuous, multinomial, ordinal, and binary variables.

lr

learning rate.

maxit

maximum number of iterations.

batch_size

number of observations in each batch.

rel_eps

relative error for each marginal likelihood factor.

method

estimation method to use. Can be "svrg", "adam", or "aug_Lagran".

seed

fixed seed to use. Use NULL if the seed should not be fixed.

epsilon

ADAM parameters.

beta_1

ADAM parameters.

beta_2

ADAM parameters.

n_threads

number of threads to use.

do_reorder

logical for whether to use a heuristic variable reordering. TRUE is likely the best option.

abs_eps

absolute convergence threshold for each marginal likelihood factor.

maxpts

maximum number of samples to draw for each marginal likelihood term.

minvls

minimum number of samples.

verbose

logical for whether to print output during the estimation.

irel_eps

relative error for each term in the imputation.

imaxit

maximum number of samples to draw in the imputation.

iabs_eps

absolute convergence threshold for each term in the imputation.

iminvls

minimum number of samples in the imputation.

start_val

starting value for the covariance matrix. Use NULL if unspecified.

decay

the learning rate used by SVRG is given by lr * decay^iteration_number.

conv_crit

relative convergence threshold.

use_aprx

logical for whether to use an approximation of pnorm and qnorm. This may yield a noticeable reduction in the computation time.
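
As a rough illustration of the decay argument, the sketch below tabulates lr * decay^iteration_number with the default lr, decay, and maxit from the Usage section. It assumes that iteration_number runs from 1 to maxit, which this page does not state; the package may count from 0 instead.

```r
# Default values taken from the Usage section.
lr    <- 0.001
decay <- 0.98
maxit <- 25L

# ASSUMPTION: iterations are numbered 1, ..., maxit.
iteration_number <- seq_len(maxit)
lr_schedule <- lr * decay^iteration_number

# Learning rate used in the first and in the last iteration.
print(lr_schedule[c(1L, maxit)])
```

A decay closer to 1 keeps the learning rate nearly constant over the iterations, while a smaller decay shrinks it faster.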

Details

It is important that the columns of dat have the appropriate types and classes. See get_mdgc.
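
A minimal sketch of preparing the columns before calling mdgc follows. The example on this page uses as.ordered for the ordinal variable and as.logical for the binary variable. Mapping continuous variables to numeric and multinomial variables to an unordered factor is an assumption here and should be checked against get_mdgc. The data frame toy and its columns are hypothetical.

```r
# Hypothetical raw data in which the categorical columns are coded as
# integers or character strings.
toy <- data.frame(
  bmi    = c(21.5, 27.3, NA, 30.1),           # continuous
  smoker = c(0L, 1L, 1L, NA),                 # binary
  stage  = c(1L, 3L, 2L, 2L),                 # ordinal
  region = c("north", "south", NA, "east"),   # multinomial
  stringsAsFactors = FALSE)

toy$bmi    <- as.numeric(toy$bmi)      # ASSUMPTION: numeric for continuous
toy$smoker <- as.logical(toy$smoker)   # as in the example on this page
toy$stage  <- as.ordered(toy$stage)    # as in the example on this page
toy$region <- factor(toy$region)       # ASSUMPTION: unordered factor

# Inspect the first class of each column before passing the data on.
print(sapply(toy, function(x) class(x)[1L]))
```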

Value

A list with the following entries:

ximp

data.frame with the observed and imputed values.

imputed

output from mdgc_impute.

vcov

the estimated covariance matrix.

mea

the estimated non-zero mean terms.

Additional elements may be present depending on the chosen method. See mdgc_fit.

References

Kingma, D.P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.

Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems.

See Also

get_mdgc, mdgc_start_value, get_mdgc_log_ml, mdgc_fit, mdgc_impute

Examples


# there is a bug on CRAN's check on Solaris which I have failed to reproduce.
# See https://github.com/r-hub/solarischeck/issues/8#issuecomment-796735501.
# Thus, this example is not run on Solaris
is_solaris <- tolower(Sys.info()[["sysname"]]) == "sunos"

if(!is_solaris && require(catdata)){
  data(retinopathy)

  # prepare data and save true data set
  retinopathy$RET <- as.ordered(retinopathy$RET)
  retinopathy$SM <- as.logical(retinopathy$SM)

  # randomly mask data
  set.seed(28325145)
  truth <- retinopathy
  for(i in seq_along(retinopathy))
    retinopathy[[i]][runif(NROW(retinopathy)) < .3] <- NA

  cat("\nMasked data:\n")
  print(head(retinopathy, 10))
  cat("\n")

  # impute data
  impu <- mdgc(retinopathy, lr = 1e-3, maxit = 25L, batch_size = 25L,
               rel_eps = 1e-3, maxpts = 5000L, verbose = TRUE,
               n_threads = 1L, method = "svrg")

  # show correlation matrix
  cat("\nEstimated correlation matrix\n")
  print(impu$vcov)

  # compare imputed and true values
  cat("\nObserved:\n")
  print(head(retinopathy, 10))
  cat("\nImputed values:\n")
  print(head(impu$ximp, 10))
  cat("\nTruth:\n")
  print(head(truth, 10))

  # using augmented Lagrangian method
  cat("\n")
  impu_aug <- mdgc(retinopathy, maxit = 25L, rel_eps = 1e-3,
                   maxpts = 5000L, verbose = TRUE,
                   n_threads = 1L, method = "aug_Lagran")

  # compare the log-likelihood estimate
  obj <- get_mdgc_log_ml(retinopathy)
  cat(sprintf(
    "Maximum log likelihood with SVRG vs. augmented Lagrangian:\n  %.2f vs. %.2f\n",
    mdgc_log_ml(obj, vcov = impu    $vcov, mea = impu    $mea, rel_eps = 1e-3),
    mdgc_log_ml(obj, vcov = impu_aug$vcov, mea = impu_aug$mea, rel_eps = 1e-3)))

  # show correlation matrix
  cat("\nEstimated correlation matrix (augmented Lagrangian)\n")
  print(impu_aug$vcov)

  cat("\nImputed values (augmented Lagrangian):\n")
  print(head(impu_aug$ximp, 10))
}
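
The example above prints the imputed values next to the truth. A minimal sketch of summarizing that comparison with base R follows. The helper masked_error is hypothetical and not part of mdgc. It assumes that the imputed data frame has the same columns and column types as the true data, and it is shown on small made-up data frames rather than on the output of mdgc.

```r
# Hypothetical helper (not part of mdgc): for every column, compare imputed
# and true values on the entries that are NA in the observed data.
# Numeric columns get a root mean square error; all other columns get a
# misclassification rate.
masked_error <- function(observed, imputed, truth)
  sapply(names(truth), function(v){
    # entries that were masked but have a known true value
    idx <- is.na(observed[[v]]) & !is.na(truth[[v]])
    if(!any(idx))
      return(NA_real_)
    if(is.numeric(truth[[v]]))
      sqrt(mean((imputed[[v]][idx] - truth[[v]][idx])^2))
    else
      mean(as.character(imputed[[v]][idx]) != as.character(truth[[v]][idx]))
  })

# Made-up data for illustration.
truth    <- data.frame(x = c(1, 2, 3, 4), y = c(TRUE, FALSE, TRUE, TRUE))
observed <- truth
observed$x[c(2, 4)] <- NA
observed$y[c(1, 3)] <- NA
imputed  <- truth
imputed$x[c(2, 4)] <- c(2.5, 3.5)
imputed$y[c(1, 3)] <- c(TRUE, FALSE)

print(masked_error(observed, imputed, truth))
```

Under the stated assumptions, the corresponding call for the example above would be masked_error(retinopathy, impu$ximp, truth).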



[Package mdgc version 0.1.7 Index]