R: Semi-automated hyperparameter estimation

autoHyper {openEBGM}

R Documentation

Semi-automated hyperparameter estimation

Description

autoHyper finds a single hyperparameter estimate using an algorithm that evaluates results from multiple starting points (see exploreHypers). The algorithm verifies that the optimization converges within the bounds of the parameter space and that the chosen estimate (smallest negative log-likelihood) is similar to at least one (see min_conv argument) of the other convergent solutions.

Usage

autoHyper(
  data,
  theta_init,
  squashed = TRUE,
  zeroes = FALSE,
  N_star = 1,
  tol = c(0.05, 0.05, 0.2, 0.2, 0.025),
  min_conv = 1,
  param_limit = 100,
  max_pts = 20000,
  conf_ints = FALSE,
  conf_level = c("95", "80", "90", "99")
)

Arguments

`data`	A data frame from `processRaw` containing columns named N, E, and (if squashed) weight.
`theta_init`	A data frame of initial hyperparameter guesses with columns ordered as: `\alpha_1, \beta_1, \alpha_2, \beta_2, P`.
`squashed`	A scalar logical (`TRUE` or `FALSE`) indicating whether or not data squashing was used.
`zeroes`	A scalar logical specifying if zero counts are included.
`N_star`	A positive scalar whole number value for the minimum count size to be used for hyperparameter estimation. If zeroes are used, set `N_star` to `NULL`.
`tol`	A numeric vector of tolerances for determining how close the chosen estimate must be to at least `min_conv` convergent solutions. Order is `\alpha_1`, `\beta_1`, `\alpha_2`, `\beta_2`, `P`.
`min_conv`	A scalar positive whole number for defining the minimum number of convergent solutions that must be close to the convergent solution with the smallest negative log-likelihood. Must be at least one and at most one less than the number of rows in `theta_init`.
`param_limit`	A scalar numeric value for the largest acceptable value for the `\alpha` and `\beta` estimates. Used to help protect against unreasonable/erroneous estimates.
`max_pts`	A scalar whole number for the largest number of data points allowed. Used to help prevent extremely long run times.
`conf_ints`	A scalar logical indicating if confidence intervals and standard errors should be returned.
`conf_level`	A scalar string for the confidence level (in percent) used if confidence intervals are requested. One of "95" (the default), "80", "90", or "99".
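
The interaction between `tol` and `min_conv` can be illustrated with a minimal base R sketch. It is not the package's internal code: the `theta_hats` values are made up, the column names are hypothetical, and "close" is assumed here to mean an absolute difference no larger than the matching element of `tol`.

```r
# Hypothetical convergent solutions, one row per starting point (made-up values).
theta_hats <- data.frame(
  alpha1  = c(3.25, 3.26, 0.80),
  beta1   = c(0.40, 0.41, 1.90),
  alpha2  = c(2.02, 2.05, 5.00),
  beta2   = c(1.91, 1.90, 4.10),
  P       = c(0.065, 0.066, 0.400),
  minimum = c(4162.5, 4162.6, 4301.2)  # negative log-likelihood at each solution
)
tol      <- c(0.05, 0.05, 0.2, 0.2, 0.025)  # same order as the first five columns
min_conv <- 1

# The chosen estimate is the solution with the smallest negative log-likelihood.
best   <- which.min(theta_hats$minimum)
params <- as.matrix(theta_hats[, 1:5])
others <- params[-best, , drop = FALSE]

# Assumption: closeness is judged by absolute difference, parameter by parameter.
diffs     <- abs(sweep(others, 2, params[best, ]))
is_close  <- apply(diffs, 1, function(d) all(d <= tol))
num_close <- sum(is_close)

# The chosen estimate is accepted only if enough other solutions agree with it.
accepted <- num_close >= min_conv
```

With these made-up values, the second row agrees with the chosen estimate on all five parameters and the third does not, so a single close solution is enough to satisfy `min_conv = 1`.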

Details

The algorithm first attempts to find a consistently convergent solution using nlminb. If it fails, it will next try nlm. If it still fails, it will try optim (method = "BFGS"). If all three approaches fail, the function returns an error message.
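
The fallback order described above can be sketched in base R. This is only an illustration of the try-the-next-optimizer idea: the objective is a toy function standing in for the negative log-likelihood, `fit_with_fallback` is a hypothetical helper, and the sketch uses a single starting point, whereas autoHyper looks for a solution that is consistent across several starting points.

```r
# Toy objective standing in for the negative log-likelihood (Rosenbrock function).
neg_loglik <- function(theta) {
  100 * (theta[2] - theta[1]^2)^2 + (1 - theta[1])^2
}

# Each wrapper returns list(par, value, ok) so the results share one shape.
optimizers <- list(
  nlminb = function(start) {
    fit <- stats::nlminb(start, neg_loglik)
    list(par = fit$par, value = fit$objective, ok = fit$convergence == 0)
  },
  nlm = function(start) {
    fit <- stats::nlm(neg_loglik, start)
    # nlm codes 1 and 2 indicate a probable solution.
    list(par = fit$estimate, value = fit$minimum, ok = fit$code %in% c(1, 2))
  },
  bfgs = function(start) {
    fit <- stats::optim(start, neg_loglik, method = "BFGS")
    list(par = fit$par, value = fit$value, ok = fit$convergence == 0)
  }
)

# Hypothetical helper: try each optimizer in order and keep the first success.
fit_with_fallback <- function(start) {
  for (method in names(optimizers)) {
    fit <- tryCatch(optimizers[[method]](start), error = function(e) NULL)
    if (!is.null(fit) && fit$ok) {
      return(c(list(method = method), fit))
    }
  }
  stop("No optimizer converged from this starting point.")
}

result <- fit_with_fallback(c(-1.2, 1))
result$method  # name of the first optimizer that reported convergence
result$par     # parameter values at the reported minimum
```

Wrapping each optimizer so that it returns the same fields keeps the loop independent of the differing return structures of nlminb, nlm, and optim.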

Since this function runs multiple optimization procedures, it is best to start with 5 or fewer initial starting points (rows in theta_init). If the function runs in a reasonable amount of time, this number can be increased.

This function should not be used with very large data sets since each optimization call will take a long time. squashData can be used first to reduce the size of the data.

It is recommended to use N_star = 1 when practical. Data squashing (see squashData) can be used to further reduce the number of data points.

Asymptotic normal confidence intervals, if requested, use standard errors calculated from the observed Fisher information matrix as discussed in DuMouchel (1999).
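
As a generic illustration of that technique (not the package's implementation), the following base R sketch computes asymptotic normal confidence intervals from the observed Fisher information. It assumes a simple gamma model fitted to simulated data in place of the hyperparameter likelihood; all names in it are illustrative.

```r
set.seed(1)
x <- rgamma(500, shape = 2, rate = 0.5)  # simulated data for the toy model

# Negative log-likelihood of a gamma model, theta = c(shape, rate).
neg_loglik <- function(theta) {
  -sum(dgamma(x, shape = theta[1], rate = theta[2], log = TRUE))
}

# Minimize and ask optim for a numerically differentiated Hessian at the optimum.
fit <- optim(c(1, 1), neg_loglik, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6), hessian = TRUE)

# The Hessian of the negative log-likelihood at the optimum is the observed
# Fisher information; its inverse approximates the covariance of the estimates.
obs_info <- fit$hessian
std_err  <- sqrt(diag(solve(obs_info)))

# Two-sided normal interval at the requested confidence level.
conf_level <- 0.95
z <- qnorm(1 - (1 - conf_level) / 2)
conf_int <- data.frame(
  estimate  = fit$par,
  se        = std_err,
  lower     = fit$par - z * std_err,
  upper     = fit$par + z * std_err,
  row.names = c("shape", "rate")
)
print(conf_int)
```

Because the interval is symmetric around the estimate, the lower limit can fall below zero for a parameter that must be positive when the standard error is large relative to the estimate.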

Value

A list containing the following elements:

method: A scalar character string for the method used to find the hyperparameter estimate (possibilities are “nlminb”, “nlm”, and “bfgs”).
estimates: A named numeric vector of length 5 for the hyperparameter estimate corresponding to the smallest negative log-likelihood.
conf_int: A data frame including the standard errors and confidence limits. Only included if conf_ints = TRUE.
num_close: A scalar integer for the number of other convergent solutions that were close (within tolerance) to the chosen estimate.
theta_hats: A data frame for the estimates corresponding to the initial starting points defined by theta_init. See exploreHypers.

References

DuMouchel W (1999). "Bayesian Data Mining in Large Frequency Tables, With an Application to the FDA Spontaneous Reporting System." The American Statistician, 53(3), 177-190.

Examples

data.table::setDTthreads(2)  # only needed for CRAN checks
# Start with 2 or more guesses
theta_init <- data.frame(
  alpha1 = c(0.5, 1),
  beta1  = c(0.5, 1),
  alpha2 = c(2,   3),
  beta2  = c(2,   3),
  p      = c(0.1, 0.2)
)
data(caers)
proc <- processRaw(caers)
squashed <- squashData(proc, bin_size = 300, keep_pts = 10)
squashed <- squashData(squashed, count = 2, bin_size = 13, keep_pts = 10)
suppressWarnings(
  hypers <- autoHyper(squashed, theta_init = theta_init)
)
print(hypers)

[Package openEBGM version 0.9.1 Index]