R: Partially Separable Function Optimization

psqn {psqn}

R Documentation

Partially Separable Function Optimization

Description

Optimization method for specially structured partially separable functions. The psqn_aug_Lagrang function supports non-linear equality constraints using an augmented Lagrangian method.

Usage

psqn(
  par,
  fn,
  n_ele_func,
  rel_eps = 1e-08,
  max_it = 100L,
  n_threads = 1L,
  c1 = 1e-04,
  c2 = 0.9,
  use_bfgs = TRUE,
  trace = 0L,
  cg_tol = 0.5,
  strong_wolfe = TRUE,
  env = NULL,
  max_cg = 0L,
  pre_method = 1L,
  mask = as.integer(c()),
  gr_tol = -1
)

psqn_aug_Lagrang(
  par,
  fn,
  n_ele_func,
  consts,
  n_constraints,
  multipliers = as.numeric(c()),
  penalty_start = 1L,
  rel_eps = 1e-08,
  max_it = 100L,
  max_it_outer = 100L,
  violations_norm_thresh = 1e-06,
  n_threads = 1L,
  c1 = 1e-04,
  c2 = 0.9,
  tau = 1.5,
  use_bfgs = TRUE,
  trace = 0L,
  cg_tol = 0.5,
  strong_wolfe = TRUE,
  env = NULL,
  max_cg = 0L,
  pre_method = 1L,
  mask = as.integer(c()),
  gr_tol = -1
)

Arguments

`par`	Initial values for the parameters. It is a concatenated vector of the global parameters and all the private parameters.
`fn`	Function to compute the element functions and their derivatives. Each call computes an element function. See the examples section.
`n_ele_func`	Number of element functions.
`rel_eps`	Relative convergence threshold.
`max_it`	Maximum number of iterations.
`n_threads`	Number of threads to use.
`c1`, `c2`	Thresholds for the Wolfe condition.
`use_bfgs`	Logical for whether to use BFGS updates or SR1 updates.
`trace`	Integer where larger values gives more information during the optimization.
`cg_tol`	Threshold for the conjugate gradient method.
`strong_wolfe`	`TRUE` if the strong Wolfe condition should be used.
`env`	Environment to evaluate `fn` in. `NULL` yields the global environment.
`max_cg`	Maximum number of conjugate gradient iterations in each iteration. Use zero if there should not be a limit.
`pre_method`	Preconditioning method in the conjugate gradient method. Zero yields no preconditioning, one yields diagonal preconditioning, two yields the incomplete Cholesky factorization from Eigen, and three yields a block diagonal preconditioning. One and three are fast options with three seeming to work well for some poorly conditioned problems.
`mask`	zero based indices for parameters to mask (i.e. fix).
`gr_tol`	convergence tolerance for the Euclidean norm of the gradient. A negative value yields no check.
`consts`	Function to compute the constraints which must be equal to zero. See the example Section.
`n_constraints`	The number of constraints.
`multipliers`	Staring values for the multipliers in the augmented Lagrangian method. There needs to be the same number of multipliers as the number of constraints. An empty vector, `numeric()`, yields zero as the starting value for all multipliers.
`penalty_start`	Starting value for the penalty parameterin the augmented Lagrangian method.
`max_it_outer`	Maximum number of augmented Lagrangian steps.
`violations_norm_thresh`	Threshold for the norm of the constraint violations.
`tau`	Multiplier used for the penalty parameter between each outer iterations.

Details

The function follows the method described by Nocedal and Wright (2006) and mainly what is described in Section 7.4. Details are provided in the psqn vignette. See vignette("psqn", package = "psqn").

The partially separable function we consider are special in that the function to be minimized is a sum of so-called element functions which only depend on few shared (global) parameters and some private parameters which are particular to each element function. A generic method for other partially separable functions is available through the psqn_generic function.

The optimization function is also available in C++ as a header-only library. Using C++ may reduce the computation time substantially. See the vignette in the package for examples.

You have to define the PSQN_USE_EIGEN macro variable in C++ if you want to use the incomplete Cholesky factorization from Eigen. You will also have to include Eigen or RcppEigen. This is not needed when you use the R functions documented here. The incomplete Cholesky factorization comes with some additional overhead because of the allocations of the factorization, forming the factorization, and the assignment of the sparse version of the Hessian approximation. However, it may substantially reduce the required number of conjugate gradient iterations.

Value

pqne: An object with the following elements:

`par`	the estimated global and private parameters.
`value`	function value at `par`.
`info`	information code. 0 implies convergence. -1 implies that the maximum number iterations is reached. -2 implies that the conjugate gradient method failed. -3 implies that the line search failed. -4 implies that the user interrupted the optimization.
`counts`	An integer vector with the number of function evaluations, gradient evaluations, and the number of conjugate gradient iterations.
`convergence`	`TRUE` if `info == 0`.

psqn_aug_Lagrang: Like psqn with a few exceptions:

`multipliers`	final multipliers from the augmented Lagrangian method.
`counts`	has an additional element called `n_aug_Lagrang` with the number of augmented Lagrangian iterations.
`penalty`	the final penalty parameter from the augmented Lagrangian method.

References

Nocedal, J. and Wright, S. J. (2006). Numerical Optimization (2nd ed.). Springer.

Lin, C. and Moré, J. J. (1999). Incomplete Cholesky factorizations with limited memory. SIAM Journal on Scientific Computing.

Examples

# example with inner problem in a Taylor approximation for a GLMM as in the
# vignette

# assign model parameters, number of random effects, and fixed effects
q <- 2 # number of private parameters per cluster
p <- 1 # number of global parameters
beta <- sqrt((1:p) / sum(1:p))
Sigma <- diag(q)

# simulate a data set
set.seed(66608927)
n_clusters <- 20L # number of clusters
sim_dat <- replicate(n_clusters, {
  n_members <- sample.int(8L, 1L) + 2L
  X <- matrix(runif(p * n_members, -sqrt(6 / 2), sqrt(6 / 2)),
              p)
  u <- drop(rnorm(q) %*% chol(Sigma))
  Z <- matrix(runif(q * n_members, -sqrt(6 / 2 / q), sqrt(6 / 2 / q)),
              q)
  eta <- drop(beta %*% X + u %*% Z)
  y <- as.numeric((1 + exp(-eta))^(-1) > runif(n_members))

  list(X = X, Z = Z, y = y, u = u, Sigma_inv = solve(Sigma))
}, simplify = FALSE)

# evaluates the negative log integrand.
#
# Args:
#   i cluster/element function index.
#   par the global and private parameter for this cluster. It has length
#       zero if the number of parameters is requested. That is, a 2D integer
#       vector the number of global parameters as the first element and the
#       number of private parameters as the second element.
#   comp_grad logical for whether to compute the gradient.
r_func <- function(i, par, comp_grad){
  dat <- sim_dat[[i]]
  X <- dat$X
  Z <- dat$Z

  if(length(par) < 1)
    # requested the dimension of the parameter
    return(c(global_dim = NROW(dat$X), private_dim = NROW(dat$Z)))

  y <- dat$y
  Sigma_inv <- dat$Sigma_inv

  beta <- par[1:p]
  uhat <- par[1:q + p]
  eta <- drop(beta %*% X + uhat %*% Z)
  exp_eta <- exp(eta)

  out <- -sum(y * eta) + sum(log(1 + exp_eta)) +
    sum(uhat * (Sigma_inv %*% uhat)) / 2
  if(comp_grad){
    d_eta <- -y + exp_eta / (1 + exp_eta)
    grad <- c(X %*% d_eta,
              Z %*% d_eta + dat$Sigma_inv %*% uhat)
    attr(out, "grad") <- grad
  }

  out
}

# optimize the log integrand
res <- psqn(par = rep(0, p + q * n_clusters), fn = r_func,
            n_ele_func = n_clusters)
head(res$par, p)              # the estimated global parameters
tail(res$par, n_clusters * q) # the estimated private parameters

# compare with
beta
c(sapply(sim_dat, "[[", "u"))

# add equality constraints
idx_constrained <- list(c(2L, 19L), c(1L, 5L, 8L))

# evaluates the c(x) in equalities c(x) = 0.
#
# Args:
#   i constrain index.
#   par the constrained parameters. It has length zero if we need to pass the
#       one-based indices of the parameters that the i'th constrain depends on.
#   what integer which is zero if the function should be returned and one if the
#        gradient should be computed.
consts <- function(i, par, what){
  if(length(par) == 0)
    # need to return the indices
    return(idx_constrained[[i]])

  if(i == 1){
    # a linear equality constrain. It is implemented as a non-linear constrain
    # though
    out <- sum(par) - 3
    if(what == 1)
      attr(out, "grad") <- rep(1, length(par))

  } else if(i == 2){
    # the parameters need to be on a circle
    out <- sum(par^2) - 1
    if(what == 1)
      attr(out, "grad") <- 2 * par
  }

  out
}

# optimize with the constraints
res_consts <- psqn_aug_Lagrang(
  par = rep(0, p + q * n_clusters), fn = r_func, consts = consts,
  n_ele_func = n_clusters, n_constraints = length(idx_constrained))

res_consts
res_consts$multipliers # the estimated multipliers
res_consts$penalty # the penalty parameter

# the function value is higher (worse) as expected
res$value - res_consts$value

# the two constraints are satisfied
sum(res_consts$par[idx_constrained[[1]]]) - 3   # ~ 0
sum(res_consts$par[idx_constrained[[2]]]^2) - 1 # ~ 0

# we can also use another pre conditioner
res_consts_chol <- psqn_aug_Lagrang(
  par = rep(0, p + q * n_clusters), fn = r_func, consts = consts,
  n_ele_func = n_clusters, n_constraints = length(idx_constrained),
  pre_method = 2L)

res_consts_chol

[Package psqn version 0.3.1 Index]