optim_swats {torchopt}        R Documentation

SWATS optimizer

Description

R implementation of the SWATS optimizer proposed by Keskar and Socher (2018). We used the implementation available at https://github.com/jettify/pytorch-optimizer/. Thanks to Nikolay Novik for providing the PyTorch code.

From the abstract of the paper by Keskar and Socher (2018): Adaptive optimization methods such as Adam, Adagrad or RMSprop have been found to generalize poorly compared to stochastic gradient descent (SGD). These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages of training. We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when a triggering condition is satisfied. The condition we propose relates to the projection of Adam steps on the gradient subspace. By design, the monitoring process for this condition adds very little overhead and does not increase the number of hyperparameters in the optimizer.
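The triggering condition can be illustrated with a short sketch. The fragment below is not part of the torchopt API; it is one reading of the projection test from the paper, with illustrative names (swats_gamma, p_k, g_k). It computes the scaling gamma_k for which a plain gradient step -gamma_k * g_k, projected onto the Adam step p_k, recovers p_k; SWATS tracks an exponential average of this quantity and switches to SGD once it stabilizes, using the averaged value as the SGD learning rate.

# Illustrative sketch only (not torchopt internals): gamma_k such that
# projecting -gamma_k * g_k onto the Adam step p_k gives back p_k.
swats_gamma <- function(p_k, g_k) {
  dot_pg <- sum(p_k * g_k)
  if (dot_pg == 0) {
    return(NA_real_)  # test is undefined when the step is orthogonal to the gradient
  }
  sum(p_k * p_k) / (-dot_pg)
}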

Usage

optim_swats(
  params,
  lr = 0.01,
  betas = c(0.9, 0.999),
  eps = 1e-08,
  weight_decay = 0,
  nesterov = FALSE
)

Arguments

params

List of parameters to optimize.

lr

Learning rate (default: 0.01).

betas

Coefficients used for computing running averages of the gradient and its square (default: c(0.9, 0.999)).

eps

Term added to the denominator to improve numerical stability (default: 1e-8).

weight_decay

Weight decay (L2 penalty) (default: 0).

nesterov

Enables Nesterov momentum (default: FALSE).
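
For illustration, the arguments above can be passed explicitly when building the optimizer. The snippet below is a minimal sketch; the tensor w and the chosen values (lr = 0.01, weight_decay = 0.01) are arbitrary assumptions, not recommended settings.

if (torch::torch_is_installed()) {
  # a single parameter tensor to optimize (illustrative)
  w <- torch::torch_randn(10, requires_grad = TRUE)
  opt <- torchopt::optim_swats(
    params       = list(w),
    lr           = 0.01,
    betas        = c(0.9, 0.999),
    eps          = 1e-08,
    weight_decay = 0.01
  )
}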

Value

A torch optimizer object implementing the step method.
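
A minimal sketch of how the returned object is driven in a training loop; the quadratic loss is a placeholder assumption, and the pattern (zero_grad, backward, step) is the same one used in the Examples section below.

if (torch::torch_is_installed()) {
  w   <- torch::torch_randn(2, requires_grad = TRUE)
  opt <- torchopt::optim_swats(params = list(w), lr = 0.01)
  for (i in 1:10) {
    opt$zero_grad()       # clear gradients from the previous step
    loss <- sum(w^2)      # placeholder objective
    loss$backward()       # backpropagate
    opt$step()            # SWATS parameter update
  }
}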

Author(s)

Gilberto Camara, gilberto.camara@inpe.br

Daniel Falbel, daniel.falble@gmail.com

Rolf Simoes, rolf.simoes@inpe.br

Felipe Souza, lipecaso@gmail.com

Alber Sanchez, alber.ipia@inpe.br

References

Nitish Shirish Keskar, Richard Socher, "Improving Generalization Performance by Switching from Adam to SGD". International Conference on Learning Representations (ICLR), 2018. https://arxiv.org/abs/1712.07628

Examples

if (torch::torch_is_installed()) {
  # function to demonstrate optimization
  beale <- function(x, y) {
    log((1.5 - x + x * y)^2 + (2.25 - x + x * y^2)^2 + (2.625 - x + x * y^3)^2)
  }
  # define optimizer
  optim <- torchopt::optim_swats
  # define hyperparameters
  opt_hparams <- list(lr = 0.01)

  # starting point
  x0 <- 3
  y0 <- 3
  # create tensors that track gradients
  x <- torch::torch_tensor(x0, requires_grad = TRUE)
  y <- torch::torch_tensor(y0, requires_grad = TRUE)
  # instantiate optimizer
  optim <- do.call(optim, c(list(params = list(x, y)), opt_hparams))
  # run optimizer
  steps <- 400
  x_steps <- numeric(steps)
  y_steps <- numeric(steps)
  for (i in seq_len(steps)) {
    x_steps[i] <- as.numeric(x)
    y_steps[i] <- as.numeric(y)
    optim$zero_grad()
    z <- beale(x, y)
    z$backward()
    optim$step()
  }
  print(paste0("starting value = ", beale(x0, y0)))
  print(paste0("final value = ", beale(x_steps[steps], y_steps[steps])))
}
