Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

    • Learned optimizers are neural networks that can accelerate machine learning model training
    • Blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set
    • Tools from dynamical systems are used to investigate the inductive biases and stability properties of optimization algorithms
    • Simple modifications to a learned optimizer’s architecture and meta-training procedure can lead to improved stability and inductive bias
    • The resulting learned optimizer outperforms the current state of the art and is capable of generalizing to tasks far different from those it was meta-trained on

Paper Content

Introduction

  • Algorithms for stochastic non-convex optimization are important for neural network training
  • Choice of optimization algorithm and hyperparameters is critical for model performance and training stability
  • Few formal rules for choosing optimizers and hyperparameters
  • Learned optimizers have been proposed
  • Training methodology for learned optimizers is gradient-based
  • Learned optimizers have reduced performance and stability when applied in different circumstances
  • Learned optimizer performance is often highly dependent on random seed
  • Use dynamical systems theory to characterize stability of parameter dynamics
  • Propose changes to architecture and training of learned optimizers to improve stability and inductive bias
  • Demonstrate improved stability and performance of resulting learned optimizer
  • Learned optimization has seen a surge of interest due to success of deep learning methods
  • Meta-learning algorithms have been successful in few-shot learning
  • Early approaches to few-shot learning used blackbox models
  • Later approaches used algorithmic inductive biases such as gradient descent, metric learning, convex optimization, Bayesian inference, changepoint detection
  • Adaptive optimization algorithms developed to improve stability of learning algorithms
  • Line of work focused on meta-learning a neural network to choose hyperparameters
  • Blackbox optimizers developed to exploit expressivity of neural networks
  • Curriculum learning used to stabilize learning
  • Truncated zeroth order optimization and reinforcement learning used to address chaotic behavior
  • Fixed momentum operators used to create stable-by-design hidden states

Problem statement

  • Problem of training a neural network by optimizing its parameters
  • Loss function acts on parameters and data
  • Learned optimizer is defined by a parameteric update function
  • Goal is to minimize meta-loss
  • Examining optimizer performance in noisy quadratic setting
  • Loss at each timestep is randomly sampled
  • Update of the form φ t+1 = φ t − (g t + P ∇ t )

Nominal terms shift the region of stability

  • Stability is necessary for achieving optimizers that reduce the loss over long training horizons.
  • The gradient of the loss with respect to the meta-trained optimizer is polynomial in the parameters of A with degree 2t - 1.
  • Instability of the dynamics of φ generally implies instability of the gradient of the loss with respect to P.
  • ρ(A) ≤ 1 if λ min (A) ≥ -1/α and λ max (A) ≤ 1/α.
  • The addition of the nominal term α ≥ 0 gives stability margin.
  • Regularization of P is necessary to limit the magnitude of α.

Preconditioners can stabilize and simplify the design of update dynamics

  • Adaptive preconditioners are used in machine learning
  • Adagrad, RMSProp and Adam are fundamental tools in training neural networks
  • Transformation normalizes step size and makes steps isotropic
  • Stability is given by Lemma 2
  • Preconditioner applied to output of learned optimizer is more robust than normalizing input

Adaptive nominal terms improve robust stability

  • Choosing a nominal α > 0 biases optimizer dynamics toward descent/stability.
  • Decreasing learning rate over course of training improves performance.
  • Setting P = P * − αI with nominal gradient term g t = ∇ t yields closed loop dynamics optimal with respect to meta-loss.
  • Robust stability conditions guarantee stability of dynamical system for all realizations of disturbance.

Non-markovian optimizers require joint stability

  • Non-Markovian (or hidden state) dynamics play a role in learned optimizers.
  • Momentum and stable hidden states have been studied in optimization and learned optimization.
  • Joint stability of hidden state and parameter dynamics must be considered.
  • Momentum accelerates convergence in full-batch optimization and filters stochastic gradients.

Designing a better learned optimizer

  • Present regularization strategies and architectural modifications for learned optimizers
  • Refer to optimizer as STAR learned optimizer
  • Present experiments on in-meta-distribution and out-of-meta-distribution performance

New design features in the star optimizer

  • We add a nominal term to bias toward descent
  • We use magnitude control to make the nominal term equivalent to a hyperparametercontroller learned optimizer
  • We apply weight decay to discourage violations of the upper bound on stable eigenvalues
  • We use an adaptive inverse EMA preconditioner to better bias our model toward descent

Overview of the star optimizer

  • Small_fc_lopt optimizer is a MLP with 197 weights
  • It takes inputs such as parameter value, gradient, and features such as gradient momentum at different timescales
  • Outputs an update to the parameter
  • Applied in parallel to all parameters in the model being trained
  • Parameterized as expression with β 1 and β 2 set to 0.001
  • MLP with two hidden layers, each with a width of four
  • Modified update takes the form of a combined AggMo and Adam term
  • Blackbox term structured as expression with β 1 , β 2 , β 3 , β 4 as hand-specified constants
  • Neural network has three output heads
  • Addition of extra head and multiplicative factor adds 5 parameters to the original 197-parameter optimizer

The star optimizer improves performance

  • STAR optimizer shows improved stability and faster metatraining than baselines
  • STAR optimizer is architecturally identical to the blackbox optimizer
  • STAR optimizer continues to perform strongly even when optimizing for more steps than it was applied for during meta-training
  • STAR optimizer generalizes well to different network sizes, nonlinearities, and datasets
  • STAR optimizer outperforms the baseline blackbox model and is comparable to a hyperparameter-tuned Adam model

Discussion

  • This paper has addressed the role of stability and inductive biases in learned optimizers
  • Incorporating stabilizing inductive biases results in strong performance both in-distribution and out-of-distribution
  • Designing inductive biases for learned optimizers is a new line of work
  • Stabilization of the computation graph is a sufficient condition to guarantee that an optimizer moves downhill on the training loss landscape
  • It is unclear the extent to which neural network training resembles convex optimization problems
  • Nominal terms, preconditioning, and weight decay improve the inductive bias of learned optimizers
  • Nominal terms improve the stability and trainability of learned optimizers
  • STAR learned optimizer trains faster, achieves better fully-trained optimizers, and has better stability than prior learned optimizers
  • STAR optimizer is able to generalize to never before seen problems
  • Weight decay between 0.1 and 0.5 is a good tradeoff between stabilizing behavior and avoiding over-damping
  • Adam-style preconditioning improves the blackbox term
  • Including the blackbox term improves performance at all points in meta-training and generalization