Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

- Learned optimizers are neural networks that can accelerate machine learning model training
- Blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set
- Tools from dynamical systems are used to investigate the inductive biases and stability properties of optimization algorithms
- Simple modifications to a learned optimizer’s architecture and meta-training procedure can lead to improved stability and inductive bias
- The resulting learned optimizer outperforms the current state of the art and is capable of generalizing to tasks far different from those it was meta-trained on

Paper Content

Introduction

Algorithms for stochastic non-convex optimization are important for neural network training
Choice of optimization algorithm and hyperparameters is critical for model performance and training stability
Few formal rules for choosing optimizers and hyperparameters
Learned optimizers have been proposed
Training methodology for learned optimizers is gradient-based
Learned optimizers have reduced performance and stability when applied in different circumstances
Learned optimizer performance is often highly dependent on random seed
Use dynamical systems theory to characterize stability of parameter dynamics
Propose changes to architecture and training of learned optimizers to improve stability and inductive bias
Demonstrate improved stability and performance of resulting learned optimizer

Related work

Learned optimization has seen a surge of interest due to success of deep learning methods
Meta-learning algorithms have been successful in few-shot learning
Early approaches to few-shot learning used blackbox models
Later approaches used algorithmic inductive biases such as gradient descent, metric learning, convex optimization, Bayesian inference, changepoint detection
Adaptive optimization algorithms developed to improve stability of learning algorithms
Line of work focused on meta-learning a neural network to choose hyperparameters
Blackbox optimizers developed to exploit expressivity of neural networks
Curriculum learning used to stabilize learning
Truncated zeroth order optimization and reinforcement learning used to address chaotic behavior
Fixed momentum operators used to create stable-by-design hidden states

Problem statement

Problem of training a neural network by optimizing its parameters
Loss function acts on parameters and data
Learned optimizer is defined by a parametric update function
Goal is to minimize meta-loss
Examining optimizer performance in noisy quadratic setting
Loss at each timestep is randomly sampled
Update of the form $\phi_{t+1} = \phi_t - (g_t + P \nabla_t)$
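
The noisy quadratic setting above can be simulated in a few lines. The following is a minimal scalar sketch, not the paper's experimental setup: the loss `0.5 * h * phi**2`, the curvature range, the scalar `p` standing in for P, and the choice of nominal term `g = alpha * grad` are all assumptions made for illustration.

```python
import random


def run_noisy_quadratic(p, alpha, steps=50, phi0=1.0, seed=0):
    """Scalar sketch of phi_{t+1} = phi_t - (g_t + P * grad_t).

    Assumptions (not from the source): the per-step loss is
    0.5 * h_t * phi^2 with h_t sampled uniformly, and the nominal
    term is g_t = alpha * grad_t.
    """
    rng = random.Random(seed)
    phi = phi0
    losses = []
    for _ in range(steps):
        h = rng.uniform(0.5, 1.5)          # hypothetical curvature for this step
        losses.append(0.5 * h * phi * phi)  # loss before the update
        grad = h * phi                      # gradient of the sampled loss
        g = alpha * grad                    # assumed nominal term
        phi = phi - (g + p * grad)          # update from the text
    return losses


stable_losses = run_noisy_quadratic(p=0.2, alpha=0.5)
unstable_losses = run_noisy_quadratic(p=3.0, alpha=0.5)
```

With the small gain the loss shrinks over the horizon, while the large gain makes each step overshoot and the loss grows, which is the behavior the stability analysis in the next section is meant to characterize.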

Nominal terms shift the region of stability

Stability is necessary for achieving optimizers that reduce the loss over long training horizons.
The gradient of the loss with respect to the meta-trained optimizer is polynomial in the parameters of A with degree 2t - 1.
Instability of the dynamics of φ generally implies instability of the gradient of the loss with respect to P.
$\rho(A) \le 1$ if $\lambda_{\min}(A) \ge -1/\alpha$ and $\lambda_{\max}(A) \le 1/\alpha$.
Adding a nominal term with $\alpha \ge 0$ provides a stability margin.
Regularization of P is necessary to limit the magnitude of α.
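
One way to see how a nominal term shifts the stable region is a scalar simplification. The sketch below assumes a fixed curvature `h > 0`, a scalar `p` in place of P, and a nominal term `alpha * grad`; under those assumptions the closed-loop multiplier is `1 - (alpha + p) * h`. This parameterization is chosen for illustration and is not necessarily the matrix A used in the paper.

```python
def stable_p_interval(alpha, h):
    """Interval of scalar p with |1 - (alpha + p) * h| <= 1, for h > 0."""
    if h <= 0:
        raise ValueError("curvature h must be positive")
    # |1 - (alpha + p) h| <= 1  <=>  0 <= (alpha + p) h <= 2
    return (-alpha, 2.0 / h - alpha)


def is_stable(p, alpha, h):
    """Check the scalar stability condition directly."""
    return abs(1.0 - (alpha + p) * h) <= 1.0


without_nominal = stable_p_interval(alpha=0.0, h=1.0)
with_nominal = stable_p_interval(alpha=0.5, h=1.0)
```

In this simplified picture the interval of admissible `p` shifts left by `alpha`: a slightly negative learned gain, which would push uphill on its own, is still stable once the nominal term is present. The upper end of the interval also moves down, which is consistent with the note above that P needs to be regularized.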

Preconditioners can stabilize and simplify the design of update dynamics

Adaptive preconditioners are used in machine learning
Adagrad, RMSProp and Adam are fundamental tools in training neural networks
Transformation normalizes step size and makes steps isotropic
Stability is given by Lemma 2
Preconditioner applied to output of learned optimizer is more robust than normalizing input
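
The idea of preconditioning the optimizer's output can be sketched with an RMSProp-style running average. This is an illustrative stand-in under stated assumptions, not the paper's exact preconditioner: the class name, the decay value, the absence of bias correction, and the use of the raw gradient as the "optimizer output" are all hypothetical.

```python
import math


class EmaPreconditioner:
    """Divide an update by the root of an EMA of squared gradients."""

    def __init__(self, decay=0.9, eps=1e-8):
        self.decay = decay
        self.eps = eps
        self.v = 0.0  # running average of squared gradients

    def __call__(self, grad, raw_update):
        self.v = self.decay * self.v + (1.0 - self.decay) * grad * grad
        return raw_update / (math.sqrt(self.v) + self.eps)


# Feed the same gradient sequence at two scales; the raw update is the
# gradient itself, standing in for the output of a learned optimizer.
grads = [1.0, 2.0, -1.0]
small, large = EmaPreconditioner(), EmaPreconditioner()
steps_small = [small(g, g) for g in grads]
steps_large = [large(100.0 * g, 100.0 * g) for g in grads]
```

Both sequences produce nearly identical steps, which illustrates the step-size normalization described above: the scale of the gradients no longer determines the scale of the update.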

Adaptive nominal terms improve robust stability

Choosing a nominal α > 0 biases optimizer dynamics toward descent/stability.
Decreasing learning rate over course of training improves performance.
Setting $P = P^* - \alpha I$ with nominal gradient term $g_t = \alpha \nabla_t$ yields closed-loop dynamics that are optimal with respect to the meta-loss, since the combined update reduces to $P^* \nabla_t$.
Robust stability conditions guarantee stability of dynamical system for all realizations of disturbance.
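
Robust stability can be illustrated in the same scalar simplification by treating the curvature as an unknown disturbance confined to an interval. The sketch below assumes a scalar gain `alpha + p` and a curvature `h` anywhere in `[h_lo, h_hi]`; the function name and the interval model are hypothetical.

```python
def robustly_stable(p, alpha, h_lo, h_hi):
    """True if |1 - (alpha + p) * h| <= 1 for every h in [h_lo, h_hi].

    The multiplier is affine in h, so its absolute value is convex and
    attains its maximum over the interval at one of the two endpoints.
    """
    gain = alpha + p
    return all(abs(1.0 - gain * h) <= 1.0 for h in (h_lo, h_hi))


ok = robustly_stable(p=0.2, alpha=0.5, h_lo=0.5, h_hi=1.5)
too_aggressive = robustly_stable(p=1.0, alpha=0.5, h_lo=0.5, h_hi=1.5)
uphill = robustly_stable(p=-0.6, alpha=0.5, h_lo=0.5, h_hi=1.5)
```

The first setting is stable for every curvature in the interval, the second overshoots at the largest curvature, and the third has a negative total gain and therefore ascends for every curvature.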

Non-Markovian optimizers require joint stability

Non-Markovian (or hidden state) dynamics play a role in learned optimizers.
Momentum and stable hidden states have been studied in optimization and learned optimization.
Joint stability of hidden state and parameter dynamics must be considered.
Momentum accelerates convergence in full-batch optimization and filters stochastic gradients.
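
To make the joint-stability point concrete, the sketch below uses heavy-ball momentum on a scalar quadratic as a hand-designed example of a hidden state. The recursion `m' = beta * m + h * phi`, `phi' = phi - lr * m'` and all names are assumptions for illustration; the joint state `(phi, m)` then evolves under a 2x2 matrix whose spectral radius decides stability.

```python
import cmath


def joint_spectral_radius(lr, beta, h):
    """Spectral radius of the joint (phi, m) dynamics for heavy-ball momentum.

    m_next   = beta * m + h * phi
    phi_next = phi - lr * m_next = (1 - lr * h) * phi - lr * beta * m
    """
    a, b = 1.0 - lr * h, -lr * beta   # row mapping (phi, m) -> phi_next
    c, d = h, beta                    # row mapping (phi, m) -> m_next
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return max(abs((trace + disc) / 2.0), abs((trace - disc) / 2.0))


stable = joint_spectral_radius(lr=0.1, beta=0.9, h=1.0)
large_step = joint_spectral_radius(lr=4.0, beta=0.9, h=1.0)
unstable_state = joint_spectral_radius(lr=0.1, beta=1.1, h=1.0)
```

The third case uses a small step size that would be stable for the parameters alone, yet the joint system is unstable because the hidden state itself grows. This is why the parameter dynamics and hidden-state dynamics cannot be checked separately.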

Designing a better learned optimizer

Present regularization strategies and architectural modifications for learned optimizers
Refer to optimizer as STAR learned optimizer
Present experiments on in-meta-distribution and out-of-meta-distribution performance

New design features in the STAR optimizer

We add a nominal term to bias toward descent
We use magnitude control to make the nominal term equivalent to a hyperparameter-controller learned optimizer
We apply weight decay to discourage violations of the upper bound on stable eigenvalues
We use an adaptive inverse EMA preconditioner to better bias our model toward descent

Overview of the STAR optimizer

The small_fc_lopt optimizer is an MLP with 197 weights
It takes inputs such as parameter value, gradient, and features such as gradient momentum at different timescales
Outputs an update to the parameter
Applied in parallel to all parameters in the model being trained
Parameterized as an expression with $\beta_1$ and $\beta_2$ set to 0.001
MLP with two hidden layers, each with a width of four
Modified update takes the form of a combined AggMo and Adam term
Blackbox term structured as an expression with $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ as hand-specified constants
Neural network has three output heads
Addition of extra head and multiplicative factor adds 5 parameters to the original 197-parameter optimizer
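
The per-parameter structure described above can be sketched as a tiny MLP that is evaluated independently for every parameter. This is an illustration only: the three input features, the single output head, the ReLU nonlinearity, and the initialization are hypothetical, so the weight count here does not match the 197-weight optimizer in the text.

```python
import random

HIDDEN = 4        # hidden width stated in the text
NUM_FEATURES = 3  # hypothetical: parameter value, gradient, one momentum feature


def init_mlp(seed=0):
    """Two hidden layers of width HIDDEN and one scalar output."""
    rng = random.Random(seed)
    sizes = [NUM_FEATURES, HIDDEN, HIDDEN, 1]
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers


def mlp_forward(layers, features):
    """Evaluate the MLP on the feature vector of a single parameter."""
    x = list(features)
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # ReLU on hidden layers only
    return x[0]


def apply_to_all(layers, params, grads, momenta):
    """Apply the same small network to every parameter independently."""
    return [p - mlp_forward(layers, (p, g, m))
            for p, g, m in zip(params, grads, momenta)]


def count_weights(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


layers = init_mlp()
new_params = apply_to_all(layers, [1.0, 1.0, -2.0], [0.5, 0.5, 0.1], [0.4, 0.4, 0.0])
```

Because the network weights are shared across parameters, the optimizer's size is independent of the size of the model being trained, and two parameters with identical features receive identical updates.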

The STAR optimizer improves performance

STAR optimizer shows improved stability and faster meta-training than baselines
STAR optimizer is architecturally identical to the blackbox optimizer
STAR optimizer continues to perform strongly even when optimizing for more steps than it was applied for during meta-training
STAR optimizer generalizes well to different network sizes, nonlinearities, and datasets
STAR optimizer outperforms the baseline blackbox model and is comparable to a hyperparameter-tuned Adam model

Discussion

This paper has addressed the role of stability and inductive biases in learned optimizers
Incorporating stabilizing inductive biases results in strong performance both in-distribution and out-of-distribution
Designing inductive biases for learned optimizers is a new line of work
Stabilization of the computation graph is a sufficient condition to guarantee that an optimizer moves downhill on the training loss landscape
It is unclear to what extent neural network training resembles convex optimization problems
Nominal terms, preconditioning, and weight decay improve the inductive bias of learned optimizers
Nominal terms improve the stability and trainability of learned optimizers
STAR learned optimizer trains faster, achieves better fully-trained optimizers, and has better stability than prior learned optimizers
STAR optimizer is able to generalize to never-before-seen problems
Weight decay between 0.1 and 0.5 is a good tradeoff between stabilizing behavior and avoiding over-damping
Adam-style preconditioning improves the blackbox term
Including the blackbox term improves performance at all points in meta-training and generalization
