Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Learned optimizers are neural networks that can accelerate machine learning model training
- Blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set
- Tools from dynamical systems are used to investigate the inductive biases and stability properties of optimization algorithms
- Simple modifications to a learned optimizer’s architecture and meta-training procedure can lead to improved stability and inductive bias
- The resulting learned optimizer outperforms the current state of the art and is capable of generalizing to tasks far different from those it was meta-trained on
Paper Content
Introduction
- Algorithms for stochastic non-convex optimization are important for neural network training
- Choice of optimization algorithm and hyperparameters is critical for model performance and training stability
- Few formal rules for choosing optimizers and hyperparameters
- Learned optimizers have been proposed
- Training methodology for learned optimizers is gradient-based
- Learned optimizers have reduced performance and stability when applied in different circumstances
- Learned optimizer performance is often highly dependent on random seed
- Use dynamical systems theory to characterize stability of parameter dynamics
- Propose changes to architecture and training of learned optimizers to improve stability and inductive bias
- Demonstrate improved stability and performance of resulting learned optimizer
Related work
- Learned optimization has seen a surge of interest due to success of deep learning methods
- Meta-learning algorithms have been successful in few-shot learning
- Early approaches to few-shot learning used blackbox models
- Later approaches used algorithmic inductive biases such as gradient descent, metric learning, convex optimization, Bayesian inference, changepoint detection
- Adaptive optimization algorithms developed to improve stability of learning algorithms
- Line of work focused on meta-learning a neural network to choose hyperparameters
- Blackbox optimizers developed to exploit expressivity of neural networks
- Curriculum learning used to stabilize learning
- Truncated zeroth order optimization and reinforcement learning used to address chaotic behavior
- Fixed momentum operators used to create stable-by-design hidden states
Problem statement
- Problem of training a neural network by optimizing its parameters
- Loss function acts on parameters and data
- Learned optimizer is defined by a parameteric update function
- Goal is to minimize meta-loss
- Examining optimizer performance in noisy quadratic setting
- Loss at each timestep is randomly sampled
- Update of the form φ t+1 = φ t − (g t + P ∇ t )
Nominal terms shift the region of stability
- Stability is necessary for achieving optimizers that reduce the loss over long training horizons.
- The gradient of the loss with respect to the meta-trained optimizer is polynomial in the parameters of A with degree 2t - 1.
- Instability of the dynamics of φ generally implies instability of the gradient of the loss with respect to P.
- ρ(A) ≤ 1 if λ min (A) ≥ -1/α and λ max (A) ≤ 1/α.
- The addition of the nominal term α ≥ 0 gives stability margin.
- Regularization of P is necessary to limit the magnitude of α.
Preconditioners can stabilize and simplify the design of update dynamics
- Adaptive preconditioners are used in machine learning
- Adagrad, RMSProp and Adam are fundamental tools in training neural networks
- Transformation normalizes step size and makes steps isotropic
- Stability is given by Lemma 2
- Preconditioner applied to output of learned optimizer is more robust than normalizing input
Adaptive nominal terms improve robust stability
- Choosing a nominal α > 0 biases optimizer dynamics toward descent/stability.
- Decreasing learning rate over course of training improves performance.
- Setting P = P * − αI with nominal gradient term g t = ∇ t yields closed loop dynamics optimal with respect to meta-loss.
- Robust stability conditions guarantee stability of dynamical system for all realizations of disturbance.
Non-markovian optimizers require joint stability
- Non-Markovian (or hidden state) dynamics play a role in learned optimizers.
- Momentum and stable hidden states have been studied in optimization and learned optimization.
- Joint stability of hidden state and parameter dynamics must be considered.
- Momentum accelerates convergence in full-batch optimization and filters stochastic gradients.
Designing a better learned optimizer
- Present regularization strategies and architectural modifications for learned optimizers
- Refer to optimizer as STAR learned optimizer
- Present experiments on in-meta-distribution and out-of-meta-distribution performance
New design features in the star optimizer
- We add a nominal term to bias toward descent
- We use magnitude control to make the nominal term equivalent to a hyperparametercontroller learned optimizer
- We apply weight decay to discourage violations of the upper bound on stable eigenvalues
- We use an adaptive inverse EMA preconditioner to better bias our model toward descent
Overview of the star optimizer
- Small_fc_lopt optimizer is a MLP with 197 weights
- It takes inputs such as parameter value, gradient, and features such as gradient momentum at different timescales
- Outputs an update to the parameter
- Applied in parallel to all parameters in the model being trained
- Parameterized as expression with β 1 and β 2 set to 0.001
- MLP with two hidden layers, each with a width of four
- Modified update takes the form of a combined AggMo and Adam term
- Blackbox term structured as expression with β 1 , β 2 , β 3 , β 4 as hand-specified constants
- Neural network has three output heads
- Addition of extra head and multiplicative factor adds 5 parameters to the original 197-parameter optimizer
The star optimizer improves performance
- STAR optimizer shows improved stability and faster metatraining than baselines
- STAR optimizer is architecturally identical to the blackbox optimizer
- STAR optimizer continues to perform strongly even when optimizing for more steps than it was applied for during meta-training
- STAR optimizer generalizes well to different network sizes, nonlinearities, and datasets
- STAR optimizer outperforms the baseline blackbox model and is comparable to a hyperparameter-tuned Adam model
Discussion
- This paper has addressed the role of stability and inductive biases in learned optimizers
- Incorporating stabilizing inductive biases results in strong performance both in-distribution and out-of-distribution
- Designing inductive biases for learned optimizers is a new line of work
- Stabilization of the computation graph is a sufficient condition to guarantee that an optimizer moves downhill on the training loss landscape
- It is unclear the extent to which neural network training resembles convex optimization problems
- Nominal terms, preconditioning, and weight decay improve the inductive bias of learned optimizers
- Nominal terms improve the stability and trainability of learned optimizers
- STAR learned optimizer trains faster, achieves better fully-trained optimizers, and has better stability than prior learned optimizers
- STAR optimizer is able to generalize to never before seen problems
- Weight decay between 0.1 and 0.5 is a good tradeoff between stabilizing behavior and avoiding over-damping
- Adam-style preconditioning improves the blackbox term
- Including the blackbox term improves performance at all points in meta-training and generalization