Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep learning models are trained with hand-designed optimizers.
This work leverages the same scaling approach behind the success of deep learning to learn versatile optimizers.
The learned optimizer is itself a small neural network that ingests gradients and outputs parameter updates.
The optimizer is meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks.
The optimizer requires no hyperparameter tuning and automatically adapts to the specifics of the problem being optimized.

Paper Content

Introduction

Scaling up has been crucial to the success of deep learning
Scaling brings with it several challenges
Meta-learning has not seen the same explosion of scale
Scaling meta-learning systems is harder
Large training dataset corresponds to a large set of tasks
Meta-training over a diverse set of realistic tasks is computationally costly
Balance between overhead and performance must be struck
VeLO is a versatile learned optimizer that is meta-trained at a far greater scale
VeLO performs better with less computational overhead
VeLO requires no hyperparameter tuning

Try VeLO

VeLO is designed to be easy to use with any JAX model
A Colab notebook is provided to train a variety of test problems
Learned optimization uses an optimizer to train a neural network with parameters φ
SGD update is written as U_SGD(g; α) = αg, where g is the gradient and α is the learning rate
Learned optimizers replace the fixed-form update rule with a more flexible form
Update rule U (•; θ) is a neural network with meta-parameters θ
Meta-training finds the meta-parameters θ of the update rule U (•; θ)
Meta-loss is commonly defined as the average training loss or the loss at the end of training
Meta-training can be done with backprop, reinforcement learning, or evolution
ES provides unbiased estimates of the gradient of a Gaussian-smoothed meta-loss
Inner-training applies the learned optimizer for N optimization steps
Meta-loss L_{φ_N}(θ) is used to evaluate the performance of the trained model
Learned optimizer’s architecture is adapted to match the architecture of the problem it is optimizing
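To make the vocabulary above concrete, here is a minimal, dependency-free sketch of a fixed-form update rule, a parameterized update rule U(·; θ), and an inner-training loop whose final loss serves as the meta-loss. Everything in it is an assumption made for illustration: the toy quadratic problem, the function names, the linear form of the "learned" rule (VeLO uses neural networks), and the convention that the update is subtracted from the parameters.

```python
# Sketch only: a toy inner problem and a stand-in for a learned update rule.
# Assumed convention: phi <- phi - U(g).

def loss(phi):
    # Toy inner problem: a quadratic bowl with its minimum at (1, -2).
    return (phi[0] - 1.0) ** 2 + (phi[1] + 2.0) ** 2

def grad(phi):
    # Analytic gradient of the toy loss.
    return [2.0 * (phi[0] - 1.0), 2.0 * (phi[1] + 2.0)]

def u_sgd(g, alpha):
    # Fixed-form rule: U_SGD(g; alpha) = alpha * g.
    return [alpha * gi for gi in g]

def u_learned(g, m, theta):
    # Flexible rule U(.; theta): a linear mix of gradient and momentum.
    # This linear form is a placeholder for a neural network.
    return [theta[0] * gi + theta[1] * mi for gi, mi in zip(g, m)]

def inner_train(theta, n_steps=20):
    # Apply the update rule for N steps from a fixed initialization and
    # return the final training loss, used here as the meta-loss for theta.
    phi = [0.0, 0.0]
    m = [0.0, 0.0]
    for _ in range(n_steps):
        g = grad(phi)
        m = [0.9 * mi + gi for mi, gi in zip(m, g)]
        step = u_learned(g, m, theta)
        phi = [p - s for p, s in zip(phi, step)]
    return loss(phi)
```

With theta = [0.1, 0.0] the parameterized rule reduces to SGD with α = 0.1, so meta-training can be read as a search over theta for values that make inner_train(theta) small.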

Methods: large scale optimizer training

Learned optimizer architecture
Distribution of tasks for meta-training
Details of meta-training

Learned optimizer architecture

Hierarchical hypernetwork is used to make the optimizer computationally efficient and expressive
Two-layer hierarchy of computation: per-tensor LSTM and per-parameter MLP
Capacity of network can be increased by adding computation to either per-tensor or per-parameter network
Per-tensor LSTM uses 512 hidden-units and a variety of input features
Per-parameter network is an MLP with 2 hidden layers of 4 hidden units each; its weights are generated by the per-tensor model
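The two-level hierarchy can be sketched as follows. This is an illustration under stated assumptions, not the paper's architecture: the per-tensor LSTM is replaced by a single linear layer, the per-parameter MLP has one hidden layer instead of two, and the feature choices (mean absolute gradient, gradient, momentum) and all function names are hypothetical.

```python
import math

def per_tensor_net(tensor_features, theta):
    # Stand-in for the per-tensor LSTM: one linear layer mapping tensor-level
    # features to the weights of the per-parameter MLP.
    # theta holds one row of meta-parameters per generated weight.
    return [sum(w * f for w, f in zip(row, tensor_features)) for row in theta]

def per_param_mlp(features, weights, hidden=4):
    # Tiny MLP applied independently to every parameter.
    # One hidden layer of 4 units is used here for brevity.
    n_in = len(features)
    w1 = weights[: n_in * hidden]
    w2 = weights[n_in * hidden : n_in * hidden + hidden]
    h = [math.tanh(sum(w1[j * n_in + i] * features[i] for i in range(n_in)))
         for j in range(hidden)]
    return sum(w2[j] * h[j] for j in range(hidden))

def update_tensor(grads, momenta, theta):
    # Runs the per-tensor network once, then the cheap MLP once per parameter.
    # Assumed tensor-level features: mean absolute gradient and a bias term.
    mean_abs = sum(abs(g) for g in grads) / len(grads)
    weights = per_tensor_net([mean_abs, 1.0], theta)
    # Assumed per-parameter features: gradient and momentum.
    return [per_param_mlp([g, m], weights) for g, m in zip(grads, momenta)]

# 2 inputs * 4 hidden units + 4 output weights = 12 generated weights.
theta = [[0.0, 0.1] for _ in range(12)]
updates = update_tensor([1.0, -1.0], [0.0, 0.0], theta)
```

The point of the design is that the expensive computation happens once per tensor, while the work repeated for every parameter stays very small.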

Data: a diverse distribution of tasks

Supervised learning does not have standard, large-scale distributions of tasks for learned optimizer training
Metz et al. [2020a] constructed a parametric task distribution for meta-training
Tasks are generated by sampling a model family, training dataset, training loss function, and architectural hyperparameters
Examples of task augmentations include re-parameterizing weight tensors, estimating gradients only in subspaces, introducing asynchronicity in gradient calculation, and changing floating-point precision
Tasks vary greatly in run time, so rejection sampling is used to meta-train on fast tasks more frequently than slow ones
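A minimal sketch of runtime-aware rejection sampling is shown below. The acceptance rule (accept with probability reference_time / runtime, capped at 1), the task tuples, and the function name are assumptions chosen for illustration; the summary does not state the exact rule used in the paper.

```python
import random

def sample_task(tasks, rng, reference_time=1.0):
    # tasks: list of (name, estimated_runtime_seconds) pairs (hypothetical).
    # Propose a task uniformly, then accept it with probability
    # min(1, reference_time / runtime), so slow tasks are rejected more often.
    # At least one task should have runtime <= reference_time so that the
    # loop is guaranteed to accept something.
    while True:
        name, runtime = rng.choice(tasks)
        if rng.random() < min(1.0, reference_time / runtime):
            return name

rng = random.Random(0)
tasks = [("fast_mlp", 1.0), ("slow_transformer", 100.0)]
counts = {"fast_mlp": 0, "slow_transformer": 0}
for _ in range(2000):
    counts[sample_task(tasks, rng)] += 1
```

Under this rule the slow task is still sampled occasionally, so it stays in the meta-training distribution while consuming a smaller share of the proposals.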

Meta-training

Measure of optimization performance is final training loss
Targeting final loss yields better results than average loss
Meta-gradient estimated using Evolution Strategies
Full length unrolls used for meta-gradient evaluation
Multi-task training used to encourage meta-generalization
Gradients normalized to unit-length before averaging
Curriculum used to speed up meta-training
Vectorization and compilation used to make better use of accelerators
Data-parallel training on massive cluster

Evaluating learned optimizers

Evaluation of optimizers in machine learning is difficult
Evaluating learned optimizers is more difficult
Three distinct benchmarks presented: VeLOdrome, MLCommons algorithms test problems, and real-world state of the art models
Investigation of problems in which VeLO fails or underperforms baselines

VeLOdrome: a canonical evaluation set of 83 tasks

VeLOdrome is a set of 83 deep learning models designed to be trained on a single accelerator in under an hour
15 hand-designed optimizers are evaluated
Hyperparameter tuning is explored, ranging from 15 trials to 1000 trials
Nesterov accelerated AdamW is used as a more aggressively-tuned baseline optimizer
OptList is used to achieve better performance with only 10 hyperparameter evaluation trials
Learning curves for >1 million trained models are open-sourced
Performance is compared to a baseline optimizer and reported as improvement in training time
VeLO outperforms all learning rate-tuned optimizers on all problems
VeLO performs best on an MLP with dropout and worst on an LSTM with a large vocabulary

MLCommons tasks

Investigated a set of 6 tasks from MLCommons algorithms track
Tasks are out-of-distribution due to their scale
Compared to Adam baseline with learning rate warm up and cosine decay
Hyperparameters chosen by MLCommons organizers
VeLO is run both for the same number of training iterations as the baseline and for 75% of them
Results presented in Figure 4
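For reference, a learning rate schedule with warm-up and cosine decay, as used by the Adam baseline above, can be sketched like this. The linear warm-up shape, the decay to zero, and the parameter names are assumptions; the actual baseline hyperparameters were chosen by the MLCommons organizers and are not given here.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps):
    # Linear warm-up from 0 to peak_lr, then cosine decay back to 0.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at a few points of a hypothetical 100-step run.
schedule = [warmup_cosine_lr(s, 100, 1.0, 10) for s in (0, 5, 10, 100)]
```

A schedule like this must be told the total number of steps in advance, which is also the information VeLO receives when it is asked to train for a given horizon.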

Generalization to tasks unlike any used for meta-training

VeLO outperforms learning rate-tuned baseline optimizers without any tuning
VeLO performs comparably to or better than the extensively tuned NAdamW optimizer
VeLO matches or outperforms Adam on all ViT-B models
VeLO outperforms the standard stochastic gradient descent (SGD) optimizer with piecewise constant learning rates
VeLO outperforms the tuned Adam baseline for training large-scale Decision Transformers
VeLO outperforms the published baseline for knowledge distillation
VeLO outperforms the hand-tuned baseline for GNN applied to scientific data
VeLO performs less well on longer training runs

Limitations and failure cases

VeLO is not comparable in performance to a tuned baseline when asked to optimize tasks which are very unlike tasks in its meta-training distribution
Performance decreases relative to baselines with larger model size
Performance lags behind tuned baselines for the largest models
Performance lags behind baselines, or even decreases, as model size is increased beyond approximately 500M parameters
VeLO struggles to extend training beyond its initially specified number of iterations
Four strategies are compared for extending a completed VeLO training run: Naïve Continue, Increase Steps, Reset Steps, and doing the complete training in a single run
Naïve Continue performs the worst in 52% of experiments
Increase Steps performs better than all other continuation methods in 49% of the experiments
Doing the complete training in a single run performs best overall in 38% of the experiments
VeLO has limited ability to generalize from a non-random initial state
VeLO fails to escape the local minimum of “standing still” when optimizing ES

Understanding learned optimizer behavior

Learned optimizers can behave differently than hand-designed optimizers
Learned optimizers can be difficult to understand due to their complex form
This section of the paper experiments with VeLO’s behavior

VeLO adapts to training horizon

Learning rate decay is a technique to increase performance near the end of training
VeLO uses information about the fraction through training to adjust its parameter update steps
Step size varies between different tasks and parameter tensors, and VeLO learns an implicit schedule with warm-ups and decay

VeLO can have a larger critical batch size than baseline optimizers

Training on large batches is important for distributed training
Prior work has shown that performance starts to fall off after a certain batch size (critical batch size)
Optimizers that use momentum and/or preconditioners can increase the critical batch size
VeLO can make effective use of batches larger than the critical batch size
VeLO has a critical batch size around 10x larger than baseline methods

Related work

Idea of meta-learning update rules for optimization dates back to Bengio et al. [1992] and Runarsson and Jonsson [2000]
Andrychowicz et al. [2016] revived the topic by meta-training an RNN-parameterized learned optimizer
Extensive work studying different meta-training techniques
Task specific learned optimizers proposed in many settings
Improvements to the LSTM learned optimizer architecture proposed
Hyperparameter controllers (neural networks that dynamically set the hyperparameters of existing optimizers) have been explored
Work to meta-learn symbolic parameter update rules

Discussion and outlook

Demonstrated improvements in generality and performance of learned optimizers
Scaled up meta-training compute and dataset size
Made architectural improvements
Resulting optimizer, VeLO, has no hyperparameters
Outperforms heavily hyperparameter-tuned baselines across 80+ optimization tasks

Open questions

Improving the learned optimizer architecture
Leveraging second order or quasi-second order information
Using more available information about the target task
Targeting both validation and training loss
Reverse-engineering the techniques used by the learned optimizer
Improving the computational efficiency of meta-training

Meta-learned algorithms are the future

Machine learning algorithms can outperform hand-designed heuristics.
Compute and data requirements for meta-training are much higher than for most supervised learning tasks.
Meta-learning has been demonstrated in neural architecture search and data augmentation.
Hand-designed components of machine learning pipelines may be replaced by meta-learned algorithms.

B learned optimizer architecture

VeLO is a hierarchical structure
VeLO has components and input features
VeLO is connected to hyperparameter-controller optimizer architectures
VeLO’s complexity is related to the complexity of the underlying model

B.1 extended architecture overview

Hand-designed optimizers are relatively inexpensive to compute.
Learned optimizers can be more complex and expensive to compute.
Metz et al. showed that learned optimizers can be parameterized by a small neural network and still outperform hand-designed optimizers.
Small models lack the capacity to perform well across many tasks.
Hierarchy in learned optimizer parameterizations can increase capacity without additional compute cost.

B.2 optimizer state: non-learned accumulators

Track iteration number
Track momentum at 3 timescales
Track squared gradients and Adafactor-style accumulators
Track loss features

B.3 the tensor-level recurrent network

Fraction of training remaining is used as an input
Loss features are used to tell if optimization is converging or diverging
First and second moment features are used
Tensor rank is used as an additional feature
A small neural network is used to mix information across tensors

B.4 the parameter-level network

The per-parameter optimizer is based on a previous study.
Different weights are computed for each tensor.
Features are normalized and passed through an MLP whose weights are produced by the tensor-level LSTM hypernetwork.
The weight update to the parameter vector is calculated using a formula.

B.5 comparing our architecture to hyperparameter controllers

Hyperparameter controller-based learned optimizers automate the tuning of common optimizers
HyperNetwork LSTM produces a small number of weights which control the weights of a neural network, similar to how hyperparameters act

B.6 experimental validation of our hypernet compared to past work

Explored training different learned optimizers on a small scale, multi-task distribution of problems
Showed meta-training learning curves for each optimizer
Found HyperNetwork based optimizer had low meta-loss, implying higher capacity
Hierarchical learned optimizers also performed well, but more expensive to compute

B.7 understanding computational costs of VeLO

Predicting exact run times of deep learning systems is complex
Designed a model for performance based on 3 components: constant execution overhead, per-tensor scaling cost, and per-parameter scaling cost
Assumed a constant tensor count combining both the constant overhead and the per-tensor scaling
Tested model with different optimization algorithms on 3 square matrices
Fitted parameters of model using gradient descent in log space
Model is well-aligned with data and strongly predictive
Learned optimizer has higher overhead and significantly larger cost per parameter
Cost per parameter grows roughly linearly with size of per-parameter MLP
Per-parameter cost remains constant with size of per-tensor LSTM, but overhead grows
Overhead goes away as parameter count grows
Computational costs can be reduced by distributing optimizer computation
Optimizer overhead ranges from minimal to 2x the cost of training
Room to optimize with unstructured sparsity and lower precision
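A stripped-down version of such a cost model is sketched below: step time is a fixed overhead (which, with a constant tensor count, absorbs the per-tensor cost) plus a cost proportional to the parameter count, and both coefficients are fitted by gradient descent on a squared error in log space. The timings are synthetic numbers made up for illustration, and the finite-difference gradients and step-halving rule are simplifications, not the paper's fitting procedure.

```python
import math

def predicted_time(n_params, log_overhead, log_per_param):
    # Step time = fixed overhead + cost proportional to the parameter count.
    # Coefficients are stored as logs so they stay positive during fitting.
    return math.exp(log_overhead) + math.exp(log_per_param) * n_params

def log_space_loss(data, log_overhead, log_per_param):
    # Mean squared error between log-predicted and log-measured times.
    errors = [math.log(predicted_time(n, log_overhead, log_per_param))
              - math.log(t) for n, t in data]
    return sum(e * e for e in errors) / len(errors)

def fit(data, steps=500, lr=0.1, eps=1e-6):
    # Gradient descent with central finite differences. A step is kept only
    # if it lowers the loss; otherwise the step size is halved.
    a, b = 0.0, 0.0
    best = log_space_loss(data, a, b)
    for _ in range(steps):
        grad_a = (log_space_loss(data, a + eps, b)
                  - log_space_loss(data, a - eps, b)) / (2 * eps)
        grad_b = (log_space_loss(data, a, b + eps)
                  - log_space_loss(data, a, b - eps)) / (2 * eps)
        new_a, new_b = a - lr * grad_a, b - lr * grad_b
        new_loss = log_space_loss(data, new_a, new_b)
        if new_loss < best:
            a, b, best = new_a, new_b, new_loss
        else:
            lr *= 0.5
    return a, b

def overhead_fraction(n_params, log_overhead, log_per_param):
    # Share of the predicted step time spent on the fixed overhead.
    return math.exp(log_overhead) / predicted_time(
        n_params, log_overhead, log_per_param)

# Synthetic timings: 1 ms overhead and 10 ns per parameter (made up).
data = [(n, 1e-3 + 1e-8 * n) for n in (1e3, 1e5, 1e7, 1e9)]
a, b = fit(data)
```

Evaluating overhead_fraction at small and large parameter counts shows the behavior noted above: the fixed overhead dominates for small tensors and becomes negligible as the parameter count grows.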

C data: a large, diverse distribution of meta-training tasks

Training deep learning models requires large datasets
Some tasks are too computationally intensive to be used for meta-training
This paper proposes a procedural generative process for machine learning tasks
This process includes a mixture of parametric tasks definitions, such as image classification, image generative modeling, and language modeling
The paper also proposes a task configuration language and a form of data-augmentation (task-augmentation) to increase diversity

D meta-training

Meta-training procedure is described
Meta-objective is discussed
Gradient estimation strategy is discussed
Curriculum strategy for training is detailed
Multi-task training is discussed
Meta-training objective is the training loss at the end of inner-training
Objective is computed in expectation over several sampled quantities

D.2 meta-gradient estimation

Leverage ES with antithetic samples to estimate gradients of the meta-objective
Initialize target tasks with same parameter values
Use same batches of data for each antithetic pair and same batches to evaluate performance
Opt for ES rather than more sophisticated methods for simplicity and lower communication overhead
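The antithetic estimator can be sketched in a few lines. The function names, the default sigma and pair count, and the way a shared seed is passed to the meta-loss are assumptions; in this sketch the seed stands in for the shared initialization and data batches that both members of an antithetic pair use.

```python
import random

def es_gradient(meta_loss, theta, rng, sigma=0.1, n_pairs=32):
    # Antithetic ES: evaluate the meta-loss at theta + eps and theta - eps,
    # then weight the perturbation by the difference of the two losses.
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, sigma) for _ in theta]
        # Both evaluations of a pair receive the same seed, so they can use
        # the same initial parameters and the same batches of data.
        seed = rng.randrange(2 ** 31)
        plus = meta_loss([t + e for t, e in zip(theta, eps)], seed)
        minus = meta_loss([t - e for t, e in zip(theta, eps)], seed)
        scale = (plus - minus) / (2.0 * sigma ** 2 * n_pairs)
        grad = [g + scale * e for g, e in zip(grad, eps)]
    return grad

def toy_meta_loss(theta, seed):
    # Deterministic stand-in for inner-training; the seed is unused here.
    return sum(t * t for t in theta)

estimate = es_gradient(toy_meta_loss, [1.0, -2.0], random.Random(0),
                       sigma=0.1, n_pairs=2000)
```

For the toy loss the true gradient at [1, -2] is [2, -4], and the estimate approaches it as the number of pairs grows. Each worker only needs to return a few scalars per pair, which is one reason ES keeps communication overhead low.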

D.3 curriculum and meta-generalization

Meta-training on larger scale problems is expensive
Use curricula and optimizer to save training time
Experiment to show effect of curriculum over unroll length
Balance gradient contributions from each task
Normalize length of each meta-gradient independently per task
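Per-task normalization before averaging can be sketched as below. The function names are hypothetical, and the handling of an all-zero gradient (passed through unchanged) is an assumption.

```python
import math

def normalize(grad):
    # Rescale a meta-gradient to unit Euclidean length.
    # An all-zero gradient is returned unchanged.
    norm = math.sqrt(sum(g * g for g in grad))
    return list(grad) if norm == 0.0 else [g / norm for g in grad]

def average_task_gradients(task_grads):
    # Normalize each task's meta-gradient independently, then average,
    # so no single task dominates because its loss has a larger scale.
    unit = [normalize(g) for g in task_grads]
    return [sum(col) / len(unit) for col in zip(*unit)]

# Two tasks whose raw gradient magnitudes differ by a factor of 10,000.
combined = average_task_gradients([[100.0, 0.0], [0.0, 0.01]])
```

In the example both tasks contribute equally to the combined direction even though their raw gradients differ by four orders of magnitude.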

E meta-training infrastructure

Distributed meta-training requires significant compute
Open source code and components can be adapted to any distributed computing engine
Requires a distributed file system and a way to perform Remote Procedure Calls

E.1 one learner, many workers

Meta-training set of machines consists of a single learner process and worker processes
Learner process runs on a single TPU chip and is more reliable
Learner process saves weights of the learned optimizer to the distributed file system
Evaluation chief process monitors the file system and enqueues evaluation configurations
Evaluation workers train models and report back results to the evaluation chief
Multiple evaluation clusters monitor performance on different length unrolls and evaluation tasks

E.3 task selection and staleness of meta-gradients

Training infrastructure consists of workers sampling tasks from a task distribution
Compiling the computation graph takes multiple minutes
To reduce waste, multiple gradients are computed for a given static task configuration
Different settings and dynamic task configurations are used for each gradient estimate
Sampling fast tasks produces more meta-gradient estimates than slow tasks
Machines sample more than one task to reduce gradient auto-correlation
Elements of tasks are resampled to ensure coverage of the task distribution
Meta-gradients are sent to a centralized learner and weight updates are applied with Adam
Gradients that are too old are thrown out to combat staleness
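The learner-side handling of stale meta-gradients can be sketched as a filter followed by an Adam step. The data layout, the staleness measure (difference in learner steps), and the Adam hyperparameters are assumptions for illustration; the summary does not give the values used in the paper.

```python
import math

def filter_stale(gradients, learner_step, max_staleness):
    # gradients: list of (computed_at_step, gradient) pairs sent by workers.
    # Keep only gradients computed from sufficiently recent optimizer weights.
    return [g for step, g in gradients if learner_step - step <= max_staleness]

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One standard Adam update with bias correction; t starts at 1.
    m = [b1 * mi + (1.0 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1.0 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - b1 ** t) for mi in m]
    v_hat = [vi / (1.0 - b2 ** t) for vi in v]
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Three worker results arriving while the learner is at step 10.
incoming = [(9, [1.0]), (2, [5.0]), (10, [1.0])]
fresh = filter_stale(incoming, learner_step=10, max_staleness=3)
mean_grad = [sum(col) / len(fresh) for col in zip(*fresh)]
theta, m, v = adam_step([0.0], mean_grad, [0.0], [0.0], t=1)
```

In the example the gradient computed at step 2 is discarded because it is 8 learner steps old, and the remaining two are averaged before the update.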
Computational load is different than large supervised models and cheaper hardware is used
Compute infrastructure consists of TPU chips scattered across the globe
Outer-batch size is up to ∼100K in largest models
Peak capacity is over 4K accelerators spanning 3 generations of TPU hardware
Meta-training took approximately one month

E.4 interactive hyperparameter tuning

Made modifications to running job, including increasing batch size, lowering learning rate, changing distribution of inner-problems, and modifying maximum staleness
Inspired by success of online hyperparameter modification in OpenAI Five
Training divided into 4 phases, each using previous weights as starting point
Monitored variety of losses and qualitatively tested trained learned optimizers
Visualized different phases in a variety of ways

E.5 areas of improvement

TPU utilization is low (<10%) due to mismatch in hardware design
GPUs are worse due to kernel execution overhead
Compile time overhead slows down computation
Sensitivity to cluster status can change training dynamics
Further iterations of infrastructure needed to be more synchronous
