Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Deep learning models are trained with hand-designed optimizers.
  • This work leverages the same scaling approach behind the success of deep learning to learn versatile optimizers.
  • An optimizer for deep learning is trained, which is a small neural network that ingests gradients and outputs parameter updates.
  • The optimizer is meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks.
  • The optimizer requires no hyperparameter tuning and automatically adapts to the specifics of the problem being optimized.

Paper Content

Introduction

  • Scaling up has been crucial to the success of deep learning
  • Scaling brings with it several challenges
  • Meta-learning has not seen the same explosion of scale
  • Scaling meta-learning systems is harder
  • Large training dataset corresponds to a large set of tasks
  • Meta-training over a diverse set of realistic tasks is computationally costly
  • Balance between overhead and performance must be struck
  • VeLO is a versatile learned optimizer that is meta-trained at a far greater scale
  • VeLO performs better with less computational overhead
  • VeLO requires no hyperparameter tuning

Try velo

  • VeLO is designed to be easy to use with any JAX model
  • A Colab notebook is provided to train a variety of test problems
  • Learned optimization uses an optimizer to train a neural network with parameters
  • SGD update is written as U SGD (g; α) = αg
  • Learned optimizers replace the fixed-form update rule with a more flexible form
  • Update rule U (•; θ) is a neural network with meta-parameters θ
  • Meta-training finds the meta-parameters θ of the update rule U (•; θ)
  • Meta-loss is commonly defined as the average training loss or the loss at the end of training
  • Meta-training can be done with backprop, reinforcement learning, or evolution
  • ES provides unbiased estimates of the gradient of a Gaussian-smoothed meta-loss
  • Inner-training applies the learned optimizer for N optimization steps
  • Meta-loss L φ N (θ) is used to evaluate the performance of the trained model
  • Learned optimizer’s architecture is adapted to match the architecture of the problem it is optimizing

Methods: large scale optimizer training

  • Learned optimizer architecture
  • Distribution of tasks for meta-training
  • Details of meta-training

Learned optimizer architecture

  • Hierarchical hypernetwork is used to make the optimizer computationally efficient and expressive
  • Two-layer hierarchy of computation: per-tensor LSTM and per-parameter MLP
  • Capacity of network can be increased by adding computation to either per-tensor or per-parameter network
  • Per-tensor LSTM uses 512 hidden-units and a variety of input features
  • Per-parameter MLP uses 2-hidden layer, 4-hidden unit MLP and weights are generated by per-tensor model

Data: a diverse distribution of tasks

  • Supervised learning does not have standard, large-scale distributions of tasks for learned optimizer training
  • Metz et al. [2020a] constructed a parametric task distribution for meta-training
  • Tasks are generated by sampling a model family, training dataset, training loss function, and architectural hyperparameters
  • Examples of task augmentations include re-parameterizing weight tensors, estimating gradients only in subspaces, introducing asynchronicity in gradient calculation, and changing floating-point precision
  • Tasks vary greatly in run time, so rejection sampling is used to meta-train on fast tasks more frequently than slow ones

Meta-training

  • Measure of optimization performance is final training loss
  • Targeting final loss yields better results than average loss
  • Meta-gradient estimated using Evolution Strategies
  • Full length unrolls used for meta-gradient evaluation
  • Multi-task training used to encourage meta-generalization
  • Gradients normalized to unit-length before averaging
  • Curriculum used to speed up meta-training
  • Vectorization and compilation used to make better use of accelerators
  • Data-parallel training on massive cluster

Evaluating learned optimizers

  • Evaluation of optimizers in machine learning is difficult
  • Evaluating learned optimizers is more difficult
  • Three distinct benchmarks presented: VeLOdrome, MLCommons algorithms test problems, and real-world state of the art models
  • Investigation of problems in which VeLO fails or underperforms baselines

Velodrome: a canonical evaluation set of 83 tasks

  • VeLOdrome is a set of 83 deep learning models designed to be trained on a single accelerator in under an hour
  • 15 hand-designed optimizers are evaluated
  • Hyperparameter tuning is explored, ranging from 15 trials to 1000 trials
  • Nesterov accelerated AdamW is used as a more aggressively-tuned baseline optimizer
  • OptList is used to achieve better performance with only 10 hyperparameter evaluation trials
  • Learning curves for >1 million trained models are open-sourced
  • Performance is compared to a baseline optimizer and reported as improvement in training time
  • VeLO outperforms all learning rate-tuned optimizers on all problems
  • VeLO performs best on an MLP with dropout and worst on an LSTM with a large vocabulary

Mlcommons tasks

  • Investigated a set of 6 tasks from MLCommons algorithms track
  • Tasks are out-of-distribution due to their scale
  • Compared to Adam baseline with learning rate warm up and cosine decay
  • Hyperparameters chosen by MLCommons organizers
  • Compared VeLO applied for same and 75% of training iterations
  • Results presented in Figure 4

Generalization to tasks unlike any used for meta-training

  • VeLO outperforms learning rate-tuned baseline optimizers without any tuning
  • VeLO performs comparably to or better than the extensively tuned NAdamW optimizer
  • VeLO matches or outperforms Adam on all ViT-B models
  • VeLO outperforms the standard stochastic gradient descent (SGD) optimizer with piecewise constant learning rates
  • VeLO outperforms the tuned Adam baseline for training large-scale Decision Transformers
  • VeLO outperforms the published baseline for knowledge distillation
  • VeLO outperforms the hand-tuned baseline for GNN applied to scientific data
  • VeLO performs less well on longer training runs

Limitations and failure cases

  • VeLO is not comparable in performance to a tuned baseline when asked to optimize tasks which are very unlike tasks in its meta-training distribution
  • Performance decreases relative to baselines with larger model size
  • Performance lags behind tuned baselines for the largest models
  • Performance lags behind baselines, or even decreases, as model size is increased beyond approximately 500M parameters
  • VeLO struggles to extend training beyond its initially specified number of iterations
  • Naïve Continue, Increase Steps, Reset Steps, and Complete training in a single run are explored for continuation from a completed VeLO training run
  • Naïve Continue performs the worst in 52% of experiments
  • Increase Steps performs better than all other continuation methods in 49% of the experiments
  • Doing the complete training in a single run performs best overall in 38% of the experiments
  • VeLO has limited ability to generalize from a non-random initial state
  • VeLO fails to escape the local minimum of “standing still” when optimizing ES

Understanding learned optimizer behavior

  • Learned optimizers can behave differently than hand-designed optimizers
  • Learned optimizers can be difficult to understand due to their complex form
  • This section of the paper experiments with VeLO’s behavior

Velo adapts to training horizon

  • Learning rate decay is a technique to increase performance near the end of training
  • VeLO uses information about the fraction through training to adjust its parameter update steps
  • Step size varies between different tasks and parameter tensors, and VeLO learns an implicit schedule with warm-ups and decay

Velo can have a larger critical batch size than baseline optimizers

  • Training on large batches is important for distributed training
  • Prior work has shown that performance starts to fall off after a certain batch size (critical batch size)
  • Optimizers that use momentum and/or preconditioners can increase the critical batch size
  • VeLO can make effective use of batches larger than the critical batch size
  • VeLO has a critical batch size around 10x larger than baseline methods
  • Idea of meta-learning update rules for optimization dates back to Bengio et al. [1992] and Runarsson and Jonsson [2000]
  • Andrychowicz et al. [2016] revived the topic by meta-training an RNN-parameterized learned optimizer
  • Extensive work studying different meta-training techniques
  • Task specific learned optimizers proposed in many settings
  • Improvements to the LSTM learned optimizer architecture proposed
  • Hyperparameter controllers-neural networks which dynamically set the hyperparameters of existing optimizers-explored
  • Work to meta-learn symbolic parameter update rules

Discussion and outlook

  • Demonstrated improvements in generality and performance of learned optimizers
  • Scaled up meta-training compute and dataset size
  • Made architectural improvements
  • Resulting optimizer, VeLO, has no hyperparameters
  • Outperforms heavily hyperparameter-tuned baselines across 80+ optimization tasks

Open questions

  • Improving the learned optimizer architecture
  • Leveraging second order or quasi-second order information
  • Using more available information about the target task
  • Targeting both validation and training loss
  • Reverse-engineering the techniques used by the learned optimizer
  • Improving the computational efficiency of meta-training

Meta-learned algorithms are the future

  • Machine learning algorithms can outperform hand-designed heuristics.
  • Compute and data requirements for training a neural network are much higher than for most supervised learning tasks.
  • Meta-learning has been demonstrated in neural architecture search and data augmentation.
  • Hand-designed components of machine learning pipelines may be replaced by meta-learned algorithms.

B learned optimizer architecture

  • VeLO is a hierarchical structure
  • VeLO has components and input features
  • VeLO is connected to hyperparameter-controller optimizer architectures
  • VeLO’s complexity is related to the complexity of the underlying model

B.1 extended architecture overview

  • Hand-designed optimizers are relatively inexpensive to compute.
  • Learned optimizers can be more complex and expensive to compute.
  • Metz et al. showed that learned optimizers can be parameterized by a small neural network and still outperform hand-designed optimizers.
  • Small models lack the capacity to perform well across many tasks.
  • Hierarchy in learned optimizer parameterizations can increase capacity without additional compute cost.

B.2 optimizer state: non-learned accumulators

  • Track iteration number
  • Track momentum at 3 timescales
  • Track squared gradients and Adafactor-style accumulators
  • Track loss features

B.3 the tensor-level recurrent network

  • Fraction of training remaining is used as an input
  • Loss features are used to tell if optimization is converging or diverging
  • First and second moment features are used
  • Tensor rank is used as an additional feature
  • A small neural network is used to mix information across tensors

B.4 the parameter-level network

  • The per-parameter optimizer is based on a previous study.
  • Different weights are computed for each tensor.
  • Features are normalized and passed into the weights produced by the tensor-level LSTM hypernetwork.
  • The weight update to the parameter vector is calculated using a formula.

B.5 comparing our architecture to hyperparameter controllers

  • Hyperparameter controller-based learned optimizers automate the tuning of common optimizers
  • HyperNetwork LSTM produces a small number of weights which control the weights of a neural network, similar to how hyperparameters act

B.6 experimental validation of our hypernet compared to past work

  • Explored training different learned optimizers on a small scale, multi-task distribution of problems
  • Showed meta-training learning curves for each optimizer
  • Found HyperNetwork based optimizer had low meta-loss, implying higher capacity
  • Hierarchical learned optimizers also performed well, but more expensive to compute

B.7 understanding computational costs of velo

  • Predicting exact run times of deep learning systems is complex
  • Designed a model for performance based on 3 components: constant execution overhead, per-tensor scaling cost, and per-parameter scaling cost
  • Assumed a constant tensor count combining both the constant overhead and the per-tensor scaling
  • Tested model with different optimization algorithms on 3 square matrices
  • Fitted parameters of model using gradient descent in log space
  • Model is well-aligned with data and strongly predictive
  • Learned optimizer has higher overhead and significantly larger cost per parameter
  • Cost per parameter grows roughly linearly with size of per-parameter MLP
  • Per-parameter cost remains constant with size of per-tensor LSTM, but overhead grows
  • Overhead goes away as parameter count grows
  • Computational costs can be reduced by distributing optimizer computation
  • Optimizer overhead ranges from minimal to 2x the cost of training
  • Room to optimize with unstructured sparsity and lower precision

C data: a large, diverse distribution of meta-training tasks

  • Training deep learning models requires large datasets
  • Some tasks are too computationally intensive to be used for meta-training
  • This paper proposes a procedural generative process for machine learning tasks
  • This process includes a mixture of parametric tasks definitions, such as image classification, image generative modeling, and language modeling
  • The paper also proposes a task configuration language and a form of data-augmentation (task-augmentation) to increase diversity

D meta-training

  • Meta-training procedure is described
  • Meta-objective is discussed
  • Gradient estimation strategy is discussed
  • Curriculum strategy for training is detailed
  • Multi-task training is discussed
  • Meta-training objective is the training loss at the end of inner-training
  • Objective is computed in expectation over several sampled quantities

D.2 meta-gradient estimation

  • Leverage ES with antithetic samples to estimate gradients of the meta-objective
  • Initialize target tasks with same parameter values
  • Use same batches of data for each antithetic pair and same batches to evaluate performance
  • Opt for ES rather than more sophisticated methods for simplicity and lower communication overhead

D.3 curriculum and meta-generalization

  • Meta-training on larger scale problems is expensive
  • Use curricula and optimizer to save training time
  • Experiment to show effect of curriculum over unroll length
  • Balance gradient contributions from each task
  • Normalize length of each meta-gradient independently per task

E meta-training infrastructure

  • Distributed meta-training requires significant compute
  • Open source code and components can be adapted to any distributed computing engine
  • Requires a distributed file system and a way to perform Remote Procedure Calls

E.1 one learner, many workers

  • Meta-training set of machines consists of a single learner process and worker processes
  • Learner process runs on a single TPU chip and is more reliable
  • Learner process saves weights of the learned optimizer to the distributed file system
  • Evaluation chief process monitors the file system and enqueues evaluation configurations
  • Evaluation workers train models and report back results to the evaluation chief
  • Multiple evaluation clusters monitor performance on different length unrolls and evaluation tasks

E.3 task selection and staleness of meta-gradients

  • Training infrastructure consists of workers sampling tasks from a task distribution
  • Compiling the computation graph takes multiple minutes
  • To reduce waste, multiple gradients are computed for a given static task configuration
  • Different settings and dynamic task configurations are used for each gradient estimate
  • Sampling fast tasks produces more meta-gradient estimates than slow tasks
  • Machines sample more than one task to reduce gradient auto-correlation
  • Elements of tasks are resampled to ensure coverage of the task distribution
  • Meta-gradients are sent to a centralized learner and weight updates are applied with Adam
  • Gradients that are too old are thrown out to combat staleness
  • Computational load is different than large supervised models and cheaper hardware is used
  • Compute infrastructure consists of TPU chips scattered across the globe
  • Outer-batch size is up to ∼100K in largest models
  • Peak capacity is over 4K accelerators spanning 3 generations of TPU hardware
  • Meta-training took approximately one month

E.4 interactive hyperparameter tuning

  • Made modifications to running job, including increasing batch size, lowering learning rate, changing distribution of inner-problems, and modifying maximum staleness
  • Inspired by success of online hyperparameter modification in OpenAI Five
  • Training divided into 4 phases, each using previous weights as starting point
  • Monitored variety of losses and qualitatively tested trained learned optimizers
  • Visualized different phases in a variety of ways

E.5 areas of improvement

  • TPU utilization is low (<10%) due to mismatch in hardware design
  • GPUs are worse due to kernel execution overhead
  • Compile time overhead slows down computation
  • Sensitivity to cluster status can change training dynamics
  • Further iterations of infrastructure needed to be more synchronous