Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep learning models are trained with hand-designed optimizers.
- This work leverages the same scaling approach behind the success of deep learning to learn versatile optimizers.
- An optimizer for deep learning is trained, which is a small neural network that ingests gradients and outputs parameter updates.
- The optimizer is meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks.
- The optimizer requires no hyperparameter tuning and automatically adapts to the specifics of the problem being optimized.
Paper Content
Introduction
- Scaling up has been crucial to the success of deep learning
- Scaling brings with it several challenges
- Meta-learning has not seen the same explosion of scale
- Scaling meta-learning systems is harder
- Large training dataset corresponds to a large set of tasks
- Meta-training over a diverse set of realistic tasks is computationally costly
- Balance between overhead and performance must be struck
- VeLO is a versatile learned optimizer that is meta-trained at a far greater scale
- VeLO performs better with less computational overhead
- VeLO requires no hyperparameter tuning
Try VeLO
- VeLO is designed to be easy to use with any JAX model
- A Colab notebook is provided to train a variety of test problems
- Learned optimization uses an optimizer to train a neural network with parameters
- The SGD update is written as $U_{\mathrm{SGD}}(g; \alpha) = \alpha g$
- Learned optimizers replace the fixed-form update rule with a more flexible form
- Update rule $U(\cdot; \theta)$ is a neural network with meta-parameters $\theta$
- Meta-training finds the meta-parameters $\theta$ of the update rule $U(\cdot; \theta)$
- Meta-loss is commonly defined as the average training loss or the loss at the end of training
- Meta-training can be done with backprop, reinforcement learning, or evolution
- ES provides unbiased estimates of the gradient of a Gaussian-smoothed meta-loss
- Inner-training applies the learned optimizer for N optimization steps
- Meta-loss $L_{\phi_N}(\theta)$ is used to evaluate the performance of the trained model
- Learned optimizer’s architecture is adapted to match the architecture of the problem it is optimizing
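The relationship between a hand-designed rule, a learned rule, inner-training, and the meta-loss can be sketched in a few lines. This is only an illustration under stated assumptions: the 1-D quadratic inner problem, the one-hidden-unit form of `u_learned`, and the specific meta-parameter values are hypothetical and are not VeLO's architecture or tasks.

```python
import math

def loss(phi):
    """Toy inner problem (hypothetical): a 1-D quadratic with its minimum at 3."""
    return (phi - 3.0) ** 2

def grad(phi):
    """Gradient of the toy loss with respect to the inner parameter phi."""
    return 2.0 * (phi - 3.0)

def u_sgd(g, alpha):
    """Hand-designed rule: U_SGD(g; alpha) = alpha * g."""
    return alpha * g

def u_learned(g, theta):
    """Stand-in for U(.; theta): a one-hidden-unit network on the gradient.

    theta = (w1, b1, w2) plays the role of the meta-parameters. A real
    learned optimizer consumes many more features than the raw gradient.
    """
    w1, b1, w2 = theta
    return w2 * math.tanh(w1 * g + b1)

def inner_train(update_fn, hparams, phi0=0.0, num_steps=20):
    """Inner-training: apply the update rule for N steps, phi <- phi - U(g)."""
    phi = phi0
    for _ in range(num_steps):
        phi = phi - update_fn(grad(phi), hparams)
    return phi

def meta_loss(theta, num_steps=20):
    """Final-loss meta-objective: inner loss after N learned-optimizer steps."""
    return loss(inner_train(u_learned, theta, num_steps=num_steps))

sgd_final = loss(inner_train(u_sgd, 0.1))
learned_final = meta_loss((0.5, 0.0, 0.4))
```

Meta-training would then search over `theta` to reduce `meta_loss`, averaged over many inner problems rather than the single toy problem used here.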
Methods: large scale optimizer training
- Learned optimizer architecture
- Distribution of tasks for meta-training
- Details of meta-training
Learned optimizer architecture
- Hierarchical hypernetwork is used to make the optimizer computationally efficient and expressive
- Two-layer hierarchy of computation: per-tensor LSTM and per-parameter MLP
- Capacity of network can be increased by adding computation to either per-tensor or per-parameter network
- Per-tensor LSTM uses 512 hidden-units and a variety of input features
- Per-parameter network is a 2-hidden-layer, 4-hidden-unit MLP whose weights are generated by the per-tensor model
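The two-level hierarchy can be illustrated with a toy hypernetwork: a per-tensor model emits the weights of a tiny MLP, which is then applied to every parameter in that tensor. In this sketch the per-tensor LSTM is replaced by a plain linear map, and the feature choices, ReLU activation, and random meta-parameters are all assumptions made for illustration; only the 2-hidden-layer, 4-hidden-unit MLP shape comes from the summary above.

```python
import random

HIDDEN = 4  # hidden units per layer of the per-parameter MLP

def mlp_weight_count(num_features):
    """Number of weights and biases for features -> 4 -> 4 -> 1."""
    return (num_features * HIDDEN + HIDDEN) + (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN + 1)

def per_tensor_network(tensor_features, theta, num_param_features):
    """Stand-in for the per-tensor LSTM: a linear map to the MLP's weights.

    theta holds one row of meta-parameters per generated weight.
    """
    n = mlp_weight_count(num_param_features)
    return [sum(w * f for w, f in zip(theta[i], tensor_features)) for i in range(n)]

def per_parameter_mlp(features, flat_weights):
    """Run the tiny MLP on one parameter's features using generated weights."""
    pos = 0

    def dense(x, n_out, activation):
        nonlocal pos
        out = []
        for _ in range(n_out):
            w = flat_weights[pos:pos + len(x)]
            pos += len(x)
            b = flat_weights[pos]
            pos += 1
            out.append(activation(sum(wi * xi for wi, xi in zip(w, x)) + b))
        return out

    def relu(z):
        return max(0.0, z)

    h = dense(features, HIDDEN, relu)
    h = dense(h, HIDDEN, relu)
    return dense(h, 1, lambda z: z)[0]

rng = random.Random(0)
tensor_features = [0.3, -1.2, 0.7]          # hypothetical per-tensor summary
param_features = [[0.1, -0.2], [0.5, 0.4]]  # hypothetical per-parameter features
n_weights = mlp_weight_count(2)
theta = [[rng.gauss(0.0, 0.5) for _ in tensor_features] for _ in range(n_weights)]

# The expensive model runs once per tensor; the cheap MLP runs once per parameter.
mlp_weights = per_tensor_network(tensor_features, theta, 2)
updates = [per_parameter_mlp(f, mlp_weights) for f in param_features]
```

The design point this shows is that capacity added to the per-tensor model is paid for once per tensor, while the per-parameter cost stays fixed by the size of the small MLP.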
Data: a diverse distribution of tasks
- Supervised learning does not have standard, large-scale distributions of tasks for learned optimizer training
- Metz et al. [2020a] constructed a parametric task distribution for meta-training
- Tasks are generated by sampling a model family, training dataset, training loss function, and architectural hyperparameters
- Examples of task augmentations include re-parameterizing weight tensors, estimating gradients only in subspaces, introducing asynchronicity in gradient calculation, and changing floating-point precision
- Tasks vary greatly in run time, so rejection sampling is used to meta-train on fast tasks more frequently than slow ones
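One simple way to bias sampling toward fast tasks is to accept a uniformly drawn task with a probability that shrinks with its run time. The acceptance rule `min(1, reference / runtime)`, the task names, and the run times below are assumptions chosen for illustration; the summary only states that rejection sampling favors fast tasks.

```python
import random

def sample_task(tasks, reference_seconds, rng):
    """Draw tasks uniformly; accept with probability min(1, reference / runtime)."""
    while True:
        name, runtime = rng.choice(tasks)
        if rng.random() < min(1.0, reference_seconds / runtime):
            return name

# Hypothetical tasks with estimated run times in seconds.
tasks = [("small_mlp", 1.0), ("large_transformer", 20.0)]
rng = random.Random(0)
counts = {"small_mlp": 0, "large_transformer": 0}
for _ in range(2000):
    counts[sample_task(tasks, 1.0, rng)] += 1
```

Under this rule the slow task is still sampled, just far less often, so the task distribution keeps its coverage while most compute goes to cheap tasks.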
Meta-training
- Measure of optimization performance is final training loss
- Targeting final loss yields better results than average loss
- Meta-gradient estimated using Evolution Strategies
- Full length unrolls used for meta-gradient evaluation
- Multi-task training used to encourage meta-generalization
- Gradients normalized to unit-length before averaging
- Curriculum used to speed up meta-training
- Vectorization and compilation used to make better use of accelerators
- Data-parallel training on massive cluster
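Normalizing each task's meta-gradient to unit length before averaging keeps tasks with large loss scales from dominating the update. A minimal sketch, with made-up gradient values and a small `eps` added as an assumption to avoid division by zero:

```python
import math

def normalize(g, eps=1e-12):
    """Scale a gradient vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in g))
    return [x / (norm + eps) for x in g]

def average_normalized(task_grads):
    """Normalize each task's gradient independently, then average them."""
    unit = [normalize(g) for g in task_grads]
    n = len(unit)
    return [sum(col) / n for col in zip(*unit)]

# Two hypothetical tasks whose raw gradients differ by five orders of magnitude.
task_grads = [[300.0, 400.0], [0.0, -0.002]]
avg = average_normalized(task_grads)
```

Without the normalization step, the first task's gradient would swamp the second; with it, both contribute a vector of length one.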
Evaluating learned optimizers
- Evaluation of optimizers in machine learning is difficult
- Evaluating learned optimizers is more difficult
- Three distinct benchmarks presented: VeLOdrome, MLCommons algorithms test problems, and real-world state of the art models
- Investigation of problems in which VeLO fails or underperforms baselines
VeLOdrome: a canonical evaluation set of 83 tasks
- VeLOdrome is a set of 83 deep learning models designed to be trained on a single accelerator in under an hour
- 15 hand-designed optimizers are evaluated
- Hyperparameter tuning is explored, ranging from 15 trials to 1000 trials
- Nesterov accelerated AdamW is used as a more aggressively-tuned baseline optimizer
- OptList is used to achieve better performance with only 10 hyperparameter evaluation trials
- Learning curves for >1 million trained models are open-sourced
- Performance is compared to a baseline optimizer and reported as improvement in training time
- VeLO outperforms all learning rate-tuned optimizers on all problems
- VeLO performs best on an MLP with dropout and worst on an LSTM with a large vocabulary
MLCommons tasks
- Investigated a set of 6 tasks from MLCommons algorithms track
- Tasks are out-of-distribution due to their scale
- Compared to Adam baseline with learning rate warm up and cosine decay
- Hyperparameters chosen by MLCommons organizers
- VeLO was run both for the same number of training iterations as the baseline and for 75% of that number
- Results presented in Figure 4
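For reference, a learning rate schedule with linear warm-up followed by cosine decay, the kind used by the Adam baseline, can be written as below. The function name, the linear warm-up shape, and every numeric value are assumptions for illustration; the hyperparameters actually chosen by the MLCommons organizers are not given in this summary.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings: 1000 steps, 100 of them warm-up, peak rate 1e-3.
schedule = [warmup_cosine_lr(s, 1000, 1e-3, 100) for s in range(1001)]
```

Note that this schedule needs `total_steps` up front, so shortening a run to 75% of the iterations means recomputing the whole schedule for the shorter horizon.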
Generalization to tasks unlike any used for meta-training
- VeLO outperforms learning rate-tuned baseline optimizers without any tuning
- VeLO performs comparably to or better than the extensively tuned NAdamW optimizer
- VeLO matches or outperforms Adam on all ViT-B models
- VeLO outperforms the standard stochastic gradient descent (SGD) optimizer with piecewise constant learning rates
- VeLO outperforms the tuned Adam baseline for training large-scale Decision Transformers
- VeLO outperforms the published baseline for knowledge distillation
- VeLO outperforms the hand-tuned baseline for GNN applied to scientific data
- VeLO performs less well on longer training runs
Limitations and failure cases
- VeLO is not comparable in performance to a tuned baseline when asked to optimize tasks which are very unlike tasks in its meta-training distribution
- Performance decreases relative to baselines with larger model size
- Performance lags behind tuned baselines for the largest models
- Performance lags behind baselines, or even decreases, as model size is increased beyond approximately 500M parameters
- VeLO struggles to extend training beyond its initially specified number of iterations
- Four strategies are compared for continuing from a completed VeLO training run: Naïve Continue, Increase Steps, Reset Steps, and doing the complete training in a single run
- Naïve Continue performs the worst in 52% of experiments
- Increase Steps performs better than all other continuation methods in 49% of the experiments
- Doing the complete training in a single run performs best overall in 38% of the experiments
- VeLO has limited ability to generalize from a non-random initial state
- VeLO fails to escape the local minimum of “standing still” when optimizing ES
Understanding learned optimizer behavior
- Learned optimizers can behave differently than hand-designed optimizers
- Learned optimizers can be difficult to understand due to their complex form
- This section of the paper experiments with VeLO’s behavior
VeLO adapts to training horizon
- Learning rate decay is a technique to increase performance near the end of training
- VeLO uses information about the fraction through training to adjust its parameter update steps
- Step size varies between different tasks and parameter tensors, and VeLO learns an implicit schedule with warm-ups and decay
VeLO can have a larger critical batch size than baseline optimizers
- Training on large batches is important for distributed training
- Prior work has shown that performance starts to fall off after a certain batch size (critical batch size)
- Optimizers that use momentum and/or preconditioners can increase the critical batch size
- VeLO can make effective use of batches larger than the critical batch size
- VeLO has a critical batch size around 10x larger than baseline methods
Related work
- Idea of meta-learning update rules for optimization dates back to Bengio et al. [1992] and Runarsson and Jonsson [2000]
- Andrychowicz et al. [2016] revived the topic by meta-training an RNN-parameterized learned optimizer
- Extensive work studying different meta-training techniques
- Task specific learned optimizers proposed in many settings
- Improvements to the LSTM learned optimizer architecture proposed
- Hyperparameter controllers (neural networks which dynamically set the hyperparameters of existing optimizers) have been explored
- Work to meta-learn symbolic parameter update rules
Discussion and outlook
- Demonstrated improvements in generality and performance of learned optimizers
- Scaled up meta-training compute and dataset size
- Made architectural improvements
- Resulting optimizer, VeLO, has no hyperparameters
- Outperforms heavily hyperparameter-tuned baselines across 80+ optimization tasks
Open questions
- Improving the learned optimizer architecture
- Leveraging second order or quasi-second order information
- Using more available information about the target task
- Targeting both validation and training loss
- Reverse-engineering the techniques used by the learned optimizer
- Improving the computational efficiency of meta-training
Meta-learned algorithms are the future
- Machine learning algorithms can outperform hand-designed heuristics.
- Compute and data requirements for meta-training are much higher than for most supervised learning tasks.
- Meta-learning has been demonstrated in neural architecture search and data augmentation.
- Hand-designed components of machine learning pipelines may be replaced by meta-learned algorithms.
B learned optimizer architecture
- VeLO is a hierarchical structure
- VeLO has components and input features
- VeLO is connected to hyperparameter-controller optimizer architectures
- VeLO’s complexity is related to the complexity of the underlying model
B.1 extended architecture overview
- Hand-designed optimizers are relatively inexpensive to compute.
- Learned optimizers can be more complex and expensive to compute.
- Metz et al. showed that learned optimizers can be parameterized by a small neural network and still outperform hand-designed optimizers.
- Small models lack the capacity to perform well across many tasks.
- Hierarchy in learned optimizer parameterizations can increase capacity without additional compute cost.
B.2 optimizer state: non-learned accumulators
- Track iteration number
- Track momentum at 3 timescales
- Track squared gradients and Adafactor-style accumulators
- Track loss features
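The non-learned state can be pictured as a handful of running statistics updated from each gradient. The sketch below tracks these for a single scalar parameter and shows the row and column statistics that an Adafactor-style accumulator would average over time. The decay constants and class layout are assumptions, loss features are omitted, and none of this is taken from the VeLO implementation.

```python
class Accumulators:
    """Running statistics for one scalar parameter (illustrative only)."""

    def __init__(self, decays=(0.5, 0.9, 0.99), rms_decay=0.999):
        self.iteration = 0
        self.decays = decays          # three momentum timescales (assumed values)
        self.rms_decay = rms_decay    # decay for the squared-gradient average
        self.momenta = [0.0 for _ in decays]
        self.second_moment = 0.0

    def update(self, g):
        """Fold one gradient value into every accumulator."""
        self.iteration += 1
        self.momenta = [d * m + (1.0 - d) * g
                        for d, m in zip(self.decays, self.momenta)]
        self.second_moment = (self.rms_decay * self.second_moment
                              + (1.0 - self.rms_decay) * g * g)

def factored_second_moment(grad_matrix):
    """Adafactor-style statistics: row and column means of squared gradients.

    Storing these two vectors instead of the full matrix of squared
    gradients cuts memory from rows * cols to rows + cols.
    """
    sq = [[x * x for x in row] for row in grad_matrix]
    row_means = [sum(r) / len(r) for r in sq]
    col_means = [sum(c) / len(c) for c in zip(*sq)]
    return row_means, col_means

acc = Accumulators()
acc.update(1.0)
rows, cols = factored_second_moment([[1.0, 2.0], [3.0, 4.0]])
```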
B.3 the tensor-level recurrent network
- Fraction of training remaining is used as an input
- Loss features are used to tell if optimization is converging or diverging
- First and second moment features are used
- Tensor rank is used as an additional feature
- A small neural network is used to mix information across tensors
B.4 the parameter-level network
- The per-parameter optimizer is based on a previous study.
- Different weights are computed for each tensor.
- Features are normalized and passed into the weights produced by the tensor-level LSTM hypernetwork.
- The weight update to the parameter vector is calculated using a formula.
B.5 comparing our architecture to hyperparameter controllers
- Hyperparameter controller-based learned optimizers automate the tuning of common optimizers
- HyperNetwork LSTM produces a small number of weights which control the weights of a neural network, similar to how hyperparameters act
B.6 experimental validation of our hypernet compared to past work
- Explored training different learned optimizers on a small scale, multi-task distribution of problems
- Showed meta-training learning curves for each optimizer
- Found HyperNetwork based optimizer had low meta-loss, implying higher capacity
- Hierarchical learned optimizers also performed well, but more expensive to compute
B.7 understanding computational costs of VeLO
- Predicting exact run times of deep learning systems is complex
- Designed a model for performance based on 3 components: constant execution overhead, per-tensor scaling cost, and per-parameter scaling cost
- Assumed a constant tensor count, which folds the constant overhead and the per-tensor scaling into a single fixed term
- Tested model with different optimization algorithms on 3 square matrices
- Fitted parameters of model using gradient descent in log space
- Model is well-aligned with data and strongly predictive
- Learned optimizer has higher overhead and significantly larger cost per parameter
- Cost per parameter grows roughly linearly with size of per-parameter MLP
- Per-parameter cost remains constant with size of per-tensor LSTM, but overhead grows
- Overhead goes away as parameter count grows
- Computational costs can be reduced by distributing optimizer computation
- Optimizer overhead ranges from minimal to 2x the cost of training
- Room to optimize with unstructured sparsity and lower precision
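The three-component performance model can be written as a simple additive formula, which makes it easy to see why fixed overhead matters less as the parameter count grows. The coefficient values below are invented for illustration and are not the fitted values from the paper.

```python
def step_time(num_tensors, num_params, overhead, per_tensor, per_param):
    """Additive cost model: constant overhead + per-tensor + per-parameter cost."""
    return overhead + per_tensor * num_tensors + per_param * num_params

def fixed_cost_fraction(num_tensors, num_params, overhead, per_tensor, per_param):
    """Share of step time spent on costs that do not scale with parameter count."""
    fixed = overhead + per_tensor * num_tensors
    total = step_time(num_tensors, num_params, overhead, per_tensor, per_param)
    return fixed / total

# Hypothetical coefficients, in seconds.
coeffs = dict(overhead=1e-3, per_tensor=1e-5, per_param=1e-9)
small = fixed_cost_fraction(num_tensors=10, num_params=1_000, **coeffs)
large = fixed_cost_fraction(num_tensors=10, num_params=1_000_000_000, **coeffs)
```

With the tensor count held constant, the first two terms act as a single fixed cost, and its share of the total shrinks toward zero as `num_params` increases.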
C data: a large, diverse distribution of meta-training tasks
- Training deep learning models requires large datasets
- Some tasks are too computationally intensive to be used for meta-training
- This paper proposes a procedural generative process for machine learning tasks
- This process includes a mixture of parametric tasks definitions, such as image classification, image generative modeling, and language modeling
- The paper also proposes a task configuration language and a form of data-augmentation (task-augmentation) to increase diversity
D meta-training
- Meta-training procedure is described
- Meta-objective is discussed
- Gradient estimation strategy is discussed
- Curriculum strategy for training is detailed
- Multi-task training is discussed
- Meta-training objective is the training loss at the end of inner-training
- Objective is computed in expectation over several sampled quantities
D.2 meta-gradient estimation
- Leverage ES with antithetic samples to estimate gradients of the meta-objective
- Initialize target tasks with same parameter values
- Use same batches of data for each antithetic pair and same batches to evaluate performance
- Opt for ES rather than more sophisticated methods for simplicity and lower communication overhead
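An ES gradient estimate with antithetic samples perturbs the meta-parameters in opposite directions and uses the difference in meta-loss. The sketch below also reuses one random seed for both members of a pair, mirroring the idea of sharing initializations and data batches within a pair. The toy `noisy_meta_loss`, the value of `sigma`, and the number of pairs are assumptions for illustration.

```python
import random

def es_antithetic_grad(meta_loss, theta, sigma, num_pairs, rng):
    """Estimate the gradient of a Gaussian-smoothed meta-loss.

    Each pair evaluates theta + sigma * eps and theta - sigma * eps with
    the same seed, so noise shared by both evaluations cancels.
    """
    grad = [0.0] * len(theta)
    for _ in range(num_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = [t + sigma * e for t, e in zip(theta, eps)]
        minus = [t - sigma * e for t, e in zip(theta, eps)]
        seed = rng.getrandbits(32)  # shared by both members of the pair
        diff = (meta_loss(plus, random.Random(seed))
                - meta_loss(minus, random.Random(seed)))
        for i, e in enumerate(eps):
            grad[i] += diff * e / (2.0 * sigma * num_pairs)
    return grad

def noisy_meta_loss(theta, task_rng):
    """Toy stand-in for an inner-training run: a smooth loss plus task noise."""
    noise = task_rng.gauss(0.0, 1.0)  # e.g. randomness from init and batches
    return theta[0] ** 2 + 3.0 * theta[1] + noise

estimate = es_antithetic_grad(noisy_meta_loss, [1.5, -0.5], sigma=0.1,
                              num_pairs=20000, rng=random.Random(0))
```

For this toy loss the true gradient at `[1.5, -0.5]` is `[3.0, 3.0]`, and the estimate approaches it as `num_pairs` grows. Each worker only needs to send back a scalar difference and a perturbation seed or vector, which is one reason ES keeps communication overhead low.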
D.3 curriculum and meta-generalization
- Meta-training on larger scale problems is expensive
- Use curricula and optimizer to save training time
- Experiment to show effect of curriculum over unroll length
- Balance gradient contributions from each task
- Normalize length of each meta-gradient independently per task
E meta-training infrastructure
- Distributed meta-training requires significant compute
- Open source code and components can be adapted to any distributed computing engine
- Requires a distributed file system and a way to perform Remote Procedure Calls
E.1 one learner, many workers
- Meta-training set of machines consists of a single learner process and worker processes
- Learner process runs on a single TPU chip and is more reliable
- Learner process saves weights of the learned optimizer to the distributed file system
- Evaluation chief process monitors the file system and enqueues evaluation configurations
- Evaluation workers train models and report back results to the evaluation chief
- Multiple evaluation clusters monitor performance on different length unrolls and evaluation tasks
E.3 task selection and staleness of meta-gradients
- Training infrastructure consists of workers sampling tasks from a task distribution
- Compiling the computation graph takes multiple minutes
- To reduce waste, multiple gradients are computed for a given static task configuration
- Different settings and dynamic task configurations are used for each gradient estimate
- Sampling fast tasks produces more meta-gradient estimates than slow tasks
- Machines sample more than one task to reduce gradient auto-correlation
- Elements of tasks are resampled to ensure coverage of the task distribution
- Meta-gradients are sent to a centralized learner and weight updates are applied with Adam
- Gradients that are too old are thrown out to combat staleness
- Computational load is different than large supervised models and cheaper hardware is used
- Compute infrastructure consists of TPU chips scattered across the globe
- Outer-batch size is up to ∼100K in largest models
- Peak capacity is over 4K accelerators spanning 3 generations of TPU hardware
- Meta-training took approximately one month
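Discarding stale meta-gradients can be sketched as a filter on the learner side: each gradient is tagged with the learner step of the weights it was computed from, and gradients that lag too far behind are dropped before averaging. The tuple layout, the `max_staleness` value, and the gradient numbers are hypothetical.

```python
def fresh_gradients(batch, learner_step, max_staleness):
    """Keep gradients whose source weights are at most max_staleness steps old."""
    return [(step, g) for step, g in batch if learner_step - step <= max_staleness]

# Each entry: (learner step of the weights used, meta-gradient from a worker).
batch = [(100, [0.1, -0.2]), (97, [0.3, 0.0]), (80, [5.0, 5.0])]
kept = fresh_gradients(batch, learner_step=100, max_staleness=5)

# Average the surviving gradients; the learner would then apply an Adam update.
mean_grad = [sum(col) / len(kept) for col in zip(*(g for _, g in kept))]
```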
E.4 interactive hyperparameter tuning
- Made modifications to running job, including increasing batch size, lowering learning rate, changing distribution of inner-problems, and modifying maximum staleness
- Inspired by success of online hyperparameter modification in OpenAI Five
- Training divided into 4 phases, each using previous weights as starting point
- Monitored variety of losses and qualitatively tested trained learned optimizers
- Visualized different phases in a variety of ways
E.5 areas of improvement
- TPU utilization is low (<10%) due to mismatch in hardware design
- GPUs are worse due to kernel execution overhead
- Compile time overhead slows down computation
- Sensitivity to cluster status can change training dynamics
- Further iterations of infrastructure needed to be more synchronous