Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Method to formulate algorithm discovery as program search
- Leverage efficient search techniques to explore an infinite and sparse program space
- Introduce program selection and simplification strategies
- Discovered a simple and effective optimization algorithm, $\textbf{Lion}$
- Compares Lion with widely used optimizers
- Lion boosts accuracy and saves compute
- Lion outperforms Adam in diffusion models
- Lion exhibits similar or better performance compared to Adam
- Performance gain grows with training batch size
- Requires smaller learning rate than Adam
- Examines limitations of Lion
Paper Content
Introduction
- Optimization algorithms are important for training neural networks
- There are many handcrafted optimizers
- AdamW and Adafactor are the standard optimizers for deep neural networks
- Lion offers improved accuracy, efficiency, and performance on language modeling
- Lion requires a smaller learning rate and larger decoupled weight decay
Symbolic discovery of algorithms
- Algorithm discovery is formulated as program search
- Symbolic representation (programs) used for advantages such as implementation, analysis, comprehension and transferability
- Program length used to estimate complexity and select simpler, more generalizable programs
- Method applicable to deep neural network training and other tasks
Program search space
- Search space should be flexible to enable discovery of novel algorithms
- Programs should be easy to analyze and incorporate into machine learning workflow
- Focus on high-level algorithmic design rather than low-level implementation details
- Programs contain functions operating over n-dimensional arrays
- Train function has inputs of model weight, gradient, and learning rate schedule value
- Train function has outputs of update to weight
- Extra variables initialized as zeros to collect historical information
- 45 common math functions used
- Mutations include inserting, deleting, and modifying statement
- Search space is infinite and sparse
- Random search of 2M programs on low-cost proxy task still inferior to AdamW
Efficient search techniques
- Regularized evolution is used to address the challenges posed by the infinite and sparse search space
- Tournament selection is used to pick the best performer as the parent
- The parent is then copied and mutated to produce a child algorithm
- Evolutionary search is warm-started with AdamW to accelerate the search
- Multiple restarts from the initial program are used to reduce variance in the search fitness
- Redundancies in the program space are pruned from three sources
- Abstract execution is used to detect programs with errors and identify redundant statements
- Low-cost proxies are used to reduce search cost
- Search experiments utilize 100 TPU V2 chips and run for ∼72h
- Five repeats of search experiments are used, followed by another round of search initializing from the best algorithm found thus far
Generalization: program selection and simplification
- Search experiments can discover promising programs on proxy tasks.
- Meta-overfitting occurs when the search fitness keeps growing, but the meta-validation metric declines.
- There is a gap between the proxy tasks during search and the target tasks.
- Simplification steps are used to arrive at the optimizer Lion.
Derivation
- Program 4 is obtained by automatically removing redundant statements from Program 8.
- Program 4 is further simplified to get the final algorithm (Lion) in Program 1.
- Unnecessary elements are removed from Program 4, including the cosh function, statements using arcsin and clip, and the three red statements.
- v only changes how the momentum is updated and does not need to be separately tracked.
- Bias correction is no longer needed, as it does not change the direction.
Analysis
- Lion algorithm produces uniform magnitude updates across all dimensions
- Sign operation adds noise to updates, which acts as regularization
- Lion has higher training error but higher accuracy on validation set
- Lion leads to smoother convergence and better generalization
- Lion has fewer hyperparameters than other optimizers
- Lion needs smaller learning rate and larger decoupled weight decay
- Lion has smaller memory footprint and faster runtime than AdamW
Image classification
- Performed experiments on image classification task
- Used various datasets and architectures
- Trained from scratch and pre-trained on two datasets
- ViT-L/16 matches ViT-H/14 results while being 2x smaller
- Advantage is larger on more challenging benchmarks
- ViT-G/14 trained by Lion outperforms previous results with 1.8x fewer parameters
Vision-language contrastive learning
- CLIP style vision-language contrastive training is discussed
- Image encoder is initialized with a pre-trained model
- Text encoder is trained in a contrastive manner
- Lion outperforms AdamW on two datasets
Diffusion model
- Diffusion models have been successful in image generation
- Lion is tested for unconditional image synthesis and multimodal text-to-image generation
- U-Net architecture introduced by Dhariwal and Nichol is utilized
Language modeling and fine-tuning
- Language modeling and fine-tuning can improve quality for AdamW and Lion
- Experiments conducted on two smaller-scale datasets
- Transformer spans three scales: small, medium, and large
- Lion consistently achieves lower validation perplexity than AdamW
- Larger-scale experiments conducted on 1.6 trillion tokens
- Lion outperforms Adafactor on NLG and NLU tasks
- Lion outperforms AdamW on GLUE benchmark
Comparison with other popular optimizers
- Employed four popular optimizers to train ViT-S/16 and ViT-B/16 on ImageNet
- Thoroughly tuned peak learning rate and decoupled weight decay of every optimizer
- Lion is best performing one, no clear winner amongst the baselines
- AMSGrad performs best on ViT-S/16, worst on ViT-B/16
- Learning curves of five adaptive optimizers are similar, Lion has unique one that learns faster
- Ablated effects of β 1 and β 2 , two optimizers created with β values of 0.9 and 0.99
- Ablated optimization algorithms perform worse than all five compared baselines
- Validated effectiveness and necessity of using two linear interpolation functions
- Optimal batch size for AdamW is 256, for Lion is 4,096
- Lion is more robust to different hyperparameter choices than AdamW
- Search space is inspired by popular first-order optimization algorithms
- Search cost is large, algorithm simplification requires manual intervention
- Program structure is simplistic, lacks functions for advanced second-order algorithms
- Lion requires larger batch size, but performance remains robust with small batch size
- Learning rate for Lion is typically 10x smaller than AdamW
- Decoupled weight decay for Lion is 10x larger than AdamW
- Evaluation limited to chosen tasks, performance gap narrows with strong augmentations
- Limitations of Lion include batch size and momentum tracking in bfloat16