Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Method to formulate algorithm discovery as program search
Leverage efficient search techniques to explore an infinite and sparse program space
Introduce program selection and simplification strategies
Discovered a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum)
Compares Lion with widely used optimizers
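Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT
On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing training compute by up to 2.3x
On autoregressive and masked language modeling and on fine-tuning, Lion exhibits similar or better performance compared to Adam
Lion's performance gain grows with training batch size
Lion requires a smaller learning rate than Adam, because the sign function produces updates with a larger norm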
Examines limitations of Lion

Paper Content

Introduction

Optimization algorithms are important for training neural networks
There are many handcrafted optimizers
AdamW and Adafactor are the standard optimizers for deep neural networks
Lion offers improved accuracy and training efficiency on vision tasks and similar or better performance on language modeling
Lion requires a smaller learning rate and larger decoupled weight decay than AdamW

Symbolic discovery of algorithms

Algorithm discovery is formulated as program search
Symbolic representation (programs) is used because algorithms must be implemented as programs anyway, and programs are easier to analyze, comprehend, and transfer to new tasks
Program length used to estimate complexity and select simpler, more generalizable programs
Method applicable to deep neural network training and other tasks

Program search space

Search space should be flexible to enable discovery of novel algorithms
Programs should be easy to analyze and incorporate into machine learning workflow
Focus on high-level algorithmic design rather than low-level implementation details
Programs contain functions operating over n-dimensional arrays
The train function takes the model weight, gradient, and learning rate schedule value as inputs
The train function outputs the update to the weight
Extra variables initialized as zeros to collect historical information
45 common math functions used
Mutations include inserting, deleting, and modifying statements
Search space is infinite and sparse
Even the best of 2M randomly searched programs on a low-cost proxy task is still inferior to AdamW
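To make the search space concrete, here is a minimal sketch in Python. It is not the paper's implementation. It assumes a program is a list of statements of the form (output, function, input, input), works on scalars instead of n-dimensional arrays, and resets the extra variables on every call instead of carrying them across steps. The four-entry function table, the variable names, and the `run` and `mutate` helpers are hypothetical stand-ins for the 45 functions and the mutation operators described above.

```python
import random

# Hypothetical stand-in for the 45 math functions (scalars for brevity).
# Every function takes two inputs; "sign" ignores its second one.
FUNCS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "sign": lambda a, _b: (a > 0) - (a < 0),
}
VARS = ["w", "g", "lr", "v1", "v2", "update"]
WRITABLE = ["v1", "v2", "update"]  # assumption: inputs are read-only

def run(program, w, g, lr):
    """Execute a statement list and return the resulting weight update."""
    # Extra variables start at zero, as in the search space described above.
    env = {"w": w, "g": g, "lr": lr, "v1": 0.0, "v2": 0.0, "update": 0.0}
    for out, fn, a, b in program:
        env[out] = FUNCS[fn](env[a], env[b])
    return env["update"]

def random_statement(rng):
    """Draw one random statement from the toy search space."""
    return (rng.choice(WRITABLE), rng.choice(sorted(FUNCS)),
            rng.choice(VARS), rng.choice(VARS))

def mutate(program, rng):
    """Insert, delete, or modify one statement; the parent is left unchanged."""
    child = list(program)
    op = rng.choice(["insert", "delete", "modify"])
    if op == "insert" or not child:
        child.insert(rng.randrange(len(child) + 1), random_statement(rng))
    elif op == "delete":
        del child[rng.randrange(len(child))]
    else:
        child[rng.randrange(len(child))] = random_statement(rng)
    return child

# Plain gradient descent written as a one-statement program: update = lr * g.
sgd = [("update", "mul", "lr", "g")]
child = mutate(sgd, random.Random(0))
```

Even in this toy version most random statements leave `update` untouched or useless, which hints at why the real space is described as sparse.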

Efficient search techniques

Regularized evolution is used to address the challenges posed by the infinite and sparse search space
Tournament selection is used to pick the best performer as the parent
The parent is then copied and mutated to produce a child algorithm
Evolutionary search is warm-started with AdamW to accelerate the search
Multiple restarts from the initial program are used to reduce variance in the search fitness
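Redundancies in the program space are pruned from three sources: programs with syntax or type errors, functionally equivalent programs, and redundant statements within programs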
Abstract execution is used to detect programs with errors and identify redundant statements
Low-cost proxies are used to reduce search cost
Search experiments utilize 100 TPU V2 chips and run for ∼72h
Five repeats of search experiments are used, followed by another round of search initializing from the best algorithm found thus far
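The search loop above can be sketched as follows. This is a generic regularized (aging) evolution loop with tournament selection and a warm start, written under stated assumptions: the function name, the population size, tournament size, and cycle count are illustrative values rather than the paper's settings, and the toy fitness replaces training on a proxy task.

```python
import random
from collections import deque

def regularized_evolution(initial, fitness, mutate, rng,
                          population_size=20, tournament_size=5, cycles=200):
    """Aging evolution: each cycle retires the oldest member, not the worst."""
    # Warm start: fill the population with copies of the initial program.
    population = deque((initial, fitness(initial)) for _ in range(population_size))
    best = population[0]
    for _ in range(cycles):
        # Tournament selection: sample a few members, keep the fittest as parent.
        contestants = rng.sample(list(population), tournament_size)
        parent, _ = max(contestants, key=lambda pair: pair[1])
        # Copy and mutate the parent to produce a child.
        child = mutate(parent, rng)
        scored = (child, fitness(child))
        population.append(scored)
        population.popleft()  # retire the oldest member
        if scored[1] > best[1]:
            best = scored
    return best

# Toy usage: "programs" are integers and fitness peaks at 42.
def toy_fitness(x):
    return -abs(x - 42)

def toy_mutate(x, rng):
    return x + rng.choice([-1, 1])

best_program, best_score = regularized_evolution(
    0, toy_fitness, toy_mutate, random.Random(0))
```

Retiring the oldest member instead of the worst one is what makes the evolution "regularized": a program survives only if its descendants keep winning tournaments.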

Generalization: program selection and simplification

Search experiments can discover promising programs on proxy tasks.
Meta-overfitting occurs when the search fitness keeps growing, but the meta-validation metric declines.
There is a gap between the proxy tasks during search and the target tasks.
Simplification steps are used to arrive at the optimizer Lion.
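The meta-overfitting pattern described above can be checked with a few lines of Python. This is a minimal sketch, assuming both curves are recorded once per candidate and that higher is better for both; the function name and the numbers are made up for illustration.

```python
def meta_overfitting_onset(search_fitness, meta_validation):
    """Return (index of best meta-validation score, whether search fitness
    kept improving after that index)."""
    if len(search_fitness) != len(meta_validation):
        raise ValueError("both curves need one value per candidate")
    best = max(range(len(meta_validation)), key=meta_validation.__getitem__)
    # Meta-overfitting: fitness on the proxy still rises past the point
    # where the meta-validation metric peaked.
    overfit = any(f > search_fitness[best] for f in search_fitness[best + 1:])
    return best, overfit

search_fitness = [0.50, 0.60, 0.70, 0.80, 0.90]
meta_validation = [0.48, 0.57, 0.63, 0.61, 0.55]
print(meta_overfitting_onset(search_fitness, meta_validation))  # (2, True)
```

In this toy run the candidate at index 2 would be selected, even though later candidates score higher on the proxy.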

Derivation

Program 4 is obtained by automatically removing redundant statements from Program 8.
Program 4 is further simplified to get the final algorithm (Lion) in Program 1.
Unnecessary elements are removed from Program 4, including the cosh function and the statements using arcsin and clip, while the three statements highlighted in red in the paper collapse into a single sign function.
v only changes how the momentum is updated and does not need to be separately tracked.
Bias correction is no longer needed, as it does not change the direction.
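The simplified algorithm can be written in a few lines. The sketch below uses plain Python lists instead of arrays and follows the published Lion update rule with its default β1 = 0.9 and β2 = 0.99; the function and argument names are chosen here for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(w, g, m, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion step over parallel lists of weights, gradients, and momentum."""
    new_w, new_m = [], []
    for wi, gi, mi in zip(w, g, m):
        # Interpolate momentum and gradient with beta1, then keep only the sign.
        update = sign(beta1 * mi + (1 - beta1) * gi)
        # Decoupled weight decay is added to the update before scaling by lr.
        new_w.append(wi - lr * (update + weight_decay * wi))
        # The stored momentum uses a second interpolation with beta2.
        new_m.append(beta2 * mi + (1 - beta2) * gi)
    return new_w, new_m

w, m = [1.0, -2.0, 0.5], [0.0, 0.0, 0.0]
g = [0.3, -4.0, 0.0]
w, m = lion_step(w, g, m, lr=0.1)
```

With zero weight decay, every coordinate with a nonzero gradient moves by exactly `lr`, regardless of how large its gradient is. Only the momentum `m` is carried between steps, and there is no second-moment estimate or bias correction.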

Analysis

Lion algorithm produces uniform magnitude updates across all dimensions
Sign operation adds noise to updates, which acts as regularization
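Lion has a higher training error than AdamW but higher accuracy on the validation set
Lion converges to smoother regions of the loss landscape, which usually results in better generalization
Lion has fewer hyperparameters than AdamW and Adafactor, since it needs no ε or factorization-related settings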
Lion needs smaller learning rate and larger decoupled weight decay
Lion has smaller memory footprint and faster runtime than AdamW
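The connection between uniform-magnitude updates and the smaller learning rate can be seen in a small numeric sketch. The gradient values below are made up; the point is that a sign update over d nonzero coordinates always has L2 norm sqrt(d), no matter how small or large the individual gradients are.

```python
import math

def l2_norm(xs):
    """Euclidean norm of a list of numbers."""
    return math.sqrt(sum(x * x for x in xs))

# Gradients spanning several orders of magnitude (illustrative values).
g = [1e-4, -3.0, 0.02, 250.0]
sign_update = [(x > 0) - (x < 0) for x in g]

print(sign_update)           # [1, -1, 1, 1]
print(l2_norm(sign_update))  # 2.0
```

Because this norm grows with the number of parameters, the step size has to be controlled by the learning rate, which is consistent with Lion needing a smaller one than AdamW.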

Image classification

Performed experiments on image classification task
Used various datasets and architectures
Trained from scratch on ImageNet and pre-trained on two larger datasets, ImageNet-21K and JFT
ViT-L/16 trained by Lion matches previous ViT-H/14 results trained by AdamW while being 2x smaller
Advantage is larger on more challenging benchmarks
ViT-g/14 trained by Lion outperforms previous ViT-G/14 results with 1.8x fewer parameters

Vision-language contrastive learning

CLIP style vision-language contrastive training is discussed
Image encoder is initialized with a pre-trained model
Text encoder is trained in a contrastive manner
Lion outperforms AdamW on two datasets

Diffusion model

Diffusion models have been successful in image generation
Lion is tested for unconditional image synthesis and multimodal text-to-image generation
U-Net architecture introduced by Dhariwal and Nichol is utilized

Language modeling and fine-tuning

On language modeling and fine-tuning, tuning β1 and β2 can improve quality for both AdamW and Lion
Experiments conducted on two smaller-scale datasets
Transformer spans three scales: small, medium, and large
Lion consistently achieves lower validation perplexity than AdamW
Larger-scale experiments conducted on 1.6 trillion tokens
Lion outperforms Adafactor on NLG and NLU tasks
Lion outperforms AdamW on GLUE benchmark

Comparison with other popular optimizers

Employed four more popular optimizers (RAdam, NAdam, AdaBelief, and AMSGrad) alongside AdamW to train ViT-S/16 and ViT-B/16 on ImageNet
Thoroughly tuned peak learning rate and decoupled weight decay of every optimizer
Lion is the best-performing optimizer, with no clear winner among the baselines
AMSGrad performs best on ViT-S/16 but worst on ViT-B/16
Learning curves of the five adaptive optimizers are similar, while Lion has a distinct curve that learns faster
Ablated the effects of β1 and β2 with two ablated optimizers that use a single β of 0.9 and 0.99, respectively
Ablated optimization algorithms perform worse than all five compared baselines
Validated effectiveness and necessity of using two linear interpolation functions
Optimal batch size for AdamW is 256, for Lion is 4,096
Lion is more robust to different hyperparameter choices than AdamW
Search space is inspired by popular first-order optimization algorithms
Search cost is large, algorithm simplification requires manual intervention
Program structure is simplistic, lacks functions for advanced second-order algorithms
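Lion prefers a larger batch size than AdamW, but its performance remains robust with small batch sizes
A suitable learning rate for Lion is typically 3-10x smaller than for AdamW
Decoupled weight decay for Lion is correspondingly 3-10x larger than for AdamW, which keeps the effective strength (learning rate × weight decay) similar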
Evaluation limited to chosen tasks, performance gap narrows with strong augmentations
Limitations of Lion include no gain over AdamW at small batch sizes (below 64) and the need for momentum tracking in bfloat16, which can be expensive for giant models
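The learning rate and weight decay rules of thumb above can be turned into a starting point for tuning. The helper below is a hypothetical sketch, not part of any library: it assumes the same factor is applied to both values, so the product of learning rate and weight decay stays unchanged. The factor itself is a guess to be tuned per task.

```python
def adamw_to_lion_hparams(adamw_lr, adamw_weight_decay, factor=10.0):
    """Hypothetical helper: divide lr and multiply weight decay by one factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    lion_lr = adamw_lr / factor
    lion_weight_decay = adamw_weight_decay * factor
    return lion_lr, lion_weight_decay

# Example AdamW settings (illustrative values, not taken from the paper).
lion_lr, lion_wd = adamw_to_lion_hparams(1e-3, 0.1, factor=10.0)
```

Keeping the product constant preserves the per-step shrinkage applied to the weights, since decoupled weight decay is scaled by the learning rate in the update.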
