Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Method to formulate algorithm discovery as program search
  • Leverage efficient search techniques to explore an infinite and sparse program space
  • Introduce program selection and simplification strategies
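  • Discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum)
  • Compares Lion with widely used optimizers such as Adam and Adafactor
  • On image classification, Lion boosts ViT accuracy on ImageNet by up to 2% and saves up to 5x the pre-training compute on JFT
  • On diffusion models, Lion outperforms Adam, with a better FID score and up to 2.3x less training compute
  • On language modeling and fine-tuning, Lion performs similarly to or better than Adam
  • Lion's performance gain grows with the training batch size
  • Lion requires a smaller learning rate than Adam because the sign function produces updates with a larger norm
  • Examines the limitations of Lion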

Paper Content

Introduction

  • Optimization algorithms are important for training neural networks
  • There are many handcrafted optimizers
  • AdamW and Adafactor are the standard optimizers for deep neural networks
  • Lion improves accuracy and training efficiency over these optimizers on vision, vision-language, diffusion, and language modeling tasks
  • Lion requires a smaller learning rate and larger decoupled weight decay

Symbolic discovery of algorithms

  • Algorithm discovery is formulated as program search
  • Symbolic representation (programs) used for advantages such as implementation, analysis, comprehension and transferability
  • Program length used to estimate complexity and select simpler, more generalizable programs
  • Method applicable to deep neural network training and other tasks

Program search space

  • Search space should be flexible to enable discovery of novel algorithms
  • Programs should be easy to analyze and incorporate into machine learning workflow
  • Focus on high-level algorithmic design rather than low-level implementation details
  • Programs contain functions operating over n-dimensional arrays
  • Train function has inputs of model weight, gradient, and learning rate schedule value
  • Train function has outputs of update to weight
  • Extra variables initialized as zeros to collect historical information
  • 45 common math functions used
  • Mutations include inserting, deleting, and modifying statement
  • Search space is infinite and sparse
  • Random search of 2M programs on low-cost proxy task still inferior to AdamW
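
As a rough illustration of the program format described above, the sketch below writes an AdamW-like update as a `train` function that takes the weight, the gradient, and the learning rate schedule value, keeps two extra variables initialized to zeros, and returns the update. The helper `interp`, the constants 0.9, 0.999, and 0.01, the `eps` term, and the use of plain Python lists in place of n-dimensional arrays are assumptions made for this example, not details stated in this summary.

```python
import math


def interp(x, y, a):
    """Linear interpolation between x and y with weight a on y."""
    return (1 - a) * x + a * y


def train(w, g, m, v, lr, eps=1e-8):
    """AdamW-like program: returns the weight update and the new extra variables.

    w, g, m, v are equal-length lists of floats; m and v start as zeros and
    collect historical information. eps is an assumption added here to avoid
    division by zero.
    """
    update, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = interp(gi, mi, 0.9)          # running average of the gradient
        vi = interp(gi * gi, vi, 0.999)   # running average of the squared gradient
        step = mi / (math.sqrt(vi) + eps) + 0.01 * wi  # adaptive step plus weight decay
        update.append(lr * step)
        new_m.append(mi)
        new_v.append(vi)
    return update, new_m, new_v


# Example call with a single weight and zero-initialized extra variables.
update, m, v = train(w=[1.0], g=[0.5], m=[0.0], v=[0.0], lr=0.001)
```

A search procedure would then mutate the statements inside such a function (insert, delete, or modify) rather than tune its constants alone.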

Efficient search techniques

  • Regularized evolution is used to address the challenges posed by the infinite and sparse search space
  • Tournament selection is used to pick the best performer as the parent
  • The parent is then copied and mutated to produce a child algorithm
  • Evolutionary search is warm-started with AdamW to accelerate the search
  • Multiple restarts from the initial program are used to reduce variance in the search fitness
  • Three sources of redundancy in the program space are pruned: programs with syntax or type errors, functionally equivalent programs, and redundant statements
  • Abstract execution is used to detect programs with errors and identify redundant statements
  • Low-cost proxies are used to reduce search cost
  • Search experiments use 100 TPU v2 chips and run for about 72 hours
  • Five repeats of search experiments are used, followed by another round of search initializing from the best algorithm found thus far
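
The selection loop described above can be sketched as follows. This is a toy version of regularized evolution under stated assumptions: candidates are single floats instead of programs, `toy_fitness` and `toy_mutate` are hypothetical stand-ins for proxy-task evaluation and statement mutation, and the population size, tournament size, and cycle count are arbitrary. Restarts, pruning, and abstract execution are omitted.

```python
import random
from collections import deque


def regularized_evolution(init, fitness, mutate, population_size=20,
                          tournament_size=5, cycles=200, seed=0):
    """Return the best (candidate, fitness) pair seen during the search."""
    rng = random.Random(seed)
    # Warm start: the whole population begins as copies of the initial candidate.
    population = deque((init, fitness(init)) for _ in range(population_size))
    best = population[0]
    for _ in range(cycles):
        # Tournament selection: sample a few members, pick the fittest as parent.
        tournament = rng.sample(list(population), tournament_size)
        parent, _ = max(tournament, key=lambda entry: entry[1])
        # Copy and mutate the parent to produce a child.
        child = mutate(parent, rng)
        entry = (child, fitness(child))
        # Regularization: add the child and remove the oldest member.
        population.append(entry)
        population.popleft()
        if entry[1] > best[1]:
            best = entry
    return best


def toy_fitness(x):
    """Hypothetical fitness with its maximum at x = 3."""
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    """Hypothetical mutation: perturb the candidate with Gaussian noise."""
    return x + rng.gauss(0.0, 0.5)


best_candidate, best_fitness = regularized_evolution(0.0, toy_fitness, toy_mutate)
```

Removing the oldest member instead of the worst one is what distinguishes regularized evolution from plain tournament-based evolution.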

Generalization: program selection and simplification

  • Search experiments can discover promising programs on proxy tasks.
  • Meta-overfitting occurs when the search fitness keeps growing, but the meta-validation metric declines.
  • There is a gap between the proxy tasks during search and the target tasks.
  • Simplification steps are used to arrive at the optimizer Lion.
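
One simple way to act on the meta-overfitting signal above is to keep the search checkpoint with the best meta-validation metric and flag any later fitness gain that comes with a worse meta-validation score. The sketch below assumes both metrics are "higher is better" and recorded at the same checkpoints. The function name and the example numbers are hypothetical, and this is not claimed to be the selection rule used in the paper.

```python
def select_checkpoint(search_fitness, meta_validation):
    """Return (best_index, meta_overfit) for two equal-length metric histories.

    best_index is the checkpoint with the highest meta-validation metric.
    meta_overfit is True if some later checkpoint has higher search fitness
    but lower meta-validation than the selected one.
    """
    if len(search_fitness) != len(meta_validation) or not search_fitness:
        raise ValueError("histories must be non-empty and of equal length")
    best_index = max(range(len(meta_validation)), key=lambda i: meta_validation[i])
    meta_overfit = any(
        search_fitness[i] > search_fitness[best_index]
        and meta_validation[i] < meta_validation[best_index]
        for i in range(best_index + 1, len(search_fitness))
    )
    return best_index, meta_overfit


# Hypothetical histories: fitness keeps growing while meta-validation peaks early.
index, overfit = select_checkpoint([0.1, 0.2, 0.3, 0.4], [0.10, 0.25, 0.20, 0.15])
```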

Derivation

  • Program 4 is obtained by automatically removing redundant statements from Program 8.
  • Program 4 is further simplified to get the final algorithm (Lion) in Program 1.
  • Unnecessary elements are removed from Program 4, including the cosh function, statements using arcsin and clip, and the three red statements.
  • v only changes how the momentum is updated and does not need to be separately tracked.
  • Bias correction is no longer needed, as it does not change the direction.
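
The simplified update can be sketched as below: the update direction is the sign of an interpolation between the gradient and the momentum, and the momentum is then updated with a second interpolation. The function names, the list-based representation, and the default values `beta1=0.9` and `beta2=0.99` are assumptions for this example, so treat it as an illustration, not a reference implementation.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def interp(x, y, a):
    """Linear interpolation between x and y with weight a on y."""
    return (1 - a) * x + a * y


def lion_step(w, g, m, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Apply one Lion-style step to lists of weights, gradients, and momenta."""
    new_w, new_m = [], []
    for wi, gi, mi in zip(w, g, m):
        # Direction: sign of the interpolation between gradient and momentum.
        direction = sign(interp(gi, mi, beta1))
        # Decoupled weight decay is added to the signed direction before scaling.
        new_w.append(wi - lr * (direction + weight_decay * wi))
        # Momentum is the only optimizer state and uses a separate factor.
        new_m.append(interp(gi, mi, beta2))
    return new_w, new_m


# Gradients of very different scale produce steps of the same magnitude (lr).
weights, momentum = lion_step([1.0, -2.0, 0.5], [0.3, -100.0, 0.0],
                              [0.0, 0.0, 0.0], lr=0.1)
```

Because only the sign is kept, every nonzero coordinate moves by exactly `lr` (before weight decay), which is why no bias correction or second-moment estimate is needed.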

Analysis

  • Lion algorithm produces uniform magnitude updates across all dimensions
  • Sign operation adds noise to updates, which acts as regularization
  • Compared to AdamW, Lion reaches a higher training error but a higher accuracy on the validation set
  • Lion leads to smoother convergence and better generalization
  • Lion has fewer hyperparameters than other optimizers
  • Lion needs smaller learning rate and larger decoupled weight decay
  • Lion has smaller memory footprint and faster runtime than AdamW

Image classification

  • Performed experiments on image classification task
  • Used various datasets and architectures
  • Trained from scratch and pre-trained on two datasets
  • ViT-L/16 trained by Lion matches the results of ViT-H/14 trained by AdamW while being 2x smaller
  • Lion's advantage is larger on more challenging benchmarks
  • ViT-g/14 trained by Lion outperforms the previous ViT-G/14 results with 1.8x fewer parameters

Vision-language contrastive learning

  • CLIP style vision-language contrastive training is discussed
  • Image encoder is initialized with a pre-trained model
  • Text encoder is trained in a contrastive manner
  • Lion outperforms AdamW on two datasets

Diffusion model

  • Diffusion models have been successful in image generation
  • Lion is tested for unconditional image synthesis and multimodal text-to-image generation
  • U-Net architecture introduced by Dhariwal and Nichol is utilized

Language modeling and fine-tuning

  • On language tasks, tuning β1 and β2 can improve quality for both AdamW and Lion
  • Initial experiments are conducted on two smaller-scale datasets, Wiki-40B and PG-19
  • Transformer spans three scales: small, medium, and large
  • Lion consistently achieves lower validation perplexity than AdamW
  • Larger-scale experiments conducted on 1.6 trillion tokens
  • Lion outperforms Adafactor on NLG and NLU tasks
  • Lion outperforms AdamW on GLUE benchmark

Comparison with other popular optimizers

  • Four popular handcrafted optimizers (RAdam, NAdam, AdaBelief, and AMSGrad) are also used to train ViT-S/16 and ViT-B/16 on ImageNet
  • The peak learning rate and decoupled weight decay of every optimizer are thoroughly tuned
  • Lion performs best, with no clear winner among the baselines
  • AMSGrad performs best on ViT-S/16 but worst on ViT-B/16
  • The learning curves of the five adaptive optimizers are similar, while Lion has a distinct curve that learns faster

Ablations

  • To ablate the effects of β1 and β2, two simplified optimizers are created that use a single β of 0.9 and 0.99, respectively
  • Both ablated optimizers perform worse than all five compared baselines
  • Validated effectiveness and necessity of using two linear interpolation functions
  • Optimal batch size for AdamW is 256, for Lion is 4,096
  • Lion is more robust to different hyperparameter choices than AdamW

Limitations

  • Search space is inspired by popular first-order optimization algorithms
  • Search cost is large, algorithm simplification requires manual intervention
  • Program structure is simplistic, lacks functions for advanced second-order algorithms
  • Lion prefers a larger batch size, though its performance remains robust with a small batch size
  • A suitable learning rate for Lion is typically 3-10x smaller than for AdamW
  • The decoupled weight decay for Lion is correspondingly 3-10x larger than for AdamW, which keeps the effective strength (learning rate times weight decay) similar
  • Evaluation is limited to the chosen tasks, and the performance gap narrows with strong augmentations
  • Lion still needs momentum tracking in bfloat16, which can be expensive when training very large models
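
The learning-rate and weight-decay bullets above suggest a simple starting point when porting an AdamW configuration to Lion: divide the learning rate by some factor and multiply the decoupled weight decay by the same factor, so that their product stays unchanged. The helper below is a hypothetical convenience function. Its default `factor=10.0` and the example AdamW values are assumptions for illustration, and the result is a starting point for tuning, not a recommended setting.

```python
def adamw_to_lion(lr, weight_decay, factor=10.0):
    """Scale AdamW hyperparameters into a starting guess for Lion.

    The learning rate shrinks by `factor` and the decoupled weight decay
    grows by `factor`, so lr * weight_decay is preserved.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return lr / factor, weight_decay * factor


# Hypothetical AdamW settings used only to show the arithmetic.
lion_lr, lion_wd = adamw_to_lion(lr=1e-3, weight_decay=0.1)
```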