Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Scaling up language models has led to performance gains
  • Little is understood about how training dynamics change with model size
  • Analyzed intermediate training checkpoints of differently sized models
  • At a given perplexity, a similar subset of training tokens sees the most significant reduction in loss, regardless of model size
  • Early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations
  • Perplexity is a strong predictor of in-context learning performance

Paper Content

Introduction

  • Scaling up language models improves language modeling perplexity and zero- or few-shot end-task accuracy
  • Little is understood about why or how this happens
  • Study training trajectories of differently-sized OPT models
  • Analyze three aspects of model performance: next-token prediction, sequence-level generation, and downstream task performance
  • Find that language modeling perplexity correlates well with few-shot in-context learning performance

Experimental settings

  • OPT models used in experiments
  • Validation perplexity used to measure autoregressive language modeling
  • Validation set covers a wide range of domains
  • Trajectory of validation perplexity follows a power-law pattern
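To make the power-law observation concrete, here is a minimal sketch that fits ppl ≈ a · x^(−b) by linear regression in log-log space. The function name `fit_power_law` and the data points are hypothetical and synthetic; they are not the paper's measurements, and the x-axis could equally be training tokens or FLOPs.

```python
import numpy as np

def fit_power_law(x, ppl):
    """Fit ppl ≈ a * x**(-b) by least squares in log-log space; return (a, b)."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_p = np.log(np.asarray(ppl, dtype=float))
    # A power law is a straight line in log-log space: log p = log a - b * log x
    slope, intercept = np.polyfit(log_x, log_p, deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic trajectory (made-up numbers, for illustration only)
steps = np.array([1e3, 1e4, 1e5, 1e6])
ppl = 200.0 * steps ** -0.2
a, b = fit_power_law(steps, ppl)
```

A trajectory that bends away from a straight line in log-log space would indicate that a pure power law is not a good description, for example when an irreducible loss term is present.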

Next-token prediction

  • Autoregressive language models are used to predict the next token given a context.
  • Validation perplexity decreases as training progresses.
  • This section studies the trajectories of next-token predictions, grouping tokens into three categories according to how their perplexity trends.
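For reference, perplexity over a set of tokens is the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes natural-log token probabilities are already available; the helper name `perplexity` is illustrative and not taken from the paper's code.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log p), with natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.25 to every token has perplexity 4
uniform = [math.log(0.25)] * 8
uniform_ppl = perplexity(uniform)
```

Tracking this quantity per token across checkpoints, rather than averaged over the whole validation set, is what allows individual token trajectories to be compared.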

Methodology

  • Combines signals from two models to generate texts
  • Decoding process favors small model’s prediction and suppresses large model’s prediction
  • Removes tokens with negative score and renormalizes distribution
  • Decodes sequences with two models, 125M and 30B
  • Generates 50 tokens conditioned on 5 tokens of validation documents
  • Measures text perplexity at final and intermediate checkpoints
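The decoding steps above (subtract one model's distribution from the other's, drop tokens with negative score, renormalize) can be sketched for a single step as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `contrast_step`, the toy three-token vocabulary, the greedy argmax choice, and the fallback when no token has a positive score are all assumptions.

```python
import numpy as np

def contrast_step(p_favored, p_suppressed):
    """One decoding step: subtract, remove non-positive scores, renormalize."""
    p_favored = np.asarray(p_favored, dtype=float)
    score = p_favored - np.asarray(p_suppressed, dtype=float)
    score = np.where(score > 0.0, score, 0.0)  # tokens with negative score are removed
    total = score.sum()
    if total == 0.0:
        # Assumption: fall back to the favored model if nothing survives
        return p_favored
    return score / total

# Toy next-token distributions over a 3-token vocabulary (made-up numbers)
p_small = np.array([0.5, 0.3, 0.2])
p_large = np.array([0.2, 0.5, 0.3])

dist = contrast_step(p_small, p_large)  # favors the small model
next_token = int(np.argmax(dist))       # greedy choice (assumption)
```

Swapping the two arguments gives the opposite configuration, which favors the large model and suppresses the small one.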

Analysis

  • Perplexity of texts generated with the p_s − p_t configuration (favoring the small model and suppressing the large one) increases as model size increases
  • Phenomenon is universal across other model families
  • Perplexity trajectories of texts generated with different configurations differ substantially from one another
  • Small models are highly capable linguistically, and learning at scale primarily focuses on acquiring other types of knowledge

Manual design

  • Hypothesized that injecting noise into human texts might reverse scaling trend
  • Contrary to the hypothesis, the downward trends largely persist across all noise levels
  • Perplexity of correct and incorrect options of multiple-choice tasks decreases as model size increases
  • The initial attempt failed: it was not possible to manually construct texts that are more probable under smaller models than under larger ones
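A noise-injection experiment of this kind can be sketched as replacing a fixed percentage of token positions in a text. The sketch below assumes replacements are drawn uniformly at random from the vocabulary, which is an assumption rather than something stated here; the function name `corrupt_tokens`, the seed, and the toy token ids are hypothetical.

```python
import random

def corrupt_tokens(tokens, pct, vocab_size, seed=0):
    """Replace pct% of positions with token ids drawn uniformly from the vocabulary."""
    rng = random.Random(seed)
    n_replace = round(len(tokens) * pct / 100)
    positions = rng.sample(range(len(tokens)), n_replace)  # distinct positions
    noisy = list(tokens)
    for i in positions:
        noisy[i] = rng.randrange(vocab_size)
    return noisy, sorted(positions)

# Toy token ids (made up); the replacement ids come from a disjoint range here
original = list(range(100, 120))
noisy, changed = corrupt_tokens(original, pct=20, vocab_size=50)
```

Scoring `noisy` with models of different sizes at several values of `pct` would then show whether the perplexity ordering between small and large models ever flips.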

Downstream tasks

  • Examined trajectory of downstream tasks evaluated on few-shot in-context learning
  • Used average 2-shot accuracy of downstream tasks as a proxy for in-context learning capability

Trajectory of ICL performance

  • Smaller models consistently outperform larger models when given the same amount of training FLOPs
  • Downstream task performance correlates well with validation perplexity for all model sizes

Trajectory of option perplexity

  • 12 tasks show a linearity scaling pattern; 6 tasks show a breakthroughness scaling pattern
  • Performance on breakthroughness tasks rises sharply once validation perplexity drops below 8
  • Accuracy on linearity tasks increases gradually
  • Improvements in downstream accuracy are driven not by the model assigning lower probability to incorrect candidates, but by the perplexities of correct and incorrect options diverging
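To illustrate how option perplexity relates to multiple-choice accuracy, the sketch below scores each candidate by its perplexity and predicts the lowest one. It assumes per-token natural-log probabilities for every option are already available and that the lowest-perplexity option is chosen; that scoring rule, the function names, and the numbers are assumptions for illustration, not the paper's evaluation code.

```python
import math

def option_perplexity(logprobs):
    """Perplexity of one candidate answer from its per-token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def predict(option_logprobs):
    """Return (index of the lowest-perplexity option, list of option perplexities)."""
    ppls = [option_perplexity(lp) for lp in option_logprobs]
    best = min(range(len(ppls)), key=ppls.__getitem__)
    return best, ppls

def accuracy(examples):
    """examples: list of (option_logprobs, gold_index) pairs."""
    correct = sum(predict(options)[0] == gold for options, gold in examples)
    return correct / len(examples)

# Made-up log-probabilities for three two-option questions
examples = [
    ([[-0.5, -0.7], [-1.5, -2.0, -1.0]], 0),
    ([[-2.0, -2.2], [-0.3, -0.4]], 1),
    ([[-0.2, -0.1], [-0.9, -1.1]], 1),
]
acc = accuracy(examples)
```

Under this rule, accuracy depends only on the gap between the correct and incorrect options' perplexities, so both can fall during training while accuracy still improves as long as the correct option falls faster.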

Conclusion

  • Validation perplexity serves as a strong indicator of OPT models’ behavior.
  • Large models reproduce small models’ behavior and unlock new capabilities.
  • Edge cases exist where models behave differently.
  • Techniques proposed can be extended to analyze language models trained with different resources and procedures.
  • A limitation of this work is that it only analyzes models pre-trained on the same data, with similar training procedures and the same autoregressive language modeling objective.
  • Downstream task evaluation covers only multiple-choice tasks.
  • No concrete explanation is given for the double-descent behavior during pre-training.
  • The subset of tokens that presents an upward trend is also examined when selected by models of sizes other than those in the main paper.
  • Double-descent behavior occurs consistently along the trajectory.
  • This subset of tokens must embody certain intrinsic language properties.
  • Measurements are plotted against both FLOPs and validation perplexity.
  • At the same level of validation perplexity, token predictions are similar across model scales.
  • Edge cases exist where the underlying distributions of models at the same perplexity level differ.
  • Texts from the OpenSubtitles subset of the validation set are corrupted by replacing p% of tokens.
  • Larger language models are better at exploiting random patterns than their smaller counterparts.
  • The decoding approach is similar to the contrastive decoding method.
  • Subtraction in probability space eliminates false positive cases.