Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Scaling up language models has led to performance gains
Little is understood about how training dynamics change with model size
Analyzed intermediate training checkpoints of differently sized models
At a given perplexity, similar subset of training tokens see most significant reduction in loss
Early in training, all models learn to reduce perplexity of grammatical sequences with hallucinations
Perplexity is a strong predictor of in-context learning performance

Paper Content

Introduction

Scaling up language models improves language modeling perplexity and zero-or few-shot end task accuracies
Little is understood about why or how this happens
Study training trajectories of differently-sized OPT models
Analyze three aspects of model performance: next-token prediction, sequence-level generation, and downstream task performance
Find that language modeling perplexity correlates well with few-shot in-context learning performance

Experimental settings

OPT models used in experiments
Validation perplexity used to measure autoregressive language modeling
Validation set covers a wide range of domains
Trajectory of validation perplexity follows a power-law pattern

Next-token prediction

Autoregressive language models are used to predict the next token given a context.
Validation perplexity decreases as training progresses.
This section studies the trajectory of next-token predictions, divided into three categories.

Methodology

Combines signals from two models to generate texts
Decoding process favors small model’s prediction and suppresses large model’s prediction
Removes tokens with negative score and renormalizes distribution
Decodes sequences with two models, 125M and 30B
Generates 50 tokens conditioned on 5 tokens of validation documents
Measures text perplexity at final and intermediate checkpoints

Analysis

Perplexity of texts generated with the p s − p t configuration increases as model size increases
Phenomenon is universal across other model families
Perplexity trajectory of texts generated with different configurations largely differ from each other
Small models are highly capable linguistically, and learning at scale primarily focuses on acquiring other types of knowledge

Manual design

Hypothesized that injecting noise into human texts might reverse scaling trend
Contrary to hypothesis, downward trends largely retain across all noise levels
Perplexity of correct and incorrect options of multiple-choice tasks decreases as model size increases
Initial attempt failed, not able to manually construct texts that are more probable in smaller models than larger models

Downstream tasks

Examined trajectory of downstream tasks evaluated on few-shot in-context learning
Used average 2-shot accuracy of downstream tasks as a proxy for in-context learning capability

Trajectory of icl performance

Smaller models consistently outperform larger models when given the same amount of training FLOPs
Downstream task performance correlates well with validation perplexity for all model sizes

Trajectory of option perplexity

12 tasks present linearity scaling pattern, 6 tasks present breakthroughness scaling pattern
Performance of breakthroughness tasks increases tremendously when validation perplexity drops below 8
Accuracy of linearity tasks gradually increases
Improvements in downstream accuracy not driven by model assigning lower probability to incorrect candidates, but by perplexity divergence of correct and incorrect options

Conclusion

Validation perplexity serves as a strong indicator of OPT models’ behavior.
Large models reproduce small models’ behavior and unlock new capabilities.
Edge cases exist where models behave differently.
Techniques proposed can be extended to analyze language models trained with different resources and procedures.
Limitation of work is that it only analyzes models pre-trained with same data, similar training procedures, and same autoregressive language modeling objective.
Downstream task evaluation only done on multiple-choice tasks.
No concrete explanation for double-descent behavior during pre-training.
Subset of tokens that present an upward trend when selected by models of other sizes from the main paper.
Consistent occurrence of double-descent behavior along the trajectory.
Subset of tokens must embody certain intrinsic language properties.
Measurements plotted against FLOPs and validation perplexity.
At same level of validation perplexity, predictions for tokens similar across model scales.
Edge cases where underlying distributions of models at same level of perplexity differ.
Corrupt texts from opensubtitle subset of validation set by replacing p% tokens.
Larger language models better at exploiting random patterns than smaller counterparts.
Decoding approach similar to contrastive decoding method.
Subtraction in probability space eliminates false positive cases.

Link to paper#

Abstract#

Paper Content#

Introduction#

Experimental settings#

Next-token prediction#

Methodology#

Analysis#

Manual design#

Downstream tasks#

Trajectory of icl performance#

Trajectory of option perplexity#

Related work#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Experimental settings

Next-token prediction

Methodology

Analysis

Manual design

Downstream tasks

Trajectory of icl performance

Trajectory of option perplexity

Related work

Conclusion