Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Scaling up language models has led to performance gains
- Little is understood about how training dynamics change with model size
- Analyzed intermediate training checkpoints of differently sized models
- At a given perplexity, similar subset of training tokens see most significant reduction in loss
- Early in training, all models learn to reduce perplexity of grammatical sequences with hallucinations
- Perplexity is a strong predictor of in-context learning performance
Paper Content
Introduction
- Scaling up language models improves language modeling perplexity and zero-or few-shot end task accuracies
- Little is understood about why or how this happens
- Study training trajectories of differently-sized OPT models
- Analyze three aspects of model performance: next-token prediction, sequence-level generation, and downstream task performance
- Find that language modeling perplexity correlates well with few-shot in-context learning performance
Experimental settings
- OPT models used in experiments
- Validation perplexity used to measure autoregressive language modeling
- Validation set covers a wide range of domains
- Trajectory of validation perplexity follows a power-law pattern
Next-token prediction
- Autoregressive language models are used to predict the next token given a context.
- Validation perplexity decreases as training progresses.
- This section studies the trajectory of next-token predictions, divided into three categories.
Methodology
- Combines signals from two models to generate texts
- Decoding process favors small model’s prediction and suppresses large model’s prediction
- Removes tokens with negative score and renormalizes distribution
- Decodes sequences with two models, 125M and 30B
- Generates 50 tokens conditioned on 5 tokens of validation documents
- Measures text perplexity at final and intermediate checkpoints
Analysis
- Perplexity of texts generated with the p s − p t configuration increases as model size increases
- Phenomenon is universal across other model families
- Perplexity trajectory of texts generated with different configurations largely differ from each other
- Small models are highly capable linguistically, and learning at scale primarily focuses on acquiring other types of knowledge
Manual design
- Hypothesized that injecting noise into human texts might reverse scaling trend
- Contrary to hypothesis, downward trends largely retain across all noise levels
- Perplexity of correct and incorrect options of multiple-choice tasks decreases as model size increases
- Initial attempt failed, not able to manually construct texts that are more probable in smaller models than larger models
Downstream tasks
- Examined trajectory of downstream tasks evaluated on few-shot in-context learning
- Used average 2-shot accuracy of downstream tasks as a proxy for in-context learning capability
Trajectory of icl performance
- Smaller models consistently outperform larger models when given the same amount of training FLOPs
- Downstream task performance correlates well with validation perplexity for all model sizes
Trajectory of option perplexity
- 12 tasks present linearity scaling pattern, 6 tasks present breakthroughness scaling pattern
- Performance of breakthroughness tasks increases tremendously when validation perplexity drops below 8
- Accuracy of linearity tasks gradually increases
- Improvements in downstream accuracy not driven by model assigning lower probability to incorrect candidates, but by perplexity divergence of correct and incorrect options
Related work
Conclusion
- Validation perplexity serves as a strong indicator of OPT models’ behavior.
- Large models reproduce small models’ behavior and unlock new capabilities.
- Edge cases exist where models behave differently.
- Techniques proposed can be extended to analyze language models trained with different resources and procedures.
- Limitation of work is that it only analyzes models pre-trained with same data, similar training procedures, and same autoregressive language modeling objective.
- Downstream task evaluation only done on multiple-choice tasks.
- No concrete explanation for double-descent behavior during pre-training.
- Subset of tokens that present an upward trend when selected by models of other sizes from the main paper.
- Consistent occurrence of double-descent behavior along the trajectory.
- Subset of tokens must embody certain intrinsic language properties.
- Measurements plotted against FLOPs and validation perplexity.
- At same level of validation perplexity, predictions for tokens similar across model scales.
- Edge cases where underlying distributions of models at same level of perplexity differ.
- Corrupt texts from opensubtitle subset of validation set by replacing p% tokens.
- Larger language models better at exploiting random patterns than smaller counterparts.
- Decoding approach similar to contrastive decoding method.
- Subtraction in probability space eliminates false positive cases.