Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed Quantization Model explains power law dropoff of loss with model and data size
- Quantization Hypothesis states that learned network capabilities are quantized into discrete chunks
- Power law in use frequencies explains observed power law scaling of loss
- Validated prediction on toy datasets and studied scaling curves for large language models
Paper Content
Introduction
- Larger neural networks trained on more data perform better than smaller neural networks trained on less data
- Mean test loss decreases as a power law in both the number of network parameters and the number of training samples
- Larger models often have emergent abilities, i.e. unexpected and qualitatively different behavior than smaller models
- Understanding both facets of scaling is relevant to the future of deep learning
- Recent studies of the internal workings of neural networks have found a variety of impressive algorithms learned by gradient descent
- A natural question is whether such circuits are learned universally across models with different random initializations and across scales
- The task of mechanistic interpretability may help to understand what deep neural networks are doing internally
- The Quantization Hypothesis suggests that model performance is determined by which of these computations are successfully learned
- A power law distribution over subtasks in data may lead to a power law marginal improvement in loss from learning additional quanta
- If correct, the Quantization Hypothesis could have many implications for understanding neural networks
Theory
- Modeling the distribution of text on the internet requires knowledge and diverse computations
- Prediction of text requires computations to be present in models
- Quantization Hypothesis: many natural prediction problems involve a discrete set of computations which are natural to learn and instrumental for reducing loss
- Some abilities are more useful for reducing loss than others, leading to a natural ordering of the quanta
- Scaling performance is determined by how many quanta are successfully learned
- Quanta frequencies follow a power law
- Loss is a function of quanta learned
- Quanta are either monogenic or polygenic
- Parameter scaling: network capacity is a bottleneck, number of quanta learned is proportional to model size
- Data scaling: threshold of examples needed for quantum to be learned, power law distribution over quanta produces power law data scaling
- Multi-epoch training: rate of convergence of SGD can bottleneck performance
- Single-epoch training: number of quanta learned is proportional to number of training steps
- Prior work: models of power law scaling w.r.t. model parameters, data feature-feature covariance matrix, features seen during training
Proof of concept: a toy dataset
- Toy dataset consists of distinct subtasks with power law distribution in frequency
- Power law neural scaling observed in data and parameters
- Mechanism of neural scaling coincides with theory from Section 2
- Possible for power law neural scaling to arise from Quantization Model
The “multitask sparse parity” dataset
- The toy task consists of many subtasks, each of which is a variant of the “sparse parity” problem.
- The sparse parity prediction problem is to compute the parity of a fixed subset of bits from a bit string of length n.
- The multitask sparse parity task adds an additional parameter, the number of subtasks.
- Input bit strings are length n tasks + n, with the first n tasks bits being the control bits and the last n bits being the task bits.
Power law scaling and emergence
- Trained ReLU MLPs with single hidden layer to solve task
- Input dimension is n tasks + n, output dimension is 2
- Adam optimizer with learning rate of 10-3
- Varying width of network by sampling batches online
- Loss follows reverse-S curve, undergoing “phase transition”
- Mean loss decreases as power law with α N ≈ α and α D ≈ α/(α + 1)
- Scaling w.r.t. parameters is noisier than data scaling
- Rough scale of data or parameters below which networks do not learn task
- Smooth power law scaling averages over many phase transitions
Decomposing empirical llm scaling
- Experiments were conducted using the “Pythia” model sequence from Eleuther
- The models were decoder-only transformers of varying size trained on the same data
- Seven models were evaluated on approximately 10 million tokens from the test set of The Pile
- Cross-entropy loss was recorded on every token
- Studied how neural scaling decomposes by looking at how the distribution of losses changes with scale
The distribution over per-token losses
- Mean loss of first 6 models in Pythia sequence fit by power law with exponent 0.083
- Probability distribution over per-token losses shows most losses close to zero
- Increasing model size increases portion of approximately-zero losses
A taxonomy: monogenic versus polygenic behaviors
- Quantization Hypothesis and multitask sparse parity study suggest that network performance benefits from a single quanta
- Scaling curves on individual examples show emergence
- Large language models show a variety of scaling behaviors
- Prediction problems can be monogenic or polygenic, with polygenicity forming a spectrum
Auto-discovering quanta with language model internals
- Attempt to auto-discover quanta in language modeling
- Partitioning inputs/outputs based on context is suboptimal
- Use internals of trained language models to cluster samples
- Quanta Discovery from Gradients (QDG) method proposed
- QDG clusters samples based on gradients pointing in similar directions
- 10000 samples chosen from The Pile with cross-entropy loss < 0.1 nats
- Clusters involve predicting same token for coherent reason
- Clusters for more abstract prediction rules
- Power law distribution over quanta utilization frequency
- Measured power law exponent ≈ -1.24
Discussion
- Quantization Model of neural scaling laws relies on Quantization Hypothesis
- Quantization Hypothesis posits that neural network performance can be understood with respect to a discrete set of computations and capabilities
- Quanta of prediction problem sorted into Q Sequence according to how frequently they are used for prediction
- Power law neural scaling when use frequencies of quanta given by power law
- Multitask sparse parity problem supports Quantization Hypothesis
- QDG method used to decompose LLM scaling curves and auto-discover quanta for language prediction
- Quantization Hypothesis suggests linearity or breakthroughness of a task influenced by distribution of quanta relevant to the task
- Mechanistic Interpretability: understanding neural networks could reduce to enumerating quanta
- Science of Deep Learning: study how engineering choices influence building blocks of model performance - the quanta
B additional results on multitask sparse parity
- Training dynamics: Loss decreases as an average over multiple reverse-S shaped curves
- Scaling for varying α: Power law scaling observed, but not precisely as predicted
- Parameter scaling: Relationship between α N and α deviates from prediction
- Step scaling: α S is consistently higher than theoretical prediction
C additional results on language models
- Figure 11 shows examples from clusters discovered with QDG.
- QDG is a computer science tool.
D the difficulty of estimating the power law exponent from clusters
- Distribution over elements in each cluster did not perfectly recover expected Zipf distribution
- Difficulty of accurately estimating Zipf distribution exponent with method
D.1 qdg on multitask sparse parity
- Performed QDG on multitask sparse parity
- Trained a width-500 single-hidden-layer ReLU
- Network achieved ≈ 0 loss
- Did not recover power law from samples
- Clear pattern where elements from same subtask have higher angular similarity
- Rank-frequency plot of clusters did not recover a slope of -1.4
D.2 a toy model of qdg uncertainty and bias
- Toy model developed to understand bias of spectral clustering
- Model assumes dataset has 1000 subtasks, each with a Gaussian distribution
- Similarity between two vectors is computed and input to spectral clustering algorithm
- Hyperparameters of toy model are embedding dimension and noise level
- High-dimension (d = 1000) large-noise (σ = 2.0) scheme best agrees with LLM results
- Estimating α from frequency curve is hard, envelope slope indicates α
E parameter and data scaling exponents across studies
- α N and α D (or possibly α S ) are shown for a variety of prior studies of deep learning scaling
- Scaling in model parameters (N ), training samples (D), and training time (S) can translate into scaling in n and therefore loss L
- Neural networks exhibit power law neural scaling parameters N , training time S, and training samples D
- Scaling behavior on individual subtasks exhibits emergence
- Scaling of mean test loss w.r.t. non-embedding parameters for the Eleuther Pythia models
- Distribution p(L) over losses on individual tokens for models of different size
- Training curves (scaling w.r.t. steps S) of mean test loss for Pythia models
- Distribution p(L) over time
- Distribution L • p(L) over time
- Scaling on individual tokens can have diverse behavior
- Comparing different scaling laws
- Training dynamics on the multitask sparse parity dataset consist of many “phase transitions”
- Number of subtasks learned (n) versus training samples D for a variety of α
- Scaling in parameters (N ), single-epoch training time (S), and multi-epoch training samples (D) for varying quanta power law distribution parameter α
- To understand the bias of spectral clustering, apply spectral clustering to a toy model
- Parameter and data scaling exponents from various studies of deep learning scaling
- Difficulty of measuring α from curves