Abstract

  • Proposed Quantization Model explains power law dropoff of loss with model and data size
  • Quantization Hypothesis states that learned network capabilities are quantized into discrete chunks
  • Power law in use frequencies explains observed power law scaling of loss
  • Validated prediction on toy datasets and studied scaling curves for large language models

Paper Content

Introduction

  • Larger neural networks trained on more data perform better than smaller neural networks trained on less data
  • Mean test loss decreases as a power law in both the number of network parameters and the number of training samples
  • Larger models often have emergent abilities, i.e. unexpected and qualitatively different behavior than smaller models
  • Understanding both facets of scaling is relevant to the future of deep learning
  • Recent studies of the internal workings of neural networks have found a variety of impressive algorithms learned by gradient descent
  • A natural question is whether such circuits are learned universally across models with different random initializations and across scales
  • The task of mechanistic interpretability may help to understand what deep neural networks are doing internally
  • The Quantization Hypothesis suggests that model performance is determined by which of these computations are successfully learned
  • A power law distribution over subtasks in data may lead to a power law marginal improvement in loss from learning additional quanta
  • If correct, the Quantization Hypothesis could have many implications for understanding neural networks

Theory

  • Modeling the distribution of text on the internet requires knowledge and diverse computations
  • Prediction of text requires computations to be present in models
  • Quantization Hypothesis: many natural prediction problems involve a discrete set of computations which are natural to learn and instrumental for reducing loss
  • Some abilities are more useful for reducing loss than others, leading to a natural ordering of the quanta
  • Scaling performance is determined by how many quanta are successfully learned
  • Quanta frequencies follow a power law
  • Loss is a function of quanta learned
  • Quanta are either monogenic or polygenic
  • Parameter scaling: network capacity is a bottleneck, number of quanta learned is proportional to model size
  • Data scaling: threshold of examples needed for quantum to be learned, power law distribution over quanta produces power law data scaling
  • Multi-epoch training: rate of convergence of SGD can bottleneck performance
  • Single-epoch training: number of quanta learned is proportional to number of training steps
  • Prior work: models of power law scaling w.r.t. model parameters, data feature-feature covariance matrix, features seen during training
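
The link between a power law over quanta and a power law in loss can be checked numerically. The sketch below assumes use frequencies of the Zipf form p_k ∝ k^-(α + 1), and assumes that each sample costs a fixed loss until its quantum is learned. The function names, the loss values 1.0 and 0.0, and the finite cutoff on the number of quanta are illustrative choices, not taken from the paper.

```python
import numpy as np


def zipf_frequencies(alpha, n_quanta):
    """Assumed use frequencies p_k proportional to k^-(alpha + 1), normalized to sum to 1."""
    k = np.arange(1, n_quanta + 1, dtype=np.float64)
    p = k ** -(alpha + 1.0)
    return p / p.sum()


def loss_with_n_learned(p, n_learned, loss_unlearned=1.0, loss_learned=0.0):
    """Mean loss when the n_learned most frequent quanta are learned and the rest are not."""
    learned_mass = p[:n_learned].sum()
    unlearned_mass = p[n_learned:].sum()
    return loss_learned * learned_mass + loss_unlearned * unlearned_mass


def fitted_exponent(alpha, n_quanta=1_000_000, ns=(100, 200, 400, 800, 1600)):
    """Fit L(n) ~ n^-exponent by least squares in log-log space and return the exponent."""
    p = zipf_frequencies(alpha, n_quanta)
    losses = [loss_with_n_learned(p, n) for n in ns]
    slope, _ = np.polyfit(np.log(ns), np.log(losses), 1)
    return -slope
```

Calling `fitted_exponent(1.0)` should return a value close to 1.0, so under these assumptions the loss falls roughly as n^-α in the number of quanta learned. If n is proportional to model size, as in the parameter scaling bullet, the same exponent carries over to scaling in parameters.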

Proof of concept: a toy dataset

  • Toy dataset consists of distinct subtasks with power law distribution in frequency
  • Power law neural scaling observed in data and parameters
  • Mechanism of neural scaling coincides with theory from Section 2
  • Possible for power law neural scaling to arise from Quantization Model

The “multitask sparse parity” dataset

  • The toy task consists of many subtasks, each of which is a variant of the “sparse parity” problem.
  • The sparse parity prediction problem is to compute the parity of a fixed subset of bits from a bit string of length n.
  • The multitask sparse parity task adds an additional parameter, the number of subtasks.
  • Input bit strings have length n_tasks + n: the first n_tasks bits are the control bits and the last n bits are the task bits.
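
The data layout above can be sketched as a small generator. This is an illustration under stated assumptions rather than the authors' code: it assumes the control bits one-hot encode the active subtask, that each subtask uses a fixed random subset of k task bits, and that subtasks are drawn with frequency proportional to rank^-(α + 1). The names `make_subsets` and `sample_batch` are hypothetical.

```python
import numpy as np


def make_subsets(n_tasks, n, k, rng):
    """Pick, for each subtask, a fixed subset of k task-bit positions (assumed design)."""
    return np.stack([rng.choice(n, size=k, replace=False) for _ in range(n_tasks)])


def sample_batch(subsets, n, alpha, batch_size, rng):
    """Sample inputs of length n_tasks + n and their parity labels."""
    n_tasks, _ = subsets.shape
    ranks = np.arange(1, n_tasks + 1, dtype=np.float64)
    probs = ranks ** -(alpha + 1.0)          # assumed power law over subtasks
    probs /= probs.sum()
    tasks = rng.choice(n_tasks, size=batch_size, p=probs)

    # Control bits: one-hot indicator of the active subtask (assumption).
    control = np.zeros((batch_size, n_tasks), dtype=np.int64)
    control[np.arange(batch_size), tasks] = 1

    # Task bits: uniformly random bit string of length n.
    task_bits = rng.integers(0, 2, size=(batch_size, n))

    # Label: parity of the task bits selected by the active subtask's subset.
    selected = np.take_along_axis(task_bits, subsets[tasks], axis=1)
    labels = selected.sum(axis=1) % 2

    x = np.concatenate([control, task_bits], axis=1)
    return x, labels, tasks
```

For example, with `rng = np.random.default_rng(0)` and `subsets = make_subsets(5, 10, 3, rng)`, the call `sample_batch(subsets, 10, 0.4, 64, rng)` returns inputs of shape (64, 15), matching the n_tasks + n input dimension.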

Power law scaling and emergence

  • Trained ReLU MLPs with single hidden layer to solve task
  • Input dimension is n_tasks + n, output dimension is 2
  • Adam optimizer with learning rate of 10^-3
  • Parameter scaling studied by training networks of varying width, with batches sampled online
  • Loss follows reverse-S curve, undergoing “phase transition”
  • Mean loss decreases as a power law, with α_N ≈ α and α_D ≈ α/(α + 1)
  • Scaling w.r.t. parameters is noisier than data scaling
  • Each subtask has a rough scale of data or parameters below which networks do not learn it
  • Smooth power law scaling averages over many phase transitions
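
How many sharp transitions can average into a smooth power law is easy to simulate. The sketch below is a toy model, not the paper's experiment. It assumes each subtask's loss is a reverse-S (logistic) curve in log D that drops once the subtask has been seen about `threshold` times, and that subtask frequencies follow p_k ∝ k^-(α + 1). The parameters `threshold` and `sharpness` are made up for illustration.

```python
import numpy as np


def mean_loss_curve(alpha, data_sizes, n_tasks=100_000, threshold=100.0, sharpness=4.0):
    """Frequency-weighted average of per-subtask reverse-S loss curves."""
    k = np.arange(1, n_tasks + 1, dtype=np.float64)
    p = k ** -(alpha + 1.0)
    p /= p.sum()

    # log of the dataset size at which subtask k is expected to appear `threshold` times
    log_needed = np.log(threshold / p)

    losses = []
    for d in data_sizes:
        # Per-subtask loss falls from 1 to 0 around its own transition point.
        z = sharpness * (np.log(d) - log_needed)
        per_task_loss = 1.0 / (1.0 + np.exp(z))
        losses.append(float((p * per_task_loss).sum()))
    return np.array(losses)


def data_scaling_exponent(alpha, data_sizes):
    """Fit mean loss ~ D^-exponent in log-log space and return the exponent."""
    losses = mean_loss_curve(alpha, data_sizes)
    slope, _ = np.polyfit(np.log(data_sizes), np.log(losses), 1)
    return -slope
```

With `data_sizes = np.logspace(5, 8, 13)` and α = 1, the fitted exponent should land near α/(α + 1) = 0.5, even though every individual subtask curve is a sharp transition.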

Decomposing empirical LLM scaling

  • Experiments were conducted using the “Pythia” model sequence from Eleuther
  • The models were decoder-only transformers of varying size trained on the same data
  • Seven models were evaluated on approximately 10 million tokens from the test set of The Pile
  • Cross-entropy loss was recorded on every token
  • Studied how neural scaling decomposes by looking at how the distribution of losses changes with scale
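
Recording a loss on every token amounts to taking the negative log-probability of each correct next token. The sketch below shows that computation on plain arrays. It assumes logits of shape (num_tokens, vocab_size) are already available from some model, and the 0.1 nat cutoff used to count "approximately zero" losses is an illustrative choice.

```python
import numpy as np


def per_token_cross_entropy(logits, targets):
    """Cross-entropy in nats for each token, given logits and target token ids."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def fraction_near_zero(losses, threshold=0.1):
    """Share of tokens whose loss is below the threshold (in nats)."""
    return float(np.mean(losses < threshold))
```

Comparing `fraction_near_zero` across models of different size is one simple way to summarize how the loss distribution shifts with scale.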

The distribution over per-token losses

  • Mean loss of first 6 models in Pythia sequence fit by power law with exponent 0.083
  • Probability distribution over per-token losses shows most losses close to zero
  • Increasing model size increases portion of approximately-zero losses

A taxonomy: monogenic versus polygenic behaviors

  • The Quantization Hypothesis and the multitask sparse parity study model performance on each sample as benefiting from a single quantum
  • Scaling curves on individual examples show emergence
  • Large language models show a variety of scaling behaviors
  • Prediction problems can be monogenic or polygenic, with polygenicity forming a spectrum

Auto-discovering quanta with language model internals

  • Attempt to auto-discover quanta in language modeling
  • Partitioning inputs/outputs based on context is suboptimal
  • Use internals of trained language models to cluster samples
  • Quanta Discovery from Gradients (QDG) method proposed
  • QDG clusters samples based on gradients pointing in similar directions
  • 10000 samples chosen from The Pile with cross-entropy loss < 0.1 nats
  • Clusters involve predicting same token for coherent reason
  • Other clusters capture more abstract prediction rules
  • Power law distribution over quanta utilization frequency
  • Measured power law exponent ≈ -1.24
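
The clustering step of QDG can be illustrated on toy vectors. The sketch below is a simplified stand-in, not the authors' implementation. It assumes per-sample gradients are already flattened into rows of a matrix, converts their cosine similarities into angular similarities, and splits the samples into two groups using the second eigenvector of the normalized graph Laplacian. This is a basic two-way form of spectral clustering; a full pipeline would extract many clusters.

```python
import numpy as np


def angular_similarity(grads):
    """Pairwise angular similarity 1 - arccos(cosine) / pi between gradient rows."""
    unit = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)  # clip guards against rounding error
    return 1.0 - np.arccos(cos) / np.pi


def spectral_bipartition(affinity):
    """Split samples into two clusters from a symmetric, non-negative affinity matrix."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: I - D^(-1/2) A D^(-1/2)
    laplacian = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    # The sign pattern of the second eigenvector gives the two-way partition.
    return (eigvecs[:, 1] > 0).astype(int)
```

Samples whose gradients point in similar directions receive a high affinity and tend to fall on the same side of the partition.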

Discussion

  • Quantization Model of neural scaling laws relies on Quantization Hypothesis
  • Quantization Hypothesis posits that neural network performance can be understood with respect to a discrete set of computations and capabilities
  • Quanta of prediction problem sorted into Q Sequence according to how frequently they are used for prediction
  • Power law neural scaling when use frequencies of quanta given by power law
  • Multitask sparse parity problem supports Quantization Hypothesis
  • QDG method used to decompose LLM scaling curves and auto-discover quanta for language prediction
  • Quantization Hypothesis suggests linearity or breakthroughness of a task influenced by distribution of quanta relevant to the task
  • Mechanistic Interpretability: understanding neural networks could reduce to enumerating quanta
  • Science of Deep Learning: study how engineering choices influence building blocks of model performance - the quanta

B Additional results on multitask sparse parity

  • Training dynamics: Loss decreases as an average over multiple reverse-S shaped curves
  • Scaling for varying α: Power law scaling observed, but not precisely as predicted
  • Parameter scaling: Relationship between α_N and α deviates from prediction
  • Step scaling: α_S is consistently higher than theoretical prediction

C Additional results on language models

  • Figure 11 shows examples from clusters discovered with QDG.
  • QDG (Quanta Discovery from Gradients) is the gradient-based clustering method used to find these clusters.

D The difficulty of estimating the power law exponent from clusters

  • Distribution over elements in each cluster did not perfectly recover expected Zipf distribution
  • Difficulty of accurately estimating Zipf distribution exponent with method
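
The estimation problem can be explored with a rank-frequency fit on cluster labels. The sketch below assumes cluster assignments are given as non-negative integer labels and fits a straight line to the rank-frequency curve in log-log space. The helper names and the `max_rank` cutoff are hypothetical. Even with perfectly recovered clusters, the fitted slope depends on the sample size and on which ranks enter the fit.

```python
import numpy as np


def rank_frequency(labels):
    """Return ranks 1..K and the matching cluster frequencies, sorted descending."""
    counts = np.bincount(labels)
    counts = np.sort(counts[counts > 0])[::-1]
    ranks = np.arange(1, len(counts) + 1)
    return ranks, counts / counts.sum()


def loglog_slope(ranks, freqs, max_rank=None):
    """Least-squares slope of log(frequency) against log(rank) over the top ranks."""
    r = ranks[:max_rank]
    f = freqs[:max_rank]
    slope, _ = np.polyfit(np.log(r), np.log(f), 1)
    return slope
```

One way to probe the bias is to draw labels from a known Zipf distribution, vary the number of samples and `max_rank`, and see how far the fitted slope drifts from the true one.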

D.1 QDG on multitask sparse parity

  • Performed QDG on multitask sparse parity
  • Trained a width-500 single-hidden-layer ReLU MLP
  • Network achieved ≈ 0 loss
  • Did not recover power law from samples
  • Clear pattern where elements from same subtask have higher angular similarity
  • Rank-frequency plot of clusters did not recover a slope of -1.4

D.2 A toy model of QDG uncertainty and bias

  • Toy model developed to understand bias of spectral clustering
  • Model assumes dataset has 1000 subtasks, each with a Gaussian distribution
  • Similarity between two vectors is computed and input to spectral clustering algorithm
  • Hyperparameters of toy model are embedding dimension and noise level
  • High-dimension (d = 1000) large-noise (σ = 2.0) scheme best agrees with LLM results
  • Estimating α from frequency curve is hard, envelope slope indicates α

E Parameter and data scaling exponents across studies

  • α_N and α_D (or possibly α_S) are shown for a variety of prior studies of deep learning scaling
  • Scaling in model parameters (N), training samples (D), and training time (S) can translate into scaling in the number of quanta learned n and therefore in the loss L
  • Neural networks exhibit power law scaling in parameters N, training time S, and training samples D
  • Scaling behavior on individual subtasks exhibits emergence
  • Scaling of mean test loss w.r.t. non-embedding parameters for the Eleuther Pythia models
  • Distribution p(L) over losses on individual tokens for models of different size
  • Training curves (scaling w.r.t. steps S) of mean test loss for Pythia models
  • Distribution p(L) over time
  • Distribution L · p(L) over time
  • Scaling on individual tokens can have diverse behavior
  • Comparing different scaling laws
  • Training dynamics on the multitask sparse parity dataset consist of many “phase transitions”
  • Number of subtasks learned (n) versus training samples D for a variety of α
  • Scaling in parameters (N), single-epoch training time (S), and multi-epoch training samples (D) for varying quanta power law distribution parameter α
  • To understand the bias of spectral clustering, apply spectral clustering to a toy model
  • Parameter and data scaling exponents from various studies of deep learning scaling
  • Difficulty of measuring α from curves