Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs have the capacity to learn in-context from example demonstrations.
- In-context learning relies on recombination of compositional operations found in natural language data.
- Theoretical predictions are validated by introducing a controlled setup for inducing in-context learning.
- In-context learning emerges when scaling parameters and data.
- Models perform better when prompted to output intermediate steps.
- Probing shows that in-context learning is supported by a representation of the input’s compositional structure.
Paper Content
A formal learnability bound for learning from demonstrations
- Learners have access to infinite training data and capacity
- General linguistically-motivated assumptions are enough to guarantee ICL capabilities for language models
Setup
- Pretraining data and few-shot tasks are generated from a finite universe of objects
- A spellout map maps objects to their names
- A formalism is proposed to describe the compositional structure and distribution of sentences and text
- The formalism consists of a probabilistic context-free grammar and a yield operation
- The yield operation takes into account a tree, attributes from the universe, and a source of randomness
- Variables can be shared across subtrees and subtrees can be iterated
- Documents are generated by sampling a tree and setting the yield to a string
- Regularity assumptions are made about the formalism
- Iteration complexity is associated with the formalism
- Repetition is more complex than a single occurrence
Learnability bound
- Provide in-context learning guarantees for an idealized predictor reflecting the distribution of documents sampled from a CAG
- The predictive distribution is given as #d(x1…n)
- The learning bound is in terms of the description length of defining a function within the CAG
- Theorem 1 states that the summed zero-one loss on completing P1…Pn is bounded by an equation
- The bound absorbs constants that depend on the PCFG backbone, but not on |Ω|
- The bound is robust to changes in the prompt format
- The function φ can also be taken as stochastic
- The intuition of the proof is that an optimal predictive model M implicitly identifies the generative process underlying the prompt
- The key quantities modulating the preference for the second explanation are D[τφ] and Rn
- The key to in-context learning is the parameter Rn
- Description length is assumed to be given by atomic nonterminals in the generative process
- Theorem 1 is stated for prompts that simply concatenate inputs xi and outputs φ(xi)
- Real-world LLMs can deal with other prompt formats
- The theory predicts that naturalistic prompts are more successful than unnatural ones
- Prompts assigned higher LLM likelihood tend to lead to better ICL results
Chain-of-thought prompting
- Empirical research has observed that ICL for complex tasks benefits when models are prompted to provide intermediate steps before the answer
- We formally study this in the context of computing composed functions
- Chain-of-thought prompting corresponds to prompting the model to output an intermediate step and the result
- Applying Theorem 1 to either direct prompting or a version with the intermediate step results in a bound depending on D[τ φ 1 •φ 2 ]
- We prove a better bound for the chain-of-thought version, where the intermediate step is provided before the answer
- The error in each of the two steps can be bounded individually by the description of only one function
- The proof idea is that in each step, the other function can be effectively ignored in inferring the compositional process
- We focus on the composition of two functions, but an analogous statement and proof hold for longer composition chains
- Xie et al. [2022] model the pretraining data as a mixture of HMMs and cast ICL as Bayesian identification of one of these mixture components
- Our analysis likewise can be understood in terms of Bayesian inference
- We aim to account for the flexible and open-ended nature of prompting capabilities in LLMs by leveraging the compositional nature of natural-language data
- Xie et al. [2022] focus their discussion of ICL on entity-property associations, while we explain ICL as identifying a task from an open-ended hypothesis space of tasks
Experiments
Training datasets
- Information-theoretic analysis characterizing when broad ICL capabilities become possible for an idealized predictive model
- Empirically verify whether the predicted behavior can emerge in transformer models pretrained on finite data sampled from a CAG
- Define a suite of in-context learning tasks
- Benchmark transformers pretrained on several types of controlled miniature datasets
- Focus on functions of arity 1 for simplicity
- Create documents with universe Ω = Σ and ω = ω
- 4 datasets: FVPROMPT, HMM5, HMMPERDOC, COMPOSITIONAL
- Intuitively, COMPOSITIONAL dataset represents agents that have access to the world model and produce text according to arbitrary but compositionally structured instructions
- Research questions: ICL emergence, predictions of Theorems 1-2, recombining operations, dynamics of emergence
Training setup
- Trained GPT2-like models for next-token prediction
- Varying numbers of dimensions, layers, and heads
- CHAIN-OF-THOUGHT version produces intermediate step
- Models need to identify task from prompt without instruction
- No separate types for labels or separators
- Focus on |Ω| = 30 and |F | = 10
- Functions created randomly
- Generated 500M tokens of training data
- Data fed to model in portions of 64 tokens
- Training performed for up to 20 epochs
Test tasks
- Test tasks are defined using first-order logic formulas
- Prompts are encoded into a prompt-based format
- Tasks include function evaluation, inverse, and more complex formulas with two input variables
- Tasks require reasoning about an unobserved variable or evaluating composed functions
- Binary classification tasks require discriminating between two types of examples
- Examples are balanced between classes
- Model has no access to markup indicating components of the prompt
Results
- Compositional training dataset enables ICL on composed tasks
- Models trained on HMM5 cannot solve ICL tasks
- Models trained on FVPROMPT can only solve function evaluation task
- Models trained on HMMPERDOC achieve above-chance performance on PROPOSITIONAL tasks, but not on BINARY or COMPOSED tasks
- Models trained on COMPOSITIONAL achieve near-perfect accuracy on FUNCTIONEVALUATION and other tasks with few literals, but not on BINARY tasks
- Increasing |F| makes ICL harder, increasing |Ω| does not
- Accuracy increases with prompt length
- Accuracy follows a pattern of sudden emergence over the course of pretraining
- LMs can recombine functions
- CHAINOFTHOUGHT facilitates recombining abilities that were never used together in pretraining
- Ablating loops or variable introduction makes ICL impossible
- Ablating conditions has no discernible negative impact
- Real-world LLMs show similar results
Representation learning supports icl
- Our theoretical analysis argues that ICL relies on identifying the compositional generative process underlying a prompt
- We provide evidence from the LM’s activations and attention patterns that they indeed induce the compositional structure underlying documents and prompts
- We target the 21M parameters model as it has a small number of heads and layers and yet is successful on almost all tasks
- We visualize attention patterns in a chain-of-thought example
- We analyze attention patterns across a sample of 300 random documents in the training corpus
- We intervene on the top-level attention heads in the trained model by masking out attention logits to non-structurally-corresponding positions
- We hypothesize that the model learns to encode the logical relations holding among the tokens close by
- We prove information-theoretic bounds for in-context learning in an optimal predictor
- We prove a benefit for prompting models to provide intermediate steps
- We found that emergence of many tasks coincided with improvement in structural representations
- We account for the benefit of providing intermediate steps
- We target an open-ended space of compositionally created test tasks
- We suggest that ICL can work because prompts are compressible into compositional generation processes
- We have links to Algorithmic Information Theory and the Minimum Description Length principle
- We found that ICL can result by combining corpora which individually do not give rise to it
- We found that term frequencies in the pretraining dataset had a strong impact on ICL performance
- We found that data-distributional properties such as a Zipfian distribution over classes were beneficial to ICL of held-out classes
- We found that varying the assignment of classes to labels in pretraining improved ICL
- We found that ICL relies in part on attention heads attending to structurally related positions earlier in the sequence
- We suggest that grokking relates to the build-up of generalizable representations
- We suggest that transformers can implement optimization algorithms in context
- We suggest that in-context learning relies on identifying the compositional structure underlying a prompt