Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs have the capacity to learn in-context from example demonstrations.
In-context learning relies on recombination of compositional operations found in natural language data.
Theoretical predictions are validated by introducing a controlled setup for inducing in-context learning.
In-context learning emerges when scaling parameters and data.
Models perform better when prompted to output intermediate steps.
Probing shows that in-context learning is supported by a representation of the input’s compositional structure.

Paper Content

A formal learnability bound for learning from demonstrations

The analysis considers idealized learners with access to infinite training data and capacity
General linguistically-motivated assumptions are enough to guarantee ICL capabilities for language models

Setup

Pretraining data and few-shot tasks are generated from a finite universe of objects
A spellout map maps objects to their names
A formalism, the Compositional Attribute Grammar (CAG), is proposed to describe the compositional structure and distribution of sentences and text
The formalism consists of a probabilistic context-free grammar and a yield operation
The yield operation takes into account a tree, attributes from the universe, and a source of randomness
Variables can be shared across subtrees and subtrees can be iterated
Documents are generated by sampling a tree and setting the yield to a string
Regularity assumptions are made about the formalism
Iteration complexity is associated with the formalism
Repetition is more complex than a single occurrence
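
To make the setup above concrete, here is a minimal toy sketch of a document sampler. It is not the paper's grammar: the universe, the two functions `f` and `g`, the identity `spellout` and the single loop-shaped tree are all hypothetical choices made for illustration. It only shows the ingredients named above: a finite universe, a spellout map, a variable shared across subtrees, and an iterated subtree.

```python
import random

# Hypothetical toy universe and functions; not the paper's actual grammar.
UNIVERSE = ["a", "b", "c", "d"]
FUNCTIONS = {
    "f": {"a": "b", "b": "c", "c": "d", "d": "a"},
    "g": {"a": "a", "b": "a", "c": "d", "d": "d"},
}

def spellout(obj):
    """Map an object to its name (assumed to be the identity here)."""
    return obj

def sample_document(rng, max_repeats=5):
    """Sample a tree shape, then set the document to its yield.

    The only tree shape in this sketch is a loop: one function is chosen
    once and shared by every iteration, and each iteration introduces a
    fresh variable x and emits x followed by the function applied to x.
    """
    fn_name = rng.choice(sorted(FUNCTIONS))   # shared across all iterations
    repeats = rng.randint(1, max_repeats)     # how often the subtree is iterated
    tokens = []
    for _ in range(repeats):
        x = rng.choice(UNIVERSE)              # variable introduced per iteration
        tokens.append(spellout(x))
        tokens.append(spellout(FUNCTIONS[fn_name][x]))
    return fn_name, tokens

rng = random.Random(0)
fn_name, doc = sample_document(rng)
print(fn_name, " ".join(doc))
```

A document produced this way already has the shape of a few-shot prompt: a run of input/output pairs that all share one underlying function.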

Learnability bound

In-context learning guarantees are provided for an idealized predictor that reflects the distribution of documents sampled from a CAG
The predictive distribution is given as #d(x1…n)
The learning bound is in terms of the description length of defining a function within the CAG
Theorem 1 states that the summed zero-one loss on completing P1…Pn is bounded in terms of the description length D[τφ] and the parameter Rn
The bound absorbs constants that depend on the PCFG backbone, but not on |Ω|
The bound is robust to changes in the prompt format
The function φ can also be taken as stochastic
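The intuition of the proof is that an optimal predictive model M implicitly identifies the generative process underlying the prompt, weighing an explanation in which the demonstrations are unrelated against one in which a single function φ is applied repeatedly
The key quantities modulating the preference for the second explanation are D[τφ] and Rn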
The key to in-context learning is the parameter Rn
Description length is assumed to be given by atomic nonterminals in the generative process
Theorem 1 is stated for prompts that simply concatenate inputs xi and outputs φ(xi)
Real-world LLMs can deal with other prompt formats
The theory predicts that naturalistic prompts are more successful than unnatural ones
Prompts assigned higher LLM likelihood tend to lead to better ICL results
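
The link between description length and cumulative error can be illustrated with a classical argument that is much simpler than Theorem 1 and is not the paper's proof. In the sketch below, a predictor weights each candidate function by 2 to the power of minus its description length, predicts by weighted plurality, and discards hypotheses that contradict the observed answer. Every mistake removes at least half of the remaining weight, so the number of mistakes is at most the description length (in bits) of the true function. The hypothesis class, the description lengths and the name `count_mistakes` are hypothetical.

```python
import math
from collections import defaultdict

def count_mistakes(hypotheses, prior, target, inputs):
    """Count mistakes of a prior-weighted plurality predictor.

    hypotheses: dict name -> dict mapping each input to an output
    prior:      dict name -> weight, here 2 ** -description_length
    target:     name of the function that actually generates the answers
    """
    weights = dict(prior)
    mistakes = 0
    for x in inputs:
        votes = defaultdict(float)
        for name, w in weights.items():
            votes[hypotheses[name][x]] += w
        # Plurality vote; sorting makes tie-breaking deterministic.
        prediction = max(sorted(votes), key=votes.get)
        truth = hypotheses[target][x]
        if prediction != truth:
            mistakes += 1
        # Keep only hypotheses consistent with the revealed answer.
        weights = {n: w for n, w in weights.items() if hypotheses[n][x] == truth}
    return mistakes

# Hypothetical hypothesis class over the universe {0, 1, 2, 3}.
hypotheses = {
    "identity": {x: x for x in range(4)},
    "successor": {x: (x + 1) % 4 for x in range(4)},
    "constant0": {x: 0 for x in range(4)},
    "double": {x: (2 * x) % 4 for x in range(4)},
}
description_length = {"identity": 1, "successor": 2, "constant0": 3, "double": 3}
prior = {name: 2.0 ** -bits for name, bits in description_length.items()}

mistakes = count_mistakes(hypotheses, prior, "double", [1, 2, 3, 0, 1])
bound = math.log2(1.0 / prior["double"])
print(mistakes, "<=", bound)
```

The bound here depends only on how cheaply the target function can be described, not on how many hypotheses exist in total, which mirrors the role description length plays in the statement above.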

Chain-of-thought prompting

Empirical research has observed that ICL for complex tasks benefits when models are prompted to provide intermediate steps before the answer
We formally study this in the context of computing composed functions
Chain-of-thought prompting corresponds to prompting the model to output an intermediate step and the result
Applying Theorem 1 to either direct prompting or a version with the intermediate step results in a bound depending on D[τφ1∘φ2]
We prove a better bound for the chain-of-thought version, where the intermediate step is provided before the answer
The error in each of the two steps can be bounded individually by the description of only one function
The proof idea is that in each step, the other function can be effectively ignored in inferring the compositional process
We focus on the composition of two functions, but an analogous statement and proof hold for longer composition chains
Xie et al. [2022] model the pretraining data as a mixture of HMMs and cast ICL as Bayesian identification of one of these mixture components
Our analysis likewise can be understood in terms of Bayesian inference
We aim to account for the flexible and open-ended nature of prompting capabilities in LLMs by leveraging the compositional nature of natural-language data
Xie et al. [2022] focus their discussion of ICL on entity-property associations, while we explain ICL as identifying a task from an open-ended hypothesis space of tasks
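
The benefit of intermediate steps can be illustrated with a small version-space toy, which is an analogy and not the paper's proof. With direct prompting, a demonstration only constrains the pair of functions jointly. With the intermediate step shown, each function is constrained by its own input/output pair. The four functions and the helper names below are hypothetical.

```python
from itertools import product

# Hypothetical function inventory over the universe {0, 1, 2, 3}.
FUNCTIONS = {
    "succ": lambda x: (x + 1) % 4,
    "pred": lambda x: (x - 1) % 4,
    "flip": lambda x: 3 - x,
    "half": lambda x: x // 2,
}

def consistent_pairs(examples):
    """Pairs (first, second) consistent with direct examples (x, second(first(x)))."""
    return [
        (a, b)
        for a, b in product(FUNCTIONS, repeat=2)
        if all(FUNCTIONS[b](FUNCTIONS[a](x)) == y for x, y in examples)
    ]

def consistent_single(examples):
    """Single functions consistent with examples (x, fn(x))."""
    return [n for n in FUNCTIONS if all(FUNCTIONS[n](x) == y for x, y in examples)]

first, second = "succ", "flip"
xs = [0]  # a single demonstration

# Direct prompting: only x and the final answer are visible.
direct = [(x, FUNCTIONS[second](FUNCTIONS[first](x))) for x in xs]
# Chain-of-thought prompting: the intermediate value is visible as well.
step1 = [(x, FUNCTIONS[first](x)) for x in xs]
step2 = [(FUNCTIONS[first](x), FUNCTIONS[second](FUNCTIONS[first](x))) for x in xs]

pairs_left = consistent_pairs(direct)
first_left = consistent_single(step1)
second_left = consistent_single(step2)
print(len(pairs_left), "joint candidates without the intermediate step")
print(len(first_left) * len(second_left), "joint candidates with it")
```

In this toy, each step searches over single functions only, which is the informal counterpart of bounding each step's error by the description of one function.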

Experiments

Training datasets

The information-theoretic analysis characterizes when broad ICL capabilities become possible for an idealized predictive model
The experiments empirically verify whether the predicted behavior can emerge in transformer models pretrained on finite data sampled from a CAG
A suite of in-context learning tasks is defined
Transformers pretrained on several types of controlled miniature datasets are benchmarked
Focus on functions of arity 1 for simplicity
Create documents with universe Ω = Σ, so that each object ω is spelled out as itself (ω ↦ ω)
Four datasets: FVPROMPT, HMM5, HMMPERDOC, COMPOSITIONAL
Intuitively, the COMPOSITIONAL dataset represents agents that have access to the world model and produce text according to arbitrary but compositionally structured instructions
Research questions: ICL emergence, predictions of Theorems 1-2, recombining operations, dynamics of emergence

Training setup

Trained GPT2-like models for next-token prediction
Varying numbers of dimensions, layers, and heads
CHAIN-OF-THOUGHT version produces intermediate step
Models need to identify task from prompt without instruction
No separate types for labels or separators
Focus on |Ω| = 30 and |F| = 10
Functions created randomly
Generated 500M tokens of training data
Data fed to model in portions of 64 tokens
Training performed for up to 20 epochs
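
The data-side settings listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: "functions created randomly" is read here as uniformly random maps on the universe, the trailing partial portion is dropped, and the names `make_random_functions` and `chunk` are hypothetical. The model and optimizer are omitted.

```python
import random

UNIVERSE_SIZE = 30   # |Ω| from the setup above
NUM_FUNCTIONS = 10   # |F| from the setup above
CHUNK_SIZE = 64      # tokens per portion fed to the model

def make_random_functions(universe_size, num_functions, rng):
    """Create random unary functions on {0, ..., universe_size - 1}.

    Assumption: each function maps every object to a uniformly random object.
    """
    universe = list(range(universe_size))
    return [
        {x: rng.choice(universe) for x in universe}
        for _ in range(num_functions)
    ]

def chunk(tokens, size=CHUNK_SIZE):
    """Split a token stream into consecutive portions of exactly `size` tokens.

    Assumption: a trailing portion shorter than `size` is dropped.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

rng = random.Random(0)
functions = make_random_functions(UNIVERSE_SIZE, NUM_FUNCTIONS, rng)

# Toy token stream standing in for the generated training corpus.
stream = [rng.randrange(UNIVERSE_SIZE) for _ in range(200)]
portions = chunk(stream)
print(len(functions), "functions,", len(portions), "portions of", CHUNK_SIZE, "tokens")
```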

Test tasks

Test tasks are defined using first-order logic formulas
Tasks are encoded into a prompt-based format
Tasks include function evaluation, inverse, and more complex formulas with two input variables
Tasks require reasoning about an unobserved variable or evaluating composed functions
Binary classification tasks require discriminating between two types of examples
Examples are balanced between classes
Model has no access to markup indicating components of the prompt
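
As a rough illustration of how the function evaluation and inverse tasks above could be laid out as prompts, the sketch below concatenates demonstrations with no separator or label tokens, so the task must be inferred from the sequence alone. The exact encoding used in the paper is not given here. The helper names and the three-object function are hypothetical, and the inverse task assumes the function is injective.

```python
def function_evaluation_prompt(fn, xs, query):
    """Demonstrations x, fn(x) followed by a query input; target is fn(query)."""
    tokens = []
    for x in xs:
        tokens += [x, fn[x]]
    tokens.append(query)
    return tokens, fn[query]

def inverse_prompt(fn, xs, query):
    """Demonstrations fn(x), x followed by fn(query); target is query.

    Assumes fn is injective, so that the inverse is well defined.
    """
    tokens = []
    for x in xs:
        tokens += [fn[x], x]
    tokens.append(fn[query])
    return tokens, query

# Hypothetical injective function on a three-object universe.
fn = {"a": "b", "b": "c", "c": "a"}

eval_tokens, eval_target = function_evaluation_prompt(fn, ["a", "b"], "c")
inv_tokens, inv_target = inverse_prompt(fn, ["a", "b"], "c")
print(" ".join(eval_tokens), "->", eval_target)
print(" ".join(inv_tokens), "->", inv_target)
```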

Results

Compositional training dataset enables ICL on composed tasks
Models trained on HMM5 cannot solve ICL tasks
Models trained on FVPROMPT can only solve function evaluation task
Models trained on HMMPERDOC achieve above-chance performance on PROPOSITIONAL tasks, but not on BINARY or COMPOSED tasks
Models trained on COMPOSITIONAL achieve near-perfect accuracy on FUNCTIONEVALUATION and other tasks with few literals, but not on BINARY tasks
Increasing |F| makes ICL harder; increasing |Ω| does not
Accuracy increases with prompt length
Accuracy follows a pattern of sudden emergence over the course of pretraining
LMs can recombine functions
CHAIN-OF-THOUGHT prompting facilitates recombining abilities that were never used together in pretraining
Ablating loops or variable introduction makes ICL impossible
Ablating conditions has no discernible negative impact
Real-world LLMs show similar results

Representation learning supports ICL

Our theoretical analysis argues that ICL relies on identifying the compositional generative process underlying a prompt
We provide evidence from the LM’s activations and attention patterns that they indeed induce the compositional structure underlying documents and prompts
We target the 21M-parameter model as it has a small number of heads and layers and yet is successful on almost all tasks
We visualize attention patterns in a chain-of-thought example
We analyze attention patterns across a sample of 300 random documents in the training corpus
We intervene on the top-level attention heads in the trained model by masking out attention logits to non-structurally-corresponding positions
We hypothesize that the model learns to encode the logical relations holding among the tokens close by
We prove information-theoretic bounds for in-context learning in an optimal predictor
We prove a benefit for prompting models to provide intermediate steps
We found that emergence of many tasks coincided with improvement in structural representations
We account for the benefit of providing intermediate steps
We target an open-ended space of compositionally created test tasks
We suggest that ICL can work because prompts are compressible into compositional generation processes
We have links to Algorithmic Information Theory and the Minimum Description Length principle
Related work found that ICL can result from combining corpora which individually do not give rise to it
Related work found that term frequencies in the pretraining dataset have a strong impact on ICL performance
Related work found that data-distributional properties such as a Zipfian distribution over classes are beneficial to ICL of held-out classes
Related work found that varying the assignment of classes to labels in pretraining improves ICL
Related work found that ICL relies in part on attention heads attending to structurally related positions earlier in the sequence
Related work suggests that grokking relates to the build-up of generalizable representations
Related work suggests that transformers can implement optimization algorithms in context
We suggest that in-context learning relies on identifying the compositional structure underlying a prompt
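
The attention intervention mentioned above (masking out attention logits to non-structurally-corresponding positions) can be sketched for a single query position as follows. This is a generic masked softmax, not the authors' code. How "structurally corresponding" positions are determined is not specified here, so the `allowed` flags are simply given as input, and the function name is hypothetical.

```python
import math

def masked_attention_weights(logits, allowed):
    """Softmax over attention logits after masking disallowed positions.

    logits:  attention logits from one query position to each key position
    allowed: booleans, True where the key position counts as structurally
             corresponding (assumed to be supplied by an external analysis)
    """
    if not any(allowed):
        raise ValueError("at least one position must remain unmasked")
    # Masked positions get -inf, so they receive exactly zero weight.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, allowed)]
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 1.0, -1.0]
allowed = [True, False, True, False]
weights = masked_attention_weights(logits, allowed)
print([round(w, 3) for w in weights])
```

Comparing task accuracy with and without such a mask gives one way to ask how much the head's behavior is carried by the structurally corresponding positions.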
