Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • LLMs have the capacity to learn in-context from example demonstrations.
  • In-context learning relies on recombination of compositional operations found in natural language data.
  • Theoretical predictions are validated by introducing a controlled setup for inducing in-context learning.
  • In-context learning emerges when scaling parameters and data.
  • Models perform better when prompted to output intermediate steps.
  • Probing shows that in-context learning is supported by a representation of the input’s compositional structure.

Paper Content

A formal learnability bound for learning from demonstrations

  • The analysis assumes an idealized learner with access to unlimited training data and capacity
  • General linguistically-motivated assumptions are enough to guarantee ICL capabilities for language models

Setup

  • Pretraining data and few-shot tasks are generated from a finite universe of objects
  • A spellout map maps objects to their names
  • A formalism, the Compositional Attribute Grammar (CAG), is proposed to describe the compositional structure and distribution of sentences and text
  • The formalism consists of a probabilistic context-free grammar and a yield operation
  • The yield operation takes into account a tree, attributes from the universe, and a source of randomness
  • Variables can be shared across subtrees and subtrees can be iterated
  • Documents are generated by sampling a tree and setting the yield to a string
  • Regularity assumptions are made about the formalism
  • Iteration complexity is associated with the formalism
  • Repetition is more complex than a single occurrence
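
The generative process described above can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the paper's formalism: the names spellout, make_functions, and sample_document are hypothetical, the grammar is reduced to a single iterated subtree, and the only shared variable is the choice of function.

```python
import random

def spellout(obj):
    """Hypothetical spellout map: turn an object into its name (a token)."""
    return f"o{obj}"

def make_functions(universe, n_functions, rng):
    """Create random unary functions over the universe, stored as lookup tables."""
    return {f"f{i}": {x: rng.choice(universe) for x in universe}
            for i in range(n_functions)}

def sample_document(universe, functions, rng, max_iters=4):
    """Sample one toy document.

    A function is chosen once and shared across all iterations of the
    subtree (the shared variable); each iteration draws a fresh object
    and yields the names of the object and of its image.
    """
    name = rng.choice(sorted(functions))
    fn = functions[name]
    tokens = []
    for _ in range(rng.randint(1, max_iters)):
        x = rng.choice(universe)
        tokens.extend([spellout(x), spellout(fn[x])])
    return name, tokens

rng = random.Random(0)
universe = list(range(5))
functions = make_functions(universe, 3, rng)
name, tokens = sample_document(universe, functions, rng)
print(name, " ".join(tokens))
```

A document with several iterations looks exactly like a few-shot prompt for the chosen function, which is the connection the setup is designed to expose.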

Learnability bound

  • Provide in-context learning guarantees for an idealized predictor reflecting the distribution of documents sampled from a CAG
  • The predictive distribution is given as #d(x1…n)
  • The learning bound is in terms of the description length of defining a function within the CAG
  • Theorem 1 bounds the summed zero-one loss on completing P1…Pn in terms of the quantities Rn and D[τφ]
  • The bound absorbs constants that depend on the PCFG backbone, but not on |Ω|
  • The bound is robust to changes in the prompt format
  • The function φ can also be taken as stochastic
  • The intuition of the proof is that an optimal predictive model M implicitly identifies the generative process underlying the prompt
  • The key quantities modulating the preference for the second explanation are D[τφ] and Rn
  • The key to in-context learning is the parameter Rn
  • Description length is assumed to be given by atomic nonterminals in the generative process
  • Theorem 1 is stated for prompts that simply concatenate inputs xi and outputs φ(xi)
  • Real-world LLMs can deal with other prompt formats
  • The theory predicts that naturalistic prompts are more successful than unnatural ones
  • Prompts assigned higher LLM likelihood tend to lead to better ICL results
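
The role of description length in such a bound can be illustrated with a classical toy analog. This is not the paper's predictor or proof: it assumes a finite set of deterministic hypotheses, each with prior weight 2^-D(h), and a predictor that takes a weighted plurality vote and discards hypotheses contradicted by the revealed answer. If the prior weights sum to at most 1, every mistake at least halves the remaining weight, so the total number of mistakes is at most D(h*) for the true hypothesis h*. The names count_mistakes, hypotheses, and prior_bits are hypothetical.

```python
from collections import defaultdict

def count_mistakes(hypotheses, prior_bits, target, inputs):
    """Run weighted plurality-vote prediction and count the mistakes.

    hypotheses: dict mapping a name to a deterministic function
    prior_bits: dict mapping a name to its description length D(h) in bits
    target:     name of the true hypothesis
    """
    weights = {h: 2.0 ** -prior_bits[h] for h in hypotheses}
    mistakes = 0
    for x in inputs:
        votes = defaultdict(float)
        for h, w in weights.items():
            votes[hypotheses[h](x)] += w
        prediction = max(votes, key=votes.get)
        truth = hypotheses[target](x)
        if prediction != truth:
            mistakes += 1
        # Keep only the hypotheses consistent with the revealed answer.
        weights = {h: w for h, w in weights.items() if hypotheses[h](x) == truth}
    return mistakes

# Seven "add k modulo 7" hypotheses, each described with 3 bits (total weight 7/8).
hypotheses = {f"add{k}": (lambda x, k=k: (x + k) % 7) for k in range(7)}
prior_bits = {name: 3 for name in hypotheses}
mistakes = count_mistakes(hypotheses, prior_bits, "add5", inputs=range(20))
print(mistakes, "mistakes; bound:", prior_bits["add5"])
```

The mistake count does not grow with the number of inputs, only with the description length of the target, which mirrors the flavor of a bound stated in terms of D[τφ].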

Chain-of-thought prompting

  • Empirical research has observed that ICL for complex tasks benefits when models are prompted to provide intermediate steps before the answer
  • We formally study this in the context of computing composed functions
  • Chain-of-thought prompting corresponds to prompting the model to output an intermediate step and the result
  • Applying Theorem 1 to either direct prompting or a version with the intermediate step results in a bound depending on D[τφ1∘φ2]
  • We prove a better bound for the chain-of-thought version, where the intermediate step is provided before the answer
  • The error in each of the two steps can be bounded individually by the description of only one function
  • The proof idea is that in each step, the other function can be effectively ignored in inferring the compositional process
  • We focus on the composition of two functions, but an analogous statement and proof hold for longer composition chains
  • Xie et al. [2022] model the pretraining data as a mixture of HMMs and cast ICL as Bayesian identification of one of these mixture components
  • Our analysis likewise can be understood in terms of Bayesian inference
  • We aim to account for the flexible and open-ended nature of prompting capabilities in LLMs by leveraging the compositional nature of natural-language data
  • Xie et al. [2022] focus their discussion of ICL on entity-property associations, while we explain ICL as identifying a task from an open-ended hypothesis space of tasks
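
The intuition that each chain-of-thought step can be handled separately can be sketched by counting which function pairs remain consistent with a set of demonstrations. This is an illustrative sketch, not the paper's proof: make_demos and consistent_pairs are hypothetical names, functions are random lookup tables, and the convention that f1 is applied before f2 is an assumption.

```python
import itertools
import random

def make_demos(f1, f2, xs):
    """Build (input, intermediate step, answer) triples for the composed function."""
    return [(x, f1[x], f2[f1[x]]) for x in xs]

def consistent_pairs(functions, demos, with_intermediate):
    """Return the (f1, f2) name pairs consistent with every demonstration.

    With the intermediate step visible, f1 is constrained only by the
    (input, intermediate) pairs and f2 only by the (intermediate, answer)
    pairs, so the two functions can be identified independently.
    """
    pairs = []
    for n1, n2 in itertools.product(functions, repeat=2):
        f1, f2 = functions[n1], functions[n2]
        if with_intermediate:
            ok = all(f1[x] == m and f2[m] == y for x, m, y in demos)
        else:
            ok = all(f2[f1[x]] == y for x, _, y in demos)
        if ok:
            pairs.append((n1, n2))
    return pairs

rng = random.Random(1)
universe = list(range(6))
functions = {f"f{i}": [rng.choice(universe) for _ in universe] for i in range(4)}
demos = make_demos(functions["f0"], functions["f1"], xs=[0, 3])
direct = consistent_pairs(functions, demos, with_intermediate=False)
cot = consistent_pairs(functions, demos, with_intermediate=True)
print(len(direct), "pairs consistent without the step,", len(cot), "with it")
```

Every pair consistent with the chain-of-thought demonstrations is also consistent with the direct ones, so showing the intermediate step can only narrow the set of candidate explanations.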

Experiments

Training datasets

  • The information-theoretic analysis characterizes when broad ICL capabilities become possible for an idealized predictive model
  • The experiments empirically verify whether the predicted behavior can emerge in transformer models pretrained on finite data sampled from a CAG
  • Define a suite of in-context learning tasks
  • Benchmark transformers pretrained on several types of controlled miniature datasets
  • Focus on functions of arity 1 for simplicity
  • Create documents with universe Ω = Σ, so that each object ω is spelled out as itself
  • 4 datasets: FVPROMPT, HMM5, HMMPERDOC, COMPOSITIONAL
  • Intuitively, COMPOSITIONAL dataset represents agents that have access to the world model and produce text according to arbitrary but compositionally structured instructions
  • Research questions: ICL emergence, predictions of Theorems 1-2, recombining operations, dynamics of emergence

Training setup

  • Trained GPT2-like models for next-token prediction
  • Varying numbers of dimensions, layers, and heads
  • CHAIN-OF-THOUGHT version produces intermediate step
  • Models need to identify task from prompt without instruction
  • No separate types for labels or separators
  • Focus on |Ω| = 30 and |F| = 10
  • Functions created randomly
  • Generated 500M tokens of training data
  • Data fed to model in portions of 64 tokens
  • Training performed for up to 20 epochs
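
The data-side settings above can be sketched as follows. This is a minimal sketch with hypothetical helper names (make_random_functions, chunk_tokens); it assumes that "created randomly" means uniformly random lookup tables and that a trailing portion shorter than 64 tokens is dropped, neither of which is stated in the summary.

```python
import random

def make_random_functions(universe_size, n_functions, rng):
    """Create n_functions random unary functions over {0, ..., universe_size - 1}."""
    universe = list(range(universe_size))
    return [[rng.choice(universe) for _ in universe] for _ in range(n_functions)]

def chunk_tokens(tokens, chunk_len=64):
    """Split a token stream into consecutive portions of chunk_len tokens.

    Assumption: an incomplete trailing portion is dropped.
    """
    last_start = len(tokens) - chunk_len
    return [tokens[i:i + chunk_len] for i in range(0, last_start + 1, chunk_len)]

rng = random.Random(0)
functions = make_random_functions(universe_size=30, n_functions=10, rng=rng)
chunks = chunk_tokens(list(range(200)), chunk_len=64)
print(len(functions), "functions,", len(chunks), "portions of", len(chunks[0]), "tokens")
```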

Test tasks

  • Test tasks are defined using first-order logic formulas
  • Prompts are encoded into a prompt-based format
  • Tasks include function evaluation, inverse, and more complex formulas with two input variables
  • Tasks require reasoning about an unobserved variable or evaluating composed functions
  • Binary classification tasks require discriminating between two types of examples
  • Examples are balanced between classes
  • Model has no access to markup indicating components of the prompt
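
A prompt for the function evaluation and inverse tasks can be sketched as a plain concatenation of demonstrations followed by a query, with no separator or label tokens. The helper build_prompt and its inverse flag are hypothetical, and the exact encoding used in the paper may differ.

```python
import random

def build_prompt(fn, universe, n_demos, rng, inverse=False):
    """Build a few-shot prompt as a flat token list, plus one valid answer.

    Forward task: demonstrations are (x, fn[x]); the query is a new x.
    Inverse task: demonstrations are (fn[x], x); the query is fn[x].
    For a non-injective fn, other preimages of the query are equally valid.
    """
    xs = [rng.choice(universe) for _ in range(n_demos + 1)]
    pairs = [(fn[x], x) if inverse else (x, fn[x]) for x in xs]
    prompt = [tok for pair in pairs[:-1] for tok in pair]
    query, answer = pairs[-1]
    prompt.append(query)
    return prompt, answer

rng = random.Random(0)
universe = list(range(30))
fn = [rng.choice(universe) for _ in universe]
prompt, answer = build_prompt(fn, universe, n_demos=4, rng=rng)
print(prompt, "->", answer)
```

Because nothing in the token sequence marks which tokens are inputs and which are outputs, the model has to infer the prompt's structure on its own, as the last bullet above notes.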

Results

  • Compositional training dataset enables ICL on composed tasks
  • Models trained on HMM5 cannot solve ICL tasks
  • Models trained on FVPROMPT can only solve function evaluation task
  • Models trained on HMMPERDOC achieve above-chance performance on PROPOSITIONAL tasks, but not on BINARY or COMPOSED tasks
  • Models trained on COMPOSITIONAL achieve near-perfect accuracy on FUNCTIONEVALUATION and other tasks with few literals, but not on BINARY tasks
  • Increasing |F| makes ICL harder, increasing |Ω| does not
  • Accuracy increases with prompt length
  • Accuracy follows a pattern of sudden emergence over the course of pretraining
  • LMs can recombine functions
  • CHAIN-OF-THOUGHT facilitates recombining abilities that were never used together in pretraining
  • Ablating loops or variable introduction makes ICL impossible
  • Ablating conditions has no discernible negative impact
  • Real-world LLMs show similar results

Representation learning supports ICL

  • Our theoretical analysis argues that ICL relies on identifying the compositional generative process underlying a prompt
  • We provide evidence from the LM’s activations and attention patterns that the model indeed induces the compositional structure underlying documents and prompts
  • We target the 21M-parameter model, as it has a small number of heads and layers and yet is successful on almost all tasks
  • We visualize attention patterns in a chain-of-thought example
  • We analyze attention patterns across a sample of 300 random documents in the training corpus
  • We intervene on the top-level attention heads in the trained model by masking out attention logits to non-structurally-corresponding positions
  • We hypothesize that the model learns to encode the logical relations holding among the tokens close by
  • We prove information-theoretic bounds for in-context learning in an optimal predictor
  • We prove a benefit for prompting models to provide intermediate steps
  • We found that emergence of many tasks coincided with improvement in structural representations
  • We account for the benefit of providing intermediate steps
  • We target an open-ended space of compositionally created test tasks
  • We suggest that ICL can work because prompts are compressible into compositional generation processes
  • We have links to Algorithmic Information Theory and the Minimum Description Length principle
  • Prior work found that ICL can result from combining corpora which individually do not give rise to it
  • Prior work found that term frequencies in the pretraining dataset had a strong impact on ICL performance
  • Prior work found that data-distributional properties such as a Zipfian distribution over classes were beneficial to ICL of held-out classes
  • Prior work found that varying the assignment of classes to labels in pretraining improved ICL
  • We found that ICL relies in part on attention heads attending to structurally related positions earlier in the sequence
  • Prior work suggests that grokking relates to the build-up of generalizable representations
  • Prior work suggests that transformers can implement optimization algorithms in context
  • We suggest that in-context learning relies on identifying the compositional structure underlying a prompt
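
The attention intervention mentioned above, masking out attention logits to non-structurally-corresponding positions, can be sketched for a single query position in pure Python. The function masked_softmax and the allowed mask are hypothetical; how structural correspondence is determined, and which heads are masked, is not specified in this summary.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over one row of attention logits, restricted to allowed positions.

    logits:  attention logits from one query position to each key position
    allowed: booleans, True where the key position is kept; all other
             logits are set to -inf and receive zero attention weight
    """
    if not any(allowed):
        raise ValueError("at least one position must be allowed")
    masked = [l if a else float("-inf") for l, a in zip(logits, allowed)]
    top = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_softmax([1.0, 2.0, 3.0], allowed=[True, False, True])
print(weights)
```

Comparing task accuracy with and without such a mask is one way to test whether a head's useful signal comes from the structurally related positions.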