Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly.
  • The paper formalizes ICL as an algorithm learning problem, in which the transformer model acts as a learning algorithm that can be specialized via training.
  • Multitask learning is used to obtain generalization bounds for ICL.
  • Transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.

Paper Content

Problem setup

  • Let X be the input feature space and Y be the output/label space
  • Use boldface for vector variables
  • [n] denotes the set {1, 2, ..., n}
  • $\|\cdot\|_{\ell_p}$ denotes the $\ell_p$-norm
  • Prompts are obtained from a sequence of i.i.d. (input, label) pairs or a trajectory of a dynamical system
  • Model aims to predict the next output ŷ
  • Training phase of ICL involves multitask learning
  • Training sequences $S_t = (z_{ti})_{i=1}^{n}$ are drawn from the task data distributions $D_t$
  • Model is abstracted as a learning algorithm that maps a sequence of data to a prediction function
  • Training objective is formulated as searching for the optimal algorithm Alg ∈ A
  • Task-specific loss $\hat{L}_t(\mathrm{Alg})$ is an empirical average of n terms, one per position in the prompt
  • Primary interest is controlling the gap between empirical and population risks
  • Evaluate the model on previously-unseen tasks (transfer learning)
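
The training objective described in this section can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the paper's implementation: `ridge_alg` is a hypothetical stand-in for the learned algorithm Alg (in the paper, Alg is a transformer), the loss is squared error, and the synthetic linear tasks are invented for the demo.

```python
import numpy as np

def ridge_alg(X_ctx, y_ctx, lam=1.0):
    """Hypothetical stand-in for Alg: maps a context sequence to a prediction function."""
    d = X_ctx.shape[1]
    # Closed-form ridge fit; an empty context yields the all-zero predictor.
    w = np.linalg.solve(X_ctx.T @ X_ctx + lam * np.eye(d), X_ctx.T @ y_ctx)
    return lambda x: float(x @ w)

def task_loss(alg, X, y):
    """Average loss over prompt positions: predict y[i] from the first i examples."""
    losses = []
    for i in range(len(y)):
        f = alg(X[:i], y[:i])          # prediction function built from the prefix
        losses.append((y[i] - f(X[i])) ** 2)
    return float(np.mean(losses))

def mtl_objective(alg, tasks):
    """Empirical multitask objective: average of the task-specific losses."""
    return float(np.mean([task_loss(alg, X, y) for X, y in tasks]))

# Synthetic demo data (assumption): T linear-regression tasks with n examples each.
rng = np.random.default_rng(0)
T, n, d = 4, 20, 3
tasks = []
for _ in range(T):
    w_t = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_t + 0.1 * rng.normal(size=n)
    tasks.append((X, y))

objective = mtl_objective(ridge_alg, tasks)
print(f"empirical MTL objective for ridge_alg: {objective:.4f}")
```

Searching over Alg ∈ A would then correspond to comparing `mtl_objective` across candidate algorithms, for instance across different values of `lam`.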

Generalization guarantees for in-context learning

  • Study of i.i.d. data setting with training sequences
  • Training example (x, y) in the prompt impacts all future decisions of the algorithm
  • Stability condition borrowed from algorithmic stability literature
  • Theorem 1 shows that a multilayer transformer obeys the stability condition
  • Definitions 1 and 2 introduce covering numbers and the algorithm distance
  • Theorem 2 establishes generalization bounds for ICL-ERM
  • Excess MTL test risk is bounded by a term of order 1/√(nT), where n is the prompt length and T is the number of tasks
  • Excess risk vanishes as n and m → ∞
  • Using M sequences per task improves the statistical error rate to 1/√(nMT)
  • MTL vs transfer learning contrasted by letting n → ∞
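
The stability condition and the resulting rates can be written schematically as follows. This is a reading of the paper's statements with constants and logarithmic factors omitted, not a verbatim restatement: here $S$ is a prompt of length $m$, $S^{j}$ is the same prompt with its $j$-th example replaced, $K$ is the stability constant, $\mathcal{N}(\mathcal{A}, \rho, \varepsilon)$ is the covering number of the algorithm class $\mathcal{A}$ under the algorithm distance $\rho$, $n$ is the prompt length, $T$ the number of tasks, and $M$ the number of sequences per task.

```latex
% Error stability (schematic): replacing one in-context example changes the expected loss by at most K/m
\left| \mathbb{E}_{(x,y)}\!\left[ \ell\big(y, f^{\mathrm{Alg}}_{S}(x)\big)
      - \ell\big(y, f^{\mathrm{Alg}}_{S^{j}}(x)\big) \right] \right| \le \frac{K}{m}

% Excess MTL risk of the empirical risk minimizer (constants and log factors omitted)
R_{\mathrm{MTL}}(\widehat{\mathrm{Alg}}) \lesssim
  \sqrt{\frac{\log \mathcal{N}(\mathcal{A}, \rho, \varepsilon)}{nT}}

% With M sequences per task, the denominator gains a factor of M
R_{\mathrm{MTL}}(\widehat{\mathrm{Alg}}) \lesssim
  \sqrt{\frac{\log \mathcal{N}(\mathcal{A}, \rho, \varepsilon)}{nMT}}
```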

Generalization and inductive bias on unseen tasks

  • Explore transfer learning to assess performance of in-context learning
  • Consider standard meta-learning setup with source tasks drawn from task distribution
  • Evaluate transfer risk in terms of MTL risk
  • Transfer risk decays as 1/poly(T), where T is the number of source tasks
  • Rather than model complexity, problem complexity matters
  • Optimal prediction in terms of average risk can be described explicitly
  • Need ∝ log() samples for MTL algorithm to perform well on new prompts
  • Performance improves as number of tasks gets smaller
  • Transfer risk can be bounded in terms of distance of target task to source tasks
  • Task similarity is predictive of transfer performance
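
One familiar setting in which the risk-optimal predictor has an explicit form is linear regression with a Gaussian prior over task vectors, where the optimum is a weighted ridge estimator. The sketch below assumes that setting. The Gaussian prior `Sigma`, the noise model, and the function name `bayes_optimal_weights` are assumptions made for illustration and are not taken from the summary above.

```python
import numpy as np

def bayes_optimal_weights(X, y, Sigma, noise_var):
    """Posterior mean of w, assuming w ~ N(0, Sigma) and y = X w + N(0, noise_var * I) noise."""
    # (X^T X + noise_var * Sigma^{-1})^{-1} X^T y, i.e. ridge weighted by the prior covariance
    precision = X.T @ X + noise_var * np.linalg.inv(Sigma)
    return np.linalg.solve(precision, X.T @ y)

# Synthetic demo (assumption): one task drawn from an anisotropic prior.
rng = np.random.default_rng(0)
n, d = 30, 3
Sigma = np.diag([4.0, 1.0, 0.25])                 # prior covariance of the task vector
w_true = rng.multivariate_normal(np.zeros(d), Sigma)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = bayes_optimal_weights(X, y, Sigma, noise_var=0.01)
print("estimation error:", float(np.linalg.norm(w_hat - w_true)))
```

With `Sigma` set to the identity this reduces to ordinary ridge regression, which is one way to see how the task distribution shapes the predictor that training on many tasks would favor.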

Insights on model selection

  • ICL can reduce generalization error by increasing sample size or number of sequences per task.
  • A transformer can implement ERM algorithms up to a certain accuracy.
  • Model selection can be formalized as selecting the right hypothesis class to strike a good bias-variance tradeoff.
  • Hypothesis 1 states that a transformer can implement an algorithm competitive with ERM.
  • There is a minimum achievable risk over a function set.
  • ICL adaptively selects classes to achieve small risk.
  • VC-dimension of the algorithm class is as large as log.
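
Model selection across hypothesis classes can be illustrated with a classical hold-out procedure. This is a sketch of the bias-variance tradeoff the section refers to, not of how a transformer performs the selection internally. The nested polynomial classes, the validation split, and the names `select_class` and `degrees` are hypothetical choices made for the example.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def select_class(x_train, y_train, x_val, y_val, degrees):
    """Run least-squares ERM in each class (polynomials of a given degree)
    and return the degree with the smallest validation risk."""
    risks = {}
    for deg in degrees:
        coef = P.polyfit(x_train, y_train, deg)      # ERM within the class
        pred = P.polyval(x_val, coef)
        risks[deg] = float(np.mean((y_val - pred) ** 2))
    best = min(risks, key=risks.get)
    return best, risks

# Synthetic demo (assumption): the true function is quadratic.
rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 - 2.0 * x + 3.0 * x ** 2 + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = sample(40)
x_val, y_val = sample(40)
best, risks = select_class(x_train, y_train, x_val, y_val, degrees=range(7))
print("selected degree:", best)
```

Classes that are too small (degree 0 or 1) suffer from bias, while larger classes pay a variance cost; the selected class is the one that balances the two on held-out data.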

Extending results to stable dynamical systems

Numerical evaluations

  • Linear regression experiments with in-context examples
  • Results displayed in Figures 2(a) & (b)
  • In-context learning outperforms least squares results
  • Automated model selection ability of transformers
  • Stability analysis of four function classes
  • Risk change decreases as in-context sample size increases
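
The stability measurement can be mimicked with a small simulation: replace one in-context example and record how much the loss on a query point changes. The sketch below uses ridge regression as a hypothetical stand-in for the transformer, with synthetic linear tasks, so it illustrates the style of the experiment and does not reproduce the paper's results. All names and parameter values are assumptions.

```python
import numpy as np

def ridge_predict(X, y, x_query, lam=1.0):
    """Ridge prediction at x_query from the in-context examples (X, y)."""
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return float(x_query @ w)

def avg_loss_change(n, d=5, trials=200, noise=0.1, seed=0):
    """Average absolute change in squared loss when one of n examples is replaced."""
    rng = np.random.default_rng(seed)
    changes = []
    for _ in range(trials):
        w_true = rng.normal(size=d)
        X = rng.normal(size=(n, d))
        y = X @ w_true + noise * rng.normal(size=n)
        x_q = rng.normal(size=d)
        y_q = x_q @ w_true + noise * rng.normal()
        # Replace one randomly chosen example with a fresh draw from the same task.
        X2, y2 = X.copy(), y.copy()
        j = rng.integers(n)
        X2[j] = rng.normal(size=d)
        y2[j] = X2[j] @ w_true + noise * rng.normal()
        loss_a = (y_q - ridge_predict(X, y, x_q)) ** 2
        loss_b = (y_q - ridge_predict(X2, y2, x_q)) ** 2
        changes.append(abs(loss_a - loss_b))
    return float(np.mean(changes))

for n in (10, 40, 160):
    print(n, avg_loss_change(n))
```

In this toy setting the average loss change shrinks as the number of in-context examples grows, which is the qualitative behavior a stability condition asks for.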

Conclusions and future directions

  • Approach prompt-based in-context learning as an algorithm learning problem from a statistical perspective
  • Presented generalization bounds for MTL
  • Analyzed generalization bounds for two scenarios
  • Discussed generalization guarantees on new tasks
  • Numerical simulations support hypothesis
  • In-context learning can select correct hypothesis set
  • Questions related to findings: multitask learning risk, dynamical systems, transfer learning

Appendix A: Stability of transformer-based ICL

  • TF(S) can attain ±
  • TF(S) is properly normalized
  • Stability guarantee is provided
  • After each layer, both sequences being compared have $\|\cdot\|_{2,\infty}$ norm at most 1
  • After each layer, |TF(S ) − TF(S)| ≤ H 2,∞ S () − S () 2,1
  • Alg is the minimizer of empirical risk
  • |L (Alg) − L (Alg)| is bounded with probability at least 1 - 4
  • With probability at least 1 - 4, max ≤
  • With probability at least 1 - 4, sup Alg∈A |(Alg)| ≤
  • $L_{\mathcal{T}}(\mathrm{Alg}) \le L_{\mathrm{MTL}}(\mathrm{Alg}) + \mathrm{dist}(\mathcal{T}, (D_t)_{t=1}^{T})$
  • T ( Alg) ≤ MTL ( Alg) + 2