Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly.
- ICL is an algorithm learning problem, treating the transformer model as a learning algorithm that can be specialized via training.
- Multitask learning is used to obtain generalization bounds for ICL.
- Transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
Paper Content
Problem setup
- Let X be the input feature space and Y be the output/label space
- Use boldface for vector variables
- [] denotes a set of numbers
- ℓ denotes the ℓ-norm
- Prompts are obtained from a sequence of i.i.d. (input, label) pairs or a trajectory of a dynamical system
- Model aims to predict the next output ŷ
- Training phase of ICL involves multitask learning
- Training sequences S = (z ) =1 are drawn from a data distribution
- Model is abstracted as a learning algorithm that maps a sequence of data to a prediction function
- Training objective is formulated as searching for the optimal algorithm Alg ∈ A
- Task-specific loss L (Alg) is an empirical average of terms
- Primary interest is controlling the gap between empirical and population risks
- Evaluate the model on previously-unseen tasks (transfer learning)
Generalization guarantees for in-context learning
- Study of i.i.d. data setting with training sequences
- Training example (x, y) in the prompt impacts all future decisions of the algorithm
- Stability condition borrowed from algorithmic stability literature
- Theorem 1 shows that a multilayer transformer obeys the stability condition
- Definition 1 and 2 introduce covering numbers and algorithm distance
- Theorem 2 establishes generalization bounds for ICL-ERM
- Excess MTL test risk bounded by 1/√n
- Excess risk vanishes as n and m → ∞
- Multiple sequences per task improves statistical error rate to 1/√m
- MTL vs transfer learning contrasted by letting n → ∞
Generalization and inductive bias on unseen tasks
- Explore transfer learning to assess performance of in-context learning
- Consider standard meta-learning setup with source tasks drawn from task distribution
- Evaluate transfer risk in terms of MTL risk
- Transfer risk decays as 1/poly()
- Rather than model complexity, problem complexity matters
- Optimal prediction in terms of average risk can be described explicitly
- Need ∝ log() samples for MTL algorithm to perform well on new prompts
- Performance improves as number of tasks gets smaller
- Transfer risk can be bounded in terms of distance of target task to source tasks
- Task similarity is predictive of transfer performance
Insights on model selection
- ICL can reduce generalization error by increasing sample size or number of sequences per task.
- Transformer can implement ERM algorithms up to a certain accuracy.
- Model selection can be formalized by selecting the right hypothesis class to strike a good bias-variance tradeoff.
- Hypothesis 1 states that transformer can implement an algorithm competitive with ERM.
- There is a minimum achievable risk over a function set.
- ICL adaptively selects classes to achieve small risk.
- VC-dimension of the algorithm class is as large as log.
Extending results to stable dynamical systems
Numerical evaluations
- Linear regression experiments with in-context examples
- Results displayed in Figures 2(a) & (b)
- In-context learning outperforms least squares results
- Automated model selection ability of transformers
- Stability analysis of four function classes
- Risk change decreases as in-context sample size increases
Conclusions and future directions
- Approach prompt-based in-context learning as an algorithm learning problem from a statistical perspective
- Presented generalization bounds for MTL
- Analyzed generalization bounds for two scenarios
- Discussed generalization guarantees on new tasks
- Numerical simulations support hypothesis
- In-context learning can select correct hypothesis set
- Questions related to findings: multitask learning risk, dynamical systems, transfer learning
A stability of transformer-based icl
- TF(S) can attain ±
- TF(S) is properly normalized
- Stability guarantee is provided
- After each layer, S 2,∞ and S 2,∞ ≤ 1
- After each layer, |TF(S ) − TF(S)| ≤ H 2,∞ S () − S () 2,1
- Alg is the minimizer of empirical risk
- |L (Alg) − L (Alg)| is bounded with probability at least 1 - 4
- With probability at least 1 - 4, max ≤
- With probability at least 1 - 4, sup Alg∈A |(Alg)| ≤
- L T (Alg) ≤ L MTL (Alg) + dist(T , (D ) =1 )
- T ( Alg) ≤ MTL ( Alg) + 2