Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs are able to make predictions based on contexts and a few training examples
- This paper surveys and summarizes the progress, challenges, and future work in ICL
- It provides a formal definition of ICL and clarifies its correlation to related studies
- It discusses advanced techniques of ICL, including training strategies and prompting strategies
- It presents the challenges of ICL and provides potential directions for further research
Paper Content
Introduction
- Large language models (LLMs) can perform complex tasks with in-context learning (ICL)
- ICL requires a few examples to form a demonstration context, usually written in natural language templates
- ICL does not require parameter updates and directly performs predictions on the pretrained language models
- ICL provides an interpretable interface to communicate with large language models
- ICL is similar to the decision process of human beings by learning from analogy
- ICL is a training-free learning framework
- ICL has multiple attractive advantages, including being interpretable and training-free
Overview
- ICL relies on two stages: training and inference
- Training stage involves language models being trained on language modeling objectives
- Inference stage involves input and output labels being represented in natural language templates
- Taxonomy of ICL is shown in Fig. 2
- Definition of ICL: language models learn tasks given only a few examples in the form of demonstration
- ICL compared to Prompt Learning and Fewshot Learning
Model warmup
- LLMs have promising ICL capability
- Warmup is an optional procedure to improve ICL capability of LLMs
Supervised in-context training
- Researchers proposed supervised in-context finetuning strategies to enhance ICL capability.
- MetaICL eliminates the gap between pretraining and downstream ICL usage.
- FLAN improves zero-shot and few-shot ICL performance.
- Instruction tuning mainly considers an explanation of the task and is easy to scale up.
Self-supervised in-context training
Prompt designing
- Performance of ICL depends on the demonstration surface
- Prompt format, order of demonstration examples are important
- Demonstrations play a vital role in ICL
- Two groups of prompt designing strategies: demonstration organization and demonstration formatting
Demonstration organization
- Demonstration organization focuses on selecting and sorting examples
- Unsupervised methods use pre-defined metrics like L2 distance or cosine-similarity distance
- Supervised methods use two-stage retrieval and reinforcement learning
- Previous studies have proposed training-free methods to sort examples
Demonstration formatting
- Common way to format demonstrations is concatenating examples with a specific template
- Tasks that need complex reasoning are difficult to learn from only a few demonstrations
- Researchers aim to design better demonstrations by adding intermediate reasoning steps and instructions
Scoring function
- Direct estimation method uses conditional probability of candidate answers
- Perplexity metric computes sentence perplexity of whole input sequence
- Channel model estimates input query given label
- Normalizing score by subtracting model-dependent prior with empty inputs can improve performance
Analysis
- Analytical studies attempt to understand ICL
- Factors that influence ICL performance are summarized in Table 3
What influences icl performance
- Pretraining corpus domain is more important than corpus size
- Combining multiple corpora may give rise to emergent ICL ability
- Pretraining on corpora related to downstream tasks does not always improve ICL performance
- Models with lower perplexity do not always perform better in ICL scenarios
- Number of model parameters and pretraining steps influence emergent ICL abilities
- Influence of demonstration samples come from four aspects: input-label pairing format, label space, input distribution, and input-label mapping
- Demonstration sample order is an important factor
- Demonstration samples with closer embeddings to query samples bring better performance
- ICL explained as implicit Bayesian inference
- Transformers can encode effective learning algorithms to learn unseen linear functions
- ICL and finetuning related
- Induction heads constitute the mechanism of ICL
Evaluation and resources
Traditional tasks
- ICL can be tested on traditional datasets and benchmarks
- GPT-3 can achieve results comparable to SOTA finetuning performance on COPA and ReCoRD
- Scaling up the number of demonstration examples has limited improvement
- ICL still has room to improve on traditional NLP tasks
New challenging tasks
- Researchers are interested in evaluating the intrinsic capabilities of large language models without downstream task finetuning
- BIG-Bench is a large benchmark covering a range of tasks
- Models have outperformed human-rater results on 65% of BIG-Bench tasks
- BIG-Bench Hard includes 23 unsolved tasks
- Researchers are searching for tasks where model performance reduces when scaling up the model size
- OPT-IML Bench consists of 2000 NLP tasks from 8 existing benchmarks
- Studies focus on exploring the reasoning ability of ICL
Application
- ICL has been shown to be effective for traditional NLP tasks and methods.
- ICL has been used for tasks that require complexity reasoning and compositional generalization.
- ICL has been used for meta-learning, where model parameters are frozen.
- ICL has been used for lightweight model editing.
- ICL has potential to be widely applied in data engineering.
New pretraining strategies
- Language model objectives are not equal to ICL abilities.
- Intermediate tuning before inference can improve performance.
- Tailored pretraining objectives and metrics can raise LLMs with better ICL capabilities.
Distill the icl ability to smaller models
- Previous studies have shown that in-context learning for reasoning tasks is possible with large models.
- Magister et al. (2022) showed that it is possible to transfer the reasoning ability to smaller models.
- Further investigation is needed to improve the reasoning ability by learning from larger models.
Knowledge augmentation and updating
- LLMs may lack certain knowledge and generate hallucinations during ICL inference.
- Retrieving correct knowledge and integrating it with the context in a lightweight manner is promising for ICL.
- Knowledge in LLMs could be wrong or out-of-date.
- In-context knowledge updating can update answers around 85% of the time.
Robustness to demonstration
- Previous studies have shown ICL performance is unstable
- ICL can be sensitive to many factors
- Improving ICL robustness is challenging
- Existing methods have accuracy/robustness dilemma or sacrifice inference efficiency
- Need deeper analysis of ICL working mechanism from theoretical perspective
Icl for data engineering
- ICL generates high-quality data at a low cost
- ICL takes only a few examples to learn the data engineering objective
- LLM has strong reasoning and text-generating ability
- Improving ICL for data annotation is a direction with practical value
Conclusion
- Review of advanced ICL techniques
- Challenges and potential directions for future research
- Training and inference stages for ICL
- Training data needs to have examples appearing in clusters and enough rare classes
- Chain-of-thought prompting for complex reasoning
- Demonstration selection and CoT designing to improve CoT prompting ability
- Scoring functions for ICL