Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs are able to make predictions based on contexts and a few training examples
This paper surveys and summarizes the progress, challenges, and future work in ICL
It provides a formal definition of ICL and clarifies its correlation to related studies
It discusses advanced techniques of ICL, including training strategies and prompting strategies
It presents the challenges of ICL and provides potential directions for further research

Paper Content

Introduction

Large language models (LLMs) can perform complex tasks with in-context learning (ICL)
ICL requires a few examples to form a demonstration context, usually written in natural language templates
ICL does not require parameter updates and directly performs predictions on the pretrained language models
ICL provides an interpretable interface to communicate with large language models
ICL is similar to the decision process of human beings by learning from analogy
ICL is a training-free learning framework
ICL has multiple attractive advantages, including being interpretable and training-free

Overview

ICL relies on two stages: training and inference
Training stage involves language models being trained on language modeling objectives
Inference stage involves input and output labels being represented in natural language templates
Taxonomy of ICL is shown in Fig. 2
Definition of ICL: language models learn tasks given only a few examples in the form of demonstration
ICL compared to Prompt Learning and Fewshot Learning

Model warmup

LLMs have promising ICL capability
Warmup is an optional procedure to improve ICL capability of LLMs

Supervised in-context training

Researchers proposed supervised in-context finetuning strategies to enhance ICL capability.
MetaICL eliminates the gap between pretraining and downstream ICL usage.
FLAN improves zero-shot and few-shot ICL performance.
Instruction tuning mainly considers an explanation of the task and is easy to scale up.

Self-supervised in-context training

Prompt designing

Performance of ICL depends on the demonstration surface
Prompt format, order of demonstration examples are important
Demonstrations play a vital role in ICL
Two groups of prompt designing strategies: demonstration organization and demonstration formatting

Demonstration organization

Demonstration organization focuses on selecting and sorting examples
Unsupervised methods use pre-defined metrics like L2 distance or cosine-similarity distance
Supervised methods use two-stage retrieval and reinforcement learning
Previous studies have proposed training-free methods to sort examples

Demonstration formatting

Common way to format demonstrations is concatenating examples with a specific template
Tasks that need complex reasoning are difficult to learn from only a few demonstrations
Researchers aim to design better demonstrations by adding intermediate reasoning steps and instructions

Scoring function

Direct estimation method uses conditional probability of candidate answers
Perplexity metric computes sentence perplexity of whole input sequence
Channel model estimates input query given label
Normalizing score by subtracting model-dependent prior with empty inputs can improve performance

Analysis

Analytical studies attempt to understand ICL
Factors that influence ICL performance are summarized in Table 3

What influences icl performance

Pretraining corpus domain is more important than corpus size
Combining multiple corpora may give rise to emergent ICL ability
Pretraining on corpora related to downstream tasks does not always improve ICL performance
Models with lower perplexity do not always perform better in ICL scenarios
Number of model parameters and pretraining steps influence emergent ICL abilities
Influence of demonstration samples come from four aspects: input-label pairing format, label space, input distribution, and input-label mapping
Demonstration sample order is an important factor
Demonstration samples with closer embeddings to query samples bring better performance
ICL explained as implicit Bayesian inference
Transformers can encode effective learning algorithms to learn unseen linear functions
ICL and finetuning related
Induction heads constitute the mechanism of ICL

Evaluation and resources

Traditional tasks

ICL can be tested on traditional datasets and benchmarks
GPT-3 can achieve results comparable to SOTA finetuning performance on COPA and ReCoRD
Scaling up the number of demonstration examples has limited improvement
ICL still has room to improve on traditional NLP tasks

New challenging tasks

Researchers are interested in evaluating the intrinsic capabilities of large language models without downstream task finetuning
BIG-Bench is a large benchmark covering a range of tasks
Models have outperformed human-rater results on 65% of BIG-Bench tasks
BIG-Bench Hard includes 23 unsolved tasks
Researchers are searching for tasks where model performance reduces when scaling up the model size
OPT-IML Bench consists of 2000 NLP tasks from 8 existing benchmarks
Studies focus on exploring the reasoning ability of ICL

Application

ICL has been shown to be effective for traditional NLP tasks and methods.
ICL has been used for tasks that require complexity reasoning and compositional generalization.
ICL has been used for meta-learning, where model parameters are frozen.
ICL has been used for lightweight model editing.
ICL has potential to be widely applied in data engineering.

New pretraining strategies

Language model objectives are not equal to ICL abilities.
Intermediate tuning before inference can improve performance.
Tailored pretraining objectives and metrics can raise LLMs with better ICL capabilities.

Distill the icl ability to smaller models

Previous studies have shown that in-context learning for reasoning tasks is possible with large models.
Magister et al. (2022) showed that it is possible to transfer the reasoning ability to smaller models.
Further investigation is needed to improve the reasoning ability by learning from larger models.

Knowledge augmentation and updating

LLMs may lack certain knowledge and generate hallucinations during ICL inference.
Retrieving correct knowledge and integrating it with the context in a lightweight manner is promising for ICL.
Knowledge in LLMs could be wrong or out-of-date.
In-context knowledge updating can update answers around 85% of the time.

Robustness to demonstration

Previous studies have shown ICL performance is unstable
ICL can be sensitive to many factors
Improving ICL robustness is challenging
Existing methods have accuracy/robustness dilemma or sacrifice inference efficiency
Need deeper analysis of ICL working mechanism from theoretical perspective

Icl for data engineering

ICL generates high-quality data at a low cost
ICL takes only a few examples to learn the data engineering objective
LLM has strong reasoning and text-generating ability
Improving ICL for data annotation is a direction with practical value

Conclusion

Review of advanced ICL techniques
Challenges and potential directions for future research
Training and inference stages for ICL
Training data needs to have examples appearing in clusters and enough rare classes
Chain-of-thought prompting for complex reasoning
Demonstration selection and CoT designing to improve CoT prompting ability
Scoring functions for ICL

Link to paper#

Abstract#

Paper Content#

Introduction#

Overview#

Model warmup#

Supervised in-context training#

Self-supervised in-context training#

Prompt designing#

Demonstration organization#

Demonstration formatting#

Scoring function#

Analysis#

What influences icl performance#

Evaluation and resources#

Traditional tasks#

New challenging tasks#

Application#

New pretraining strategies#

Distill the icl ability to smaller models#

Knowledge augmentation and updating#

Robustness to demonstration#

Icl for data engineering#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Overview

Model warmup

Supervised in-context training

Self-supervised in-context training

Prompt designing

Demonstration organization

Demonstration formatting

Scoring function

Analysis

What influences icl performance

Evaluation and resources

Traditional tasks

New challenging tasks

Application

New pretraining strategies

Distill the icl ability to smaller models

Knowledge augmentation and updating

Robustness to demonstration

Icl for data engineering

Conclusion