Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large pretrained language models have surprising In-Context Learning ability.
- ICL works by producing meta-gradients from demonstration examples.
- ICL behaves similarly to explicit finetuning at the prediction, representation, and attention behavior levels.
- Momentum-based attention is designed based on meta-optimization understanding.
Paper Content
Introduction
- Large pretrained language models have strong emergent In-Context Learning (ICL) ability
- ICL needs demonstration examples prepended before the original input
- GPT models can exceed smaller models with supervised finetuning
- ICL is explained as a process of meta-optimization
- ICL is similar to explicit finetuning at multiple levels
- Momentum-based attention achieves consistent performance improvements
Background
- GPT model is used for In-Context Learning (ICL) for classification tasks
- GPT model is stacked with L identical Transformers
- Final answer is selected from candidate answer set with highest probability
- Pre-defined template is used to format demonstrations before query input
- Contextual model input is organized and fed into GPT model
- Probability of answer is computed using output hidden state, word embedding and logit
Dual form between gradient descent based optimization and attention
- Idea in paper is inspired by Aizerman et al. (1964) and Irie et al. (2022).
- Linear layers optimized by gradient descent have a dual form of linear attention.
- Back-propagation algorithm computes โW by accumulating outer products of historic input representations and gradients of their corresponding outputs.
- Dual form between gradient descent based optimization and linear attention is derived.
In-context learning (icl) performs implicit finetuning
- Analyzed Transformer attention under relaxed linear attention form
- Compared ICL with explicit finetuning
- Proposed to understand ICL as a kind of implicit finetuning
Transformer attention as meta-optimization
- Input representation of a query token is denoted by x โ R d
- Attention query vector is denoted by q = W Q x โ R d
- Standard attention is approximated to a relaxed linear attention by removing softmax operation and scaling factor
- W ZSL = W V X (W K X) T is defined as initialized parameters to be updated
- Attention to demonstration tokens is equivalent to parameter updates โW ICL
- ICL is explained as a process of meta-optimization
Comparing icl with finetuning
- ICL and explicit optimization are compared
- Finetuning setting is designed as a baseline for comparison
- Finetuning and ICL have common properties
- Both perform gradient descent
- Both aim at attention
- ICL is understood as a kind of implicit finetuning
Experiments
Experimental settings
- Two GPT-like pretrained language models with 1.3B and 2.7B model parameters were used in experiments.
- Experiments were conducted on NVIDIA V100 GPUs with 32 GB memory.
- The same template was used for Zero-Shot Learning (ZSL), ICL, and finetuning.
- For ICL, 32 demonstration examples were used and the random seed was tuned for each task.
- For finetuning, the same demonstration examples were used and SGD was the optimizer.
- The learning rate was tuned for finetuning.
Metrics
- Three metrics measure the similarity between ICL and finetuning at three different levels
- Rec2FTP score measures how much behavior of finetuning ICL can cover
- SimAOU score measures the similarity between the effects of ICL and finetuning on ZSL
- SimAM score measures the similarity of the attention behavior of ICL and finetuning
Results
- Validation accuracy of ZSL, ICL, and finetuning is improved compared to ZSL
- ICL is better at few-shot scenarios than finetuning
- ICL updates are more similar to finetuning updates than random updates
- ICL attention weights are more similar to finetuning attention weights than ZSL attention weights
- ICL behaves similarly to finetuning at prediction, representation, and attention behavior levels
- Introducing momentum into attention improves language modeling and in-context learning performance
Conclusion
- Aim to explain working mechanism of GPT-based ICL
- Dual form of ICL proposed
- Understand ICL as a process of meta-optimization
- Connections between ICL and finetuning
- Compare behavior of ICL and finetuning
- Momentum-based attention designed
- Momentum-based attention achieves performance improvements