Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large pretrained language models have surprising In-Context Learning ability.
ICL works by producing meta-gradients from demonstration examples.
ICL behaves similarly to explicit finetuning at the prediction, representation, and attention behavior levels.
Momentum-based attention is designed based on meta-optimization understanding.

Paper Content

Introduction

Large pretrained language models have strong emergent In-Context Learning (ICL) ability
ICL needs demonstration examples prepended before the original input
GPT models can exceed smaller models with supervised finetuning
ICL is explained as a process of meta-optimization
ICL is similar to explicit finetuning at multiple levels
Momentum-based attention achieves consistent performance improvements

Background

GPT model is used for In-Context Learning (ICL) for classification tasks
GPT model is stacked with L identical Transformers
Final answer is selected from candidate answer set with highest probability
Pre-defined template is used to format demonstrations before query input
Contextual model input is organized and fed into GPT model
Probability of answer is computed using output hidden state, word embedding and logit

Dual form between gradient descent based optimization and attention

Idea in paper is inspired by Aizerman et al. (1964) and Irie et al. (2022).
Linear layers optimized by gradient descent have a dual form of linear attention.
Back-propagation algorithm computes ∆W by accumulating outer products of historic input representations and gradients of their corresponding outputs.
Dual form between gradient descent based optimization and linear attention is derived.

In-context learning (icl) performs implicit finetuning

Analyzed Transformer attention under relaxed linear attention form
Compared ICL with explicit finetuning
Proposed to understand ICL as a kind of implicit finetuning

Transformer attention as meta-optimization

Input representation of a query token is denoted by x ∈ R d
Attention query vector is denoted by q = W Q x ∈ R d
Standard attention is approximated to a relaxed linear attention by removing softmax operation and scaling factor
W ZSL = W V X (W K X) T is defined as initialized parameters to be updated
Attention to demonstration tokens is equivalent to parameter updates ∆W ICL
ICL is explained as a process of meta-optimization

Comparing icl with finetuning

ICL and explicit optimization are compared
Finetuning setting is designed as a baseline for comparison
Finetuning and ICL have common properties
Both perform gradient descent
Both aim at attention
ICL is understood as a kind of implicit finetuning

Experiments

Experimental settings

Two GPT-like pretrained language models with 1.3B and 2.7B model parameters were used in experiments.
Experiments were conducted on NVIDIA V100 GPUs with 32 GB memory.
The same template was used for Zero-Shot Learning (ZSL), ICL, and finetuning.
For ICL, 32 demonstration examples were used and the random seed was tuned for each task.
For finetuning, the same demonstration examples were used and SGD was the optimizer.
The learning rate was tuned for finetuning.

Metrics

Three metrics measure the similarity between ICL and finetuning at three different levels
Rec2FTP score measures how much behavior of finetuning ICL can cover
SimAOU score measures the similarity between the effects of ICL and finetuning on ZSL
SimAM score measures the similarity of the attention behavior of ICL and finetuning

Results

Validation accuracy of ZSL, ICL, and finetuning is improved compared to ZSL
ICL is better at few-shot scenarios than finetuning
ICL updates are more similar to finetuning updates than random updates
ICL attention weights are more similar to finetuning attention weights than ZSL attention weights
ICL behaves similarly to finetuning at prediction, representation, and attention behavior levels
Introducing momentum into attention improves language modeling and in-context learning performance

Conclusion

Aim to explain working mechanism of GPT-based ICL
Dual form of ICL proposed
Understand ICL as a process of meta-optimization
Connections between ICL and finetuning
Compare behavior of ICL and finetuning
Momentum-based attention designed
Momentum-based attention achieves performance improvements

Link to paper#

Abstract#

Paper Content#

Introduction#

Background#

Dual form between gradient descent based optimization and attention#

In-context learning (icl) performs implicit finetuning#

Transformer attention as meta-optimization#

Comparing icl with finetuning#

Experiments#

Experimental settings#

Metrics#

Results#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Background

Dual form between gradient descent based optimization and attention

In-context learning (icl) performs implicit finetuning

Transformer attention as meta-optimization

Comparing icl with finetuning

Experiments

Experimental settings

Metrics

Results

Conclusion