Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large pretrained language models have surprising In-Context Learning ability.
  • ICL works by producing meta-gradients from demonstration examples.
  • ICL behaves similarly to explicit finetuning at the prediction, representation, and attention behavior levels.
  • Momentum-based attention is designed based on meta-optimization understanding.

Paper Content

Introduction

  • Large pretrained language models have strong emergent In-Context Learning (ICL) ability
  • ICL needs demonstration examples prepended before the original input
  • GPT models can exceed smaller models with supervised finetuning
  • ICL is explained as a process of meta-optimization
  • ICL is similar to explicit finetuning at multiple levels
  • Momentum-based attention achieves consistent performance improvements

Background

  • GPT model is used for In-Context Learning (ICL) for classification tasks
  • GPT model is stacked with L identical Transformers
  • Final answer is selected from candidate answer set with highest probability
  • Pre-defined template is used to format demonstrations before query input
  • Contextual model input is organized and fed into GPT model
  • Probability of answer is computed using output hidden state, word embedding and logit

Dual form between gradient descent based optimization and attention

  • Idea in paper is inspired by Aizerman et al. (1964) and Irie et al. (2022).
  • Linear layers optimized by gradient descent have a dual form of linear attention.
  • Back-propagation algorithm computes โˆ†W by accumulating outer products of historic input representations and gradients of their corresponding outputs.
  • Dual form between gradient descent based optimization and linear attention is derived.

In-context learning (icl) performs implicit finetuning

  • Analyzed Transformer attention under relaxed linear attention form
  • Compared ICL with explicit finetuning
  • Proposed to understand ICL as a kind of implicit finetuning

Transformer attention as meta-optimization

  • Input representation of a query token is denoted by x โˆˆ R d
  • Attention query vector is denoted by q = W Q x โˆˆ R d
  • Standard attention is approximated to a relaxed linear attention by removing softmax operation and scaling factor
  • W ZSL = W V X (W K X) T is defined as initialized parameters to be updated
  • Attention to demonstration tokens is equivalent to parameter updates โˆ†W ICL
  • ICL is explained as a process of meta-optimization

Comparing icl with finetuning

  • ICL and explicit optimization are compared
  • Finetuning setting is designed as a baseline for comparison
  • Finetuning and ICL have common properties
  • Both perform gradient descent
  • Both aim at attention
  • ICL is understood as a kind of implicit finetuning

Experiments

Experimental settings

  • Two GPT-like pretrained language models with 1.3B and 2.7B model parameters were used in experiments.
  • Experiments were conducted on NVIDIA V100 GPUs with 32 GB memory.
  • The same template was used for Zero-Shot Learning (ZSL), ICL, and finetuning.
  • For ICL, 32 demonstration examples were used and the random seed was tuned for each task.
  • For finetuning, the same demonstration examples were used and SGD was the optimizer.
  • The learning rate was tuned for finetuning.

Metrics

  • Three metrics measure the similarity between ICL and finetuning at three different levels
  • Rec2FTP score measures how much behavior of finetuning ICL can cover
  • SimAOU score measures the similarity between the effects of ICL and finetuning on ZSL
  • SimAM score measures the similarity of the attention behavior of ICL and finetuning

Results

  • Validation accuracy of ZSL, ICL, and finetuning is improved compared to ZSL
  • ICL is better at few-shot scenarios than finetuning
  • ICL updates are more similar to finetuning updates than random updates
  • ICL attention weights are more similar to finetuning attention weights than ZSL attention weights
  • ICL behaves similarly to finetuning at prediction, representation, and attention behavior levels
  • Introducing momentum into attention improves language modeling and in-context learning performance

Conclusion

  • Aim to explain working mechanism of GPT-based ICL
  • Dual form of ICL proposed
  • Understand ICL as a process of meta-optimization
  • Connections between ICL and finetuning
  • Compare behavior of ICL and finetuning
  • Momentum-based attention designed
  • Momentum-based attention achieves performance improvements