Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models perform better with increased scale
In-context learning paradigm is used
Investigated hypothesis that ability of large language model to in-context learn is not uniformly spread
66 billion parameter language model used across 14 tasks
70% of attention heads and 20% of feed forward networks can be removed with minimal decline in performance
Overlap in set of attention heads important for in-context learning across tasks and number of examples

Paper Content

Introduction

LLMs based on Transformer architecture have revolutionized NLP
Zero/few-shot incontext learning paradigm is used
Question: Are all LLM components needed to perform in-context learning?
Task-specific importance scores and structured pruning used to answer question
Up to 70% of attention heads can be removed with minimal decline in performance
Attention heads important for in-context learning overlap across tasks and shots
Up to 20% of FFNs can be removed with minimal decline in performance
Primitive operations associated with in-context learning quantified
Small set of heads have nontrivial scores for primitive operations
Heads overlap with ones identified to be important for in-context learning

Background & methods

OPT model used for experiments
Background on in-context learning and mathematical formulation of induction heads
Adaptation of oracle and gradient-based importance score formulations for in-context learning

Open pre-trained transformer (opt)

OPT is a suite of language models of varying sizes, the largest being OPT-66B with 66 billion parameters.
OPT-66B was pre-trained with a maximum sequence length of 2048 and embedding dimension d e = 9216.
OPT-66B has multiple decoder layers consisting of multi-headed attention (MHA) blocks, layer norm (LN) and feed forward networks (FFN).
There are 72 attention heads of dimension d h = 128 in every layer.
There are 4608 attention heads across 64 layers in OPT-66B, constituting 21.7B of the total 66B parameters.

In-context learning & induction heads

In-context learning is a new paradigm of learning for language models
In-context learning involves generating output text based on a few or zero training examples
Few-shot in-context learning can involve explicit primitive interactions between the in-context examples
Olsson et al. (2022) developed a mathematical framework to better understand the mechanics of in-context learning
Primitive operations such as prefix matching and copying are used to measure the capacity of attention heads to independently perform in-context learning

Experimental setup

Experiments performed on 14 NLP datasets/tasks
Evaluation metric is accuracy
Datasets include ARC Easy and Challenge, OpenBookQA, HellaSwag, PIQA, Winogrande, BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC
For out-of-distribution generalization, MathQA and LAMBADA datasets used
Eleuther AI’s lmevaluation-harness framework used for experiments

Importance scores for opt-66b

OPT-66B uses importance scores for attention heads and feed forward networks
OPT-66B is used for zero-shot and fewshot (1-shot and 5-shot) in-context learning

Attention heads

Heatmaps of head importance scores are shown in Figure 3.
Important attention heads are clustered in intermediate layers of OPT-66B.
There is some overlap in important attention heads across zero/few-shot settings.

Feed forward networks

Oracle importance scores are computed for each of the 64 FFNs in OPT-66B
In the zero-shot and one-shot settings, the removal of any FFN in the early layers of OPT-66B either gives comparable or better performance for a vast majority of tasks
In the five-shot setting, both the early and later layers seem to have important FFNs for most tasks
High variance in FFN importance scores in later layers, with absolute accuracy improvements/degradation of up to 20%

Iterative pruning

Observations in previous section showed uneven ability to in-context learn-perform a task in OPT-66B
Assessing how much task performance declines when removing multiple attention heads and FFNs

Removing attention heads

Sorted attention heads in ascending order by importance score
Removed 10% of attention heads at a time and re-evaluated task performance
Average accuracy across tasks does not change much until 70% of attention heads are removed
Some tasks show oddities such as zero-shot accuracy increasing after 70% of attention heads are removed

Removing ffns

Sort FFNs in ascending order by importance score
Remove 10% of FFNs at a time and re-evaluate task performance
Average accuracy across tasks does not change until 20% of FFNs are removed
Inflection point after which sharp decline in accuracy changes to 10% for few-shot settings
FFNs play critical role toward in-context learning

Combined removal of heads & ffns

Investigated whether inflection points to in-context learning performance still hold when removing attention heads and FFNs in tandem
Removing 70% of attention heads and 20% of FFNs leads to 5% absolute drop in zero-shot accuracy
Removing 70% of attention heads and 10% of FFNs leads to 6% absolute drop in one-shot accuracy
Removing 60% of attention heads and 20% of FFNs leads to 4% absolute drop in five-shot accuracy
Deviation of inflection points by at most 10% absolute due to interplay between heads and FFNs

Detailed analysis of attention heads

Analysis of attention heads in OPT-66B
Cross-task analysis to understand shared (un)important attention heads
Cross-shot analysis to study (un)important attention heads in zero-shot and few-shot settings
Quantifying capacity of attention heads to perform task-agnostic induction operations

Cross-task analysis

SRCC is used to measure overlap in (un)important attention heads across tasks
SRCC is positive for every pair of tasks in the zero-shot and few-shot settings
SRCC is lower between every task and ReCoRD, a long reading comprehension task
Pruning using the rankings described has almost no effect on accuracy up to the 50% mark
Sharp decline in accuracy on COPA and Winogrande when pruned to the 70% mark using ReCoRD ranking
71% and 76% overlap between the top 30% important attention heads for ReCoRD-COPA and ReCoRD-Winogrande respectively
Accuracy on ReCoRD at the 70% pruning mark is better using the aggregate ranking than using the ranking for ReCoRD itself
Self-ranking accuracy curves are higher than the aggregate ranking accuracy curves for MathQA

Cross-shot analysis

Spearman’s rank correlation coefficient (SRCC) was used to measure the importance of attention heads across zero and few-shot settings.
SRCC was higher for rankings within the few-shot setting than for rankings across the zero and few-shot settings, indicating non-trivial overlap in the (un)important attention heads for tasks across shots.

Induction heads in opt-66b

Quantified capacity of attention heads to perform prefix matching and copying
Small subset of attention heads in OPT-66B have high prefix matching scores
Relatively larger number of attention heads with high copying scores
Induction heads overlap with task-aggregated important attention heads

Interest in leveraging in-context learning paradigm
Analyzing and interpreting how attention works
Different formulations for head importance
Trend of increasing model scale
Focus on pre-training loss curve in scaling laws
Empirical observations rely on simple greedy approach to training-free pruning
Greedy approach is sub-optimal and produces under-estimates
Need to re-compute importance scores after removal of each attention head or FFN

Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale

Link to paper

Abstract

Paper Content

Introduction

Background & methods

Open pre-trained transformer (opt)

In-context learning & induction heads

Experimental setup

Importance scores for opt-66b

Attention heads

Feed forward networks

Iterative pruning

Removing attention heads

Removing ffns

Combined removal of heads & ffns

Detailed analysis of attention heads

Cross-task analysis

Cross-shot analysis

Induction heads in opt-66b

Conclusion & future work

Link to paper#

Abstract#

Paper Content#

Introduction#

Background & methods#

Open pre-trained transformer (opt)#

In-context learning & induction heads#

Experimental setup#

Importance scores for opt-66b#

Attention heads#

Feed forward networks#

Iterative pruning#

Removing attention heads#

Removing ffns#

Combined removal of heads & ffns#

Detailed analysis of attention heads#

Cross-task analysis#

Cross-shot analysis#

Induction heads in opt-66b#

Related work#

Conclusion & future work#

Link to paper

Abstract

Paper Content

Introduction

Background & methods

Open pre-trained transformer (opt)

In-context learning & induction heads

Experimental setup

Importance scores for opt-66b

Attention heads

Feed forward networks

Iterative pruning

Removing attention heads

Removing ffns

Combined removal of heads & ffns

Detailed analysis of attention heads

Cross-task analysis

Cross-shot analysis

Induction heads in opt-66b

Related work

Conclusion & future work