Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Language models perform better with increased scale
  • In-context learning paradigm is used
  • Investigated the hypothesis that a large language model's ability to in-context learn is not uniformly spread across its components
  • 66 billion parameter language model used across 14 tasks
  • 70% of attention heads and 20% of feed forward networks can be removed with minimal decline in performance
  • Overlap in set of attention heads important for in-context learning across tasks and number of examples

Paper Content

Introduction

  • LLMs based on Transformer architecture have revolutionized NLP
  • Zero/few-shot in-context learning paradigm is used
  • Question: Are all LLM components needed to perform in-context learning?
  • Task-specific importance scores and structured pruning used to answer question
  • Up to 70% of attention heads can be removed with minimal decline in performance
  • Attention heads important for in-context learning overlap across tasks and shots
  • Up to 20% of FFNs can be removed with minimal decline in performance
  • Primitive operations associated with in-context learning quantified
  • Small set of heads have nontrivial scores for primitive operations
  • Heads overlap with ones identified to be important for in-context learning

Background & methods

  • OPT model used for experiments
  • Background on in-context learning and mathematical formulation of induction heads
  • Adaptation of oracle and gradient-based importance score formulations for in-context learning

Open pre-trained transformer (OPT)

  • OPT is a suite of language models of varying sizes, the largest being OPT-66B with 66 billion parameters.
  • OPT-66B was pre-trained with a maximum sequence length of 2048 and embedding dimension d_e = 9216.
  • OPT-66B has multiple decoder layers consisting of multi-headed attention (MHA) blocks, layer norm (LN) and feed forward networks (FFN).
  • There are 72 attention heads of dimension d_h = 128 in every layer.
  • There are 4608 attention heads across 64 layers in OPT-66B, constituting 21.7B of the total 66B parameters.
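
The head and parameter counts above can be checked with a few lines of arithmetic. This is only a sketch: it assumes each MHA block holds four d_e × d_e projection matrices (query, key, value, output) and ignores bias terms, a breakdown that the summary itself does not state.

```python
# Architecture constants for OPT-66B, as listed above.
NUM_LAYERS = 64
HEADS_PER_LAYER = 72
HEAD_DIM = 128    # d_h
EMBED_DIM = 9216  # d_e

total_heads = NUM_LAYERS * HEADS_PER_LAYER

# Assumption: each MHA block has query, key, value and output projections,
# each of shape d_e x d_e, and bias terms are ignored.
attention_params = NUM_LAYERS * 4 * EMBED_DIM * EMBED_DIM

print(total_heads)                       # 4608
print(HEADS_PER_LAYER * HEAD_DIM)        # 9216, i.e. d_e
print(round(attention_params / 1e9, 1))  # 21.7
```

Under these assumptions the per-head dimension times the head count recovers d_e, and the attention projections account for roughly 21.7B parameters.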

In-context learning & induction heads

  • In-context learning is a new paradigm of learning for language models
  • In-context learning involves generating output text conditioned on zero or a few examples supplied in the prompt, without updating model parameters
  • Few-shot in-context learning can involve explicit primitive interactions between the in-context examples
  • Olsson et al. (2022) developed a mathematical framework to better understand the mechanics of in-context learning
  • Primitive operations such as prefix matching and copying are used to measure the capacity of attention heads to independently perform in-context learning

Experimental setup

  • Experiments performed on 14 NLP datasets/tasks
  • Evaluation metric is accuracy
  • Datasets include ARC Easy and Challenge, OpenBookQA, HellaSwag, PIQA, Winogrande, BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC
  • For out-of-distribution generalization, MathQA and LAMBADA datasets used
  • EleutherAI’s lm-evaluation-harness framework used for experiments

Importance scores for OPT-66B

  • Importance scores are computed for the attention heads and feed forward networks of OPT-66B
  • Scores are computed in the zero-shot and few-shot (1-shot and 5-shot) in-context learning settings

Attention heads

  • Heatmaps of head importance scores are shown in Figure 3.
  • Important attention heads are clustered in intermediate layers of OPT-66B.
  • There is some overlap in important attention heads across zero/few-shot settings.

Feed forward networks

  • Oracle importance scores are computed for each of the 64 FFNs in OPT-66B
  • In the zero-shot and one-shot settings, the removal of any FFN in the early layers of OPT-66B either gives comparable or better performance for a vast majority of tasks
  • In the five-shot setting, both the early and later layers seem to have important FFNs for most tasks
  • High variance in FFN importance scores in later layers, with absolute accuracy improvements/degradation of up to 20%
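
An oracle importance score can be sketched as a knock-out experiment. The version below assumes the score is the accuracy of the intact model minus the accuracy with one component removed, so a negative score means removal helped. The `evaluate` callback, the component names, and the accuracy values are hypothetical stand-ins for a real evaluation run.

```python
def oracle_importance(evaluate, components):
    """Score each component as: accuracy with everything intact minus
    accuracy with that single component removed."""
    baseline = evaluate(frozenset())
    return {c: baseline - evaluate(frozenset([c])) for c in components}


# Hypothetical stand-in for a real evaluation run: a fixed lookup of
# made-up accuracies, used only to make the sketch executable.
TOY_ACCURACY = {
    frozenset(): 0.60,
    frozenset(["ffn_0"]): 0.62,  # removal helps
    frozenset(["ffn_1"]): 0.60,  # removal changes nothing
    frozenset(["ffn_2"]): 0.45,  # removal hurts
}


def toy_evaluate(removed):
    return TOY_ACCURACY[removed]


scores = oracle_importance(toy_evaluate, ["ffn_0", "ffn_1", "ffn_2"])
most_important = max(scores, key=scores.get)
```

With one evaluation per component this costs 64 extra runs for the FFNs, which is feasible; the same brute-force approach over thousands of attention heads would be far more expensive.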

Iterative pruning

  • Observations in the previous section showed that the ability to in-context learn-perform a task is not uniformly spread across the components of OPT-66B
  • Assessing how much task performance declines when removing multiple attention heads and FFNs

Removing attention heads

  • Sorted attention heads in ascending order by importance score
  • Removed 10% of attention heads at a time and re-evaluated task performance
  • Average accuracy across tasks does not change much until 70% of attention heads are removed
  • Some tasks show oddities such as zero-shot accuracy increasing after 70% of attention heads are removed
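
The pruning loop described above can be sketched as follows. This is an illustration under stated assumptions: `importance` maps hypothetical component names to precomputed scores, `evaluate` is a hypothetical callback that returns accuracy with a given set of components removed, and all numbers are made up.

```python
def pruning_curve(importance, evaluate, step=0.10):
    """Remove components from least to most important, `step` of the total
    at a time, and record accuracy after each round.

    Scores are computed once up front and not refreshed between rounds.
    """
    ranked = sorted(importance, key=importance.get)  # ascending importance
    per_round = max(1, round(step * len(ranked)))
    curve = [(0.0, evaluate(frozenset()))]
    for end in range(per_round, len(ranked) + 1, per_round):
        removed = frozenset(ranked[:end])
        curve.append((end / len(ranked), evaluate(removed)))
    return curve


# Hypothetical toy setup: 10 heads, of which only the 3 highest-scoring
# ones matter for accuracy. All numbers are made up for illustration.
importance = {f"head_{i}": float(i) for i in range(10)}
critical = {"head_7", "head_8", "head_9"}


def toy_evaluate(removed):
    return 0.5 + 0.1 * len(critical - removed)


curve = pruning_curve(importance, toy_evaluate)
```

In this toy setup accuracy stays flat until 70% of the heads are gone and then drops, which mirrors the shape of the inflection point reported above without reproducing any of the paper's numbers.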

Removing FFNs

  • Sort FFNs in ascending order by importance score
  • Remove 10% of FFNs at a time and re-evaluate task performance
  • Average accuracy across tasks does not change until 20% of FFNs are removed
  • In the few-shot settings, the inflection point after which accuracy declines sharply moves to 10%
  • FFNs play critical role toward in-context learning

Combined removal of heads & FFNs

  • Investigated whether inflection points to in-context learning performance still hold when removing attention heads and FFNs in tandem
  • Removing 70% of attention heads and 20% of FFNs leads to 5% absolute drop in zero-shot accuracy
  • Removing 70% of attention heads and 10% of FFNs leads to 6% absolute drop in one-shot accuracy
  • Removing 60% of attention heads and 20% of FFNs leads to 4% absolute drop in five-shot accuracy
  • Deviation of inflection points by at most 10% absolute due to interplay between heads and FFNs

Detailed analysis of attention heads

  • Analysis of attention heads in OPT-66B
  • Cross-task analysis to understand shared (un)important attention heads
  • Cross-shot analysis to study (un)important attention heads in zero-shot and few-shot settings
  • Quantifying capacity of attention heads to perform task-agnostic induction operations

Cross-task analysis

  • SRCC is used to measure overlap in (un)important attention heads across tasks
  • SRCC is positive for every pair of tasks in the zero-shot and few-shot settings
  • SRCC is lower between every task and ReCoRD, a long reading comprehension task
  • Pruning using the rankings described has almost no effect on accuracy up to the 50% mark
  • Sharp decline in accuracy on COPA and Winogrande when pruned to the 70% mark using ReCoRD ranking
  • 71% and 76% overlap between the top 30% important attention heads for ReCoRD-COPA and ReCoRD-Winogrande respectively
  • Accuracy on ReCoRD at the 70% pruning mark is better using the aggregate ranking than using the ranking for ReCoRD itself
  • Self-ranking accuracy curves are higher than the aggregate ranking accuracy curves for MathQA
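
The two overlap measures used above, Spearman's rank correlation coefficient between two sets of head importance scores and the overlap between their top 30% heads, can be sketched in plain Python. The helper names and the importance values for the two tasks are hypothetical and chosen only to make the example runnable.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_overlap(x, y, fraction=0.30):
    """Share of the top `fraction` heads under x that are also top under y."""
    k = max(1, round(fraction * len(x)))

    def top(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i],
                       reverse=True)
        return set(order[:k])

    return len(top(x) & top(y)) / k


# Made-up importance scores for 10 heads on two hypothetical tasks.
task_a = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7, 0.3, 0.05, 0.6, 0.5]
task_b = [0.8, 0.2, 0.3, 0.9, 0.1, 0.4, 0.5, 0.05, 0.7, 0.6]

rho = spearman(task_a, task_b)
overlap = top_overlap(task_a, task_b)
```

A positive `rho` says the two tasks broadly agree on which heads matter, while `overlap` looks only at the heads that survive aggressive pruning, so the two measures can disagree for a pair of tasks.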

Cross-shot analysis

  • Spearman’s rank correlation coefficient (SRCC) was used to compare attention head importance rankings across the zero-shot and few-shot settings.
  • SRCC was higher for rankings within the few-shot setting than for rankings across the zero and few-shot settings, indicating non-trivial overlap in the (un)important attention heads for tasks across shots.

Induction heads in OPT-66B

  • Quantified capacity of attention heads to perform prefix matching and copying
  • Small subset of attention heads in OPT-66B have high prefix matching scores
  • Relatively larger number of attention heads with high copying scores
  • Induction heads overlap with task-aggregated important attention heads

Related work

  • Interest in leveraging the in-context learning paradigm
  • Analyzing and interpreting how attention works
  • Different formulations for head importance
  • Trend of increasing model scale
  • Focus on pre-training loss curve in scaling laws

Limitations

  • Empirical observations rely on a simple greedy approach to training-free pruning
  • Greedy approach is sub-optimal, so the reported prunable fractions are under-estimates
  • A more exact approach would need to re-compute importance scores after removal of each attention head or FFN

Conclusion & future work