Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models perform better with increased scale
- In-context learning paradigm is used
- Investigated hypothesis that ability of large language model to in-context learn is not uniformly spread
- 66 billion parameter language model used across 14 tasks
- 70% of attention heads and 20% of feed forward networks can be removed with minimal decline in performance
- Overlap in set of attention heads important for in-context learning across tasks and number of examples
Paper Content
Introduction
- LLMs based on Transformer architecture have revolutionized NLP
- Zero/few-shot incontext learning paradigm is used
- Question: Are all LLM components needed to perform in-context learning?
- Task-specific importance scores and structured pruning used to answer question
- Up to 70% of attention heads can be removed with minimal decline in performance
- Attention heads important for in-context learning overlap across tasks and shots
- Up to 20% of FFNs can be removed with minimal decline in performance
- Primitive operations associated with in-context learning quantified
- Small set of heads have nontrivial scores for primitive operations
- Heads overlap with ones identified to be important for in-context learning
Background & methods
- OPT model used for experiments
- Background on in-context learning and mathematical formulation of induction heads
- Adaptation of oracle and gradient-based importance score formulations for in-context learning
Open pre-trained transformer (opt)
- OPT is a suite of language models of varying sizes, the largest being OPT-66B with 66 billion parameters.
- OPT-66B was pre-trained with a maximum sequence length of 2048 and embedding dimension d e = 9216.
- OPT-66B has multiple decoder layers consisting of multi-headed attention (MHA) blocks, layer norm (LN) and feed forward networks (FFN).
- There are 72 attention heads of dimension d h = 128 in every layer.
- There are 4608 attention heads across 64 layers in OPT-66B, constituting 21.7B of the total 66B parameters.
In-context learning & induction heads
- In-context learning is a new paradigm of learning for language models
- In-context learning involves generating output text based on a few or zero training examples
- Few-shot in-context learning can involve explicit primitive interactions between the in-context examples
- Olsson et al. (2022) developed a mathematical framework to better understand the mechanics of in-context learning
- Primitive operations such as prefix matching and copying are used to measure the capacity of attention heads to independently perform in-context learning
Experimental setup
- Experiments performed on 14 NLP datasets/tasks
- Evaluation metric is accuracy
- Datasets include ARC Easy and Challenge, OpenBookQA, HellaSwag, PIQA, Winogrande, BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC
- For out-of-distribution generalization, MathQA and LAMBADA datasets used
- Eleuther AI’s lmevaluation-harness framework used for experiments
Importance scores for opt-66b
- OPT-66B uses importance scores for attention heads and feed forward networks
- OPT-66B is used for zero-shot and fewshot (1-shot and 5-shot) in-context learning
Attention heads
- Heatmaps of head importance scores are shown in Figure 3.
- Important attention heads are clustered in intermediate layers of OPT-66B.
- There is some overlap in important attention heads across zero/few-shot settings.
Feed forward networks
- Oracle importance scores are computed for each of the 64 FFNs in OPT-66B
- In the zero-shot and one-shot settings, the removal of any FFN in the early layers of OPT-66B either gives comparable or better performance for a vast majority of tasks
- In the five-shot setting, both the early and later layers seem to have important FFNs for most tasks
- High variance in FFN importance scores in later layers, with absolute accuracy improvements/degradation of up to 20%
Iterative pruning
- Observations in previous section showed uneven ability to in-context learn-perform a task in OPT-66B
- Assessing how much task performance declines when removing multiple attention heads and FFNs
Removing attention heads
- Sorted attention heads in ascending order by importance score
- Removed 10% of attention heads at a time and re-evaluated task performance
- Average accuracy across tasks does not change much until 70% of attention heads are removed
- Some tasks show oddities such as zero-shot accuracy increasing after 70% of attention heads are removed
Removing ffns
- Sort FFNs in ascending order by importance score
- Remove 10% of FFNs at a time and re-evaluate task performance
- Average accuracy across tasks does not change until 20% of FFNs are removed
- Inflection point after which sharp decline in accuracy changes to 10% for few-shot settings
- FFNs play critical role toward in-context learning
Combined removal of heads & ffns
- Investigated whether inflection points to in-context learning performance still hold when removing attention heads and FFNs in tandem
- Removing 70% of attention heads and 20% of FFNs leads to 5% absolute drop in zero-shot accuracy
- Removing 70% of attention heads and 10% of FFNs leads to 6% absolute drop in one-shot accuracy
- Removing 60% of attention heads and 20% of FFNs leads to 4% absolute drop in five-shot accuracy
- Deviation of inflection points by at most 10% absolute due to interplay between heads and FFNs
Detailed analysis of attention heads
- Analysis of attention heads in OPT-66B
- Cross-task analysis to understand shared (un)important attention heads
- Cross-shot analysis to study (un)important attention heads in zero-shot and few-shot settings
- Quantifying capacity of attention heads to perform task-agnostic induction operations
Cross-task analysis
- SRCC is used to measure overlap in (un)important attention heads across tasks
- SRCC is positive for every pair of tasks in the zero-shot and few-shot settings
- SRCC is lower between every task and ReCoRD, a long reading comprehension task
- Pruning using the rankings described has almost no effect on accuracy up to the 50% mark
- Sharp decline in accuracy on COPA and Winogrande when pruned to the 70% mark using ReCoRD ranking
- 71% and 76% overlap between the top 30% important attention heads for ReCoRD-COPA and ReCoRD-Winogrande respectively
- Accuracy on ReCoRD at the 70% pruning mark is better using the aggregate ranking than using the ranking for ReCoRD itself
- Self-ranking accuracy curves are higher than the aggregate ranking accuracy curves for MathQA
Cross-shot analysis
- Spearman’s rank correlation coefficient (SRCC) was used to measure the importance of attention heads across zero and few-shot settings.
- SRCC was higher for rankings within the few-shot setting than for rankings across the zero and few-shot settings, indicating non-trivial overlap in the (un)important attention heads for tasks across shots.
Induction heads in opt-66b
- Quantified capacity of attention heads to perform prefix matching and copying
- Small subset of attention heads in OPT-66B have high prefix matching scores
- Relatively larger number of attention heads with high copying scores
- Induction heads overlap with task-aggregated important attention heads
Related work
- Interest in leveraging in-context learning paradigm
- Analyzing and interpreting how attention works
- Different formulations for head importance
- Trend of increasing model scale
- Focus on pre-training loss curve in scaling laws
- Empirical observations rely on simple greedy approach to training-free pruning
- Greedy approach is sub-optimal and produces under-estimates
- Need to re-compute importance scores after removal of each attention head or FFN