Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Self-attention weights are used to analyze token-to-token interactions in Transformer-based models.
Other components in the encoder layer can also affect information mixing in the output representations.
Value Zeroing is a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed.
Evaluations are done from different viewpoints, based on linguistically informed rationales, probing, and faithfulness analysis.

Paper Content

Introduction

Transformers are used to learn contextualized representations across a range of modalities
Attention weights are used to understand the information flow from the input embeddings to the output representation
Attention weights can be unreliable and uninformative
Value Zeroing is a novel approach to quantify the contribution of context tokens to the output representation
Value Zeroing is evaluated using grammatical agreement tasks, information-theoretic probing, and faithfulness to model decisions
Value Zeroing is more plausible, human-interpretable, and faithful to models’ decisions than other analysis methods

Related work

Numerous studies have used self-attention weights to gain insight into Transformers
Debate exists on whether attention weights are suitable for interpreting the model
Post-processing interpretability techniques have been proposed to convert weights into scores
Abnar and Zuidema (2020) proposed attention-flow and attention-rollout methods
Kobayashi et al. (2020) proposed a method incorporating the norm of transformed value vectors
Kobayashi et al. (2021) extended the method to the whole self-attention block
Brunner et al. (2020) and Pascual et al. (2021) used a gradient-based approach
Modarressi et al. (2022) and Ferrando et al. (2022) combined existing approaches
New context mixing score proposed to take into account all components in a Transformer encoder block

Background and notation

Transformer encoder layer is composed of two sublayers: multi-head self-attention mechanism (MHA) and position-wise fully connected feed-forward network (FFN).
Each input vector x_i is transformed into a query q_i^h, a key k_i^h, and a value v_i^h vector via separate trainable linear transformations.
Context vector z_i^h for the ith token is generated as a weighted sum over the transformed value vectors.
Output token representations x̃_i are produced by two linear transformations with a ReLU activation in between.
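
The weighted sum in the notes above can be sketched for a single head as follows. This is a minimal NumPy illustration, not the paper's code: it assumes the standard scaled dot-product softmax for the weights, omits bias terms, and uses random matrices (w_q, w_k, w_v are hypothetical names) in place of trained parameters.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v):
    # Separate linear maps give queries, keys, and values (biases omitted).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Assumption: standard scaled dot-product attention weights.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    # Row i of z is the weighted sum over all value vectors for token i.
    z = attn @ v
    return attn, v, z

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 4
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
attn, v, z = single_head_attention(x, w_q, w_k, w_v)
```

Each row of attn sums to one, so every context vector is a convex combination of the value vectors.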

Value zeroing

Aim to measure how much a token uses other context tokens to build its output representation
Treat self-attention mechanism as a fuzzy hash-table
Replace value vector associated with token j with a zero vector
Compare alternative output representation with original to measure how much output representation is affected by exclusion of jth token
Compute Value Zeroing matrix score to understand context mixing process
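
The steps above can be sketched on a toy single-head attention layer. This is a simplified illustration under stated assumptions, not the authors' implementation: the method described here compares outputs of the whole encoder layer, whereas toy_layer below contains only one attention head, and cosine distance is assumed as the distance metric. All function names are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def toy_layer(x, w_q, w_k, w_v, zero_value_of=None):
    # Toy stand-in for an encoder layer: one attention head only.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    if zero_value_of is not None:
        v = v.copy()
        v[zero_value_of] = 0.0  # replace the value vector of token j with zeros
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return attn @ v

def cosine_distance(a, b, eps=1e-12):
    # Row-wise cosine distance; eps guards against zero-norm rows.
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - dot / (norms + eps)

def value_zeroing_scores(x, w_q, w_k, w_v):
    n = x.shape[0]
    original = toy_layer(x, w_q, w_k, w_v)
    scores = np.zeros((n, n))
    for j in range(n):
        altered = toy_layer(x, w_q, w_k, w_v, zero_value_of=j)
        # scores[i, j]: how much token i's output changes without token j's value.
        scores[:, j] = cosine_distance(altered, original)
    return scores

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
vz = value_zeroing_scores(x, w_q, w_k, w_v)
```

Note that the attention weights themselves are left untouched; only the value vector of token j is removed, so the remaining tokens keep their original weights.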

Data

Used BLiMP benchmark to evaluate context mixing scores
Benchmark isolates linguistic phenomena so only one word determines true label of each sentence
Selected five datasets with three different linguistic phenomena
Expanded contractions and generated dependency trees using SpaCy
Unified dataset for grammatical agreement with 4,276 data points, divided into Train and Test sets

Target model

Experiments conducted on 3 Transformer-based language models: BERT, RoBERTa, and ELECTRA
MLM task performed by replacing target words with [MASK] token
Experiments conducted on both pre-trained and fine-tuned versions of each model
Fine-tuning allows model to concentrate on most helpful words for downstream task
Accuracy of 0.96 for pre-trained and 0.99 for fine-tuned BERT

Baselines

Existing context mixing methods are included in experiments
Scores for each context token are selected from the mth row of the context mixing map
Scores are normalized to be positive values and sum to one
Gradient-based attribution methods are included in comparisons
Random scores, raw attention scores, norm-based method, and gradient-based attribution scores are considered
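
Row selection and normalization can be sketched as below. The notes do not say how scores are made positive, so taking absolute values before dividing by the row sum is an assumption, and scores_for_masked_token is a hypothetical helper name.

```python
import numpy as np

def scores_for_masked_token(context_mixing_map, m):
    # Select the m-th row: the scores of every context token for position m.
    row = np.asarray(context_mixing_map, dtype=float)[m]
    # Assumption: absolute values make the scores positive.
    row = np.abs(row)
    total = row.sum()
    if total == 0:
        # Fall back to a uniform distribution when every score is zero.
        return np.full_like(row, 1.0 / row.size)
    return row / total

toy_map = [[1.0, -1.0, 2.0],
           [0.5, 0.5, 1.0]]
normalized = scores_for_masked_token(toy_map, m=0)
```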

Evaluation 1: cue alignment

Cue words are the only indicators of true labels in dataset
Model performance depends on cue words to form representation of [MASK] token
Two ways to quantify alignment between cue vector and prediction of context mixing score: dot product and average precision
Raw self-attention weights (Attn) always perform worse than random scores
Attn-norm shows improvement when norm of transformed value vectors taken into account
Gradient-based scores highlight cue words only in earlier layers
Layer-wise probing experiment will show scores are not reliable for identifying relevant context in individual layers
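
The two alignment measures can be sketched as follows, assuming the cue vector is binary (1 at cue word positions, 0 elsewhere) and the scores are already normalized. Function names and the example values are illustrative, not taken from the paper.

```python
import numpy as np

def cue_dot_product(cue, scores):
    # Total score mass that falls on the cue positions.
    return float(np.dot(cue, scores))

def average_precision(cue, scores):
    cue = np.asarray(cue)
    scores = np.asarray(scores, dtype=float)
    # Rank tokens from highest to lowest score (stable for ties).
    order = np.argsort(-scores, kind="stable")
    ranked = cue[order]
    if ranked.sum() == 0:
        return 0.0
    hits = np.cumsum(ranked)
    ranks = np.arange(1, len(ranked) + 1)
    # Precision at each rank where a cue word is found, then averaged.
    return float((hits[ranked == 1] / ranks[ranked == 1]).mean())

cue = np.array([0, 1, 0, 0])
good_scores = np.array([0.1, 0.6, 0.2, 0.1])
poor_scores = np.array([0.5, 0.2, 0.2, 0.1])
dot_good = cue_dot_product(cue, good_scores)
ap_good = average_precision(cue, good_scores)
ap_poor = average_precision(cue, poor_scores)
```

The dot product rewards the magnitude of the score on the cue word, while average precision depends only on where the cue word lands in the ranking.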

Evaluation 2: context mixing versus probing

Cue word alignment and probing performance are related
Representations of masked tokens are associated with Singular or Plural labels
Minimum Description Length (MDL) is used to measure degree to which representations encode number agreement
Compression metric is used to evaluate probing performance
Value Zeroing is highly positively correlated with probing performance

Evaluation 3: faithfulness analysis

Value Zeroing score matches linguistically-informed expectations
Not always clear if context mixing score matches human expectations
Input ablation used to evaluate faithfulness of context mixing score
Influence of target token estimated by drop in model’s predicted probability
Higher drop for ablated token indicates token is more influential
Blank-out scores compared to context mixing scores
Highest correlation for Value Zeroing indicates it is more faithful
Value Zeroing mainly relies on main subject as cue word
Gradient-based methods highlight agreement attractor
Attention-based scores focus on [CLS] token
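
The input ablation procedure can be sketched with a toy model. Everything here is an assumption made for illustration: toy_predict_prob is a hypothetical stand-in for a real model's probability of the correct word, the "[UNK]" placeholder is an assumed blanking token, and Pearson correlation is used only because the notes do not name the correlation measure.

```python
import numpy as np

def blank_out_scores(tokens, predict_prob, blank="[UNK]"):
    # Probability of the correct prediction on the intact input.
    base = predict_prob(tokens)
    drops = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + [blank] + tokens[i + 1:]
        # A larger drop means the ablated token was more influential.
        drops.append(base - predict_prob(ablated))
    return np.array(drops)

def toy_predict_prob(tokens):
    # Hypothetical model: confident only when the plural subject is visible.
    return 0.9 if "pilots" in tokens else 0.3

tokens = ["the", "pilots", "[MASK]", "here"]
drops = blank_out_scores(tokens, toy_predict_prob)

# Compare against an illustrative context mixing score for the [MASK] position.
mixing_scores = np.array([0.1, 0.7, 0.1, 0.1])
correlation = float(np.corrcoef(drops, mixing_scores)[0, 1])
```

A context mixing score that correlates strongly with the blank-out drops assigns high values to the tokens the model's decision actually depends on.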

Discussion

Lack of standard ground truth makes evaluating explanation and analysis methods challenging
Evaluating context mixing scores is even more difficult
Several studies have used gradient-based scores as an anchor of faithfulness
Controlled tasks with strong prior expectations can be used to evaluate these methods
Kobayashi et al. (2021) raised concern that BERT tends to preserve token representations rather than mixing them at each layer

Conclusion

Propose Value Zeroing as a novel approach for quantifying information mixing in Transformers
Outperforms other methods in 3 different evaluation setups
Requires no supervision
Can improve model efficiency by removing token representations
No gold standard for interpreting Deep Learning models
Customized for Transformer architecture
Evaluations based on Text modality
Impact of selecting different distance metrics
Normalizing representations does not affect scores
Same pattern observed for different distance metrics
Consistently outperforms other methods on all models
