Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Self-attention weights are used to analyze token-to-token interactions in Transformer-based models.
- Other components in the encoder layer can also affect information mixing in the output representations.
- Value Zeroing is a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed.
- Evaluations are done with different view points based on linguistically informed rationales, probing, and faithfulness analysis.
Paper Content
Introduction
- Transformers are used to learn contextualized representations across a range of modalities
- Attention weights are used to understand the information flow from the input embeddings to the output representation
- Attention weights can be unreliable and uninformative
- Value Zeroing is a novel approach to quantify the contribution of context tokens to the output representation
- Value Zeroing is evaluated using grammatical agreement tasks, information-theoretic probing, and faithfulness to model decisions
- Value Zeroing is more plausible, human-interpretable, and faithful to models’ decisions than other analysis methods
Related work
- Numerous studies have used self-attention weights to gain insight into Transformers
- Debate exists on whether attention weights are suitable for interpreting the model
- Post-processing interpretability techniques have been proposed to convert weights into scores
- Abnar and Zuidema (2020) proposed attention-flow and attention-rollout methods
- Kobayashi et al. (2020) proposed a method incorporating the norm of transformed value vectors
- Kobayashi et al. (2021) extended the method to the whole self-attention block
- Brunner et al. (2020) and Pascual et al. (2021) used a gradient-based approach
- Modarressi et al. (2022) and Ferrando et al. (2022) combined existing approaches
- New context mixing score proposed to take into account all components in a Transformer encoder block
Background and notation
- Transformer encoder layer is composed of two sublayers: multi-head self-attention mechanism (MHA) and position-wise fully connected feed-forward network (FFN).
- Each input vector xi is transformed into a query qhi, a key khi, and a value vhi vector via separate trainable linear transformations.
- Context vector zhi for the ith token is generated as a weighted sum over the transformed value vectors.
- Output token representations xi are produced by two linear transformations with a ReLU activation in between.
Value zeroing
- Aim to measure how much a token uses other context tokens to build its output representation
- Treat self-attention mechanism as a fuzzy hash-table
- Replace value vector associated with token j with a zero vector
- Compare alternative output representation with original to measure how much output representation is affected by exclusion of j th token
- Compute Value Zeroing matrix score to understand context mixing process
Data
- Used BLiMP benchmark to evaluate context mixing scores
- Benchmark isolates linguistic phenomena so only one word determines true label of each sentence
- Selected five datasets with three different linguistic phenomena
- Expanded contractions and generated dependency trees using SpaCy
- Unified dataset for grammatical agreement with 4,276 data points, divided into Train and Test sets
Target model
- Experiments conducted on 3 Transformer-based language models: BERT, RoBERTa, and ELECTRA
- MLM task performed by replacing target words with [MASK] token
- Experiments conducted on both pre-trained and fine-tuned versions of each model
- Fine-tuning allows model to concentrate on most helpful words for downstream task
- Accuracy of 0.96 for pre-trained and 0.99 for fine-tuned BERT
Baselines
- Existing context mixing methods are included in experiments
- Scores for each context token are selected from the mth row of the context mixing map
- Scores are normalized to be positive values and sum to one
- Gradient-based attribution methods are included in comparisons
- Random scores, raw attention scores, norm-based method, and gradient-based attribution scores are considered
Evaluation 1: cue alignment
- Cue words are the only indicators of true labels in dataset
- Model performance depends on cue words to form representation of [MASK] token
- Two ways to quantify alignment between cue vector and prediction of context mixing score: dot product and average precision
- Raw self-attention weights (Attn) always perform worse than random scores
- Attn-norm shows improvement when norm of transformed value vectors taken into account
- Gradient-based scores highlight cue words only in earlier layers
- Layer-wise probing experiment will show scores are not reliable for identifying relevant context in individual layers
Evaluation 2: context mixing versus probing
- Cue word alignment and probing performance are related
- Representations of masked tokens are associated with Singular or Plural labels
- Minimum Description Length (MDL) is used to measure degree to which representations encode number agreement
- Compression metric is used to evaluate probing performance
- Value Zeroing is highly positively correlated with probing performance
Evaluation 3: faithfulness analysis
- Value Zeroing score matches linguistically-informed expectations
- Not always clear if context mixing score matches human expectations
- Input ablation used to evaluate faithfulness of context mixing score
- Influence of target token estimated by drop in model’s predicted probability
- Higher drop for ablated token indicates token is more influential
- Blank-out scores compared to context mixing scores
- Highest correlation for Value Zeroing indicates it is more faithful
- Value Zeroing mainly relies on main subject as cue word
- Gradient-based methods highlight agreement attractor
- Attention-based scores focus on [CLS] token
Discussion
- Lack of standard ground truth makes evaluating explanation and analysis methods challenging
- Evaluating context mixing scores is even more difficult
- Several studies have used gradient-based scores as an anchor of faithfulness
- Controlled tasks with strong prior expectations can be used to evaluate these methods
- Kobayashi et al. (2021) raised concern that BERT tends to preserve token representations rather than mixing them at each layer
Conclusion
- Propose Value Zeroing as a novel approach for quantifying information mixing in Transformers
- Outperforms other methods in 3 different evaluation setups
- Requires no supervision
- Can improve model efficiency by removing token representations
- No gold standard for interpreting Deep Learning models
- Customized for Transformer architecture
- Evaluations based on Text modality
- Impact of selecting different distance metrics
- Normalizing representations does not affect scores
- Same pattern observed for different distance metrics
- Consistently outperforms other methods on all models