  • Self-attention weights are used to analyze token-to-token interactions in Transformer-based models.
  • Other components in the encoder layer can also affect information mixing in the output representations.
  • Value Zeroing is a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed.
  • The method is evaluated from several perspectives: linguistically informed rationales, probing, and faithfulness analysis.

Paper Content


  • Transformers are used to learn contextualized representations across a range of modalities
  • Attention weights are used to understand the information flow from the input embeddings to the output representation
  • Attention weights can be unreliable and uninformative
  • Value Zeroing is a novel approach to quantify the contribution of context tokens to the output representation
  • Value Zeroing is evaluated using grammatical agreement tasks, information-theoretic probing, and faithfulness to model decisions
  • Value Zeroing is more plausible, human-interpretable, and faithful to models’ decisions than other analysis methods
  • Numerous studies have used self-attention weights to gain insight into Transformers
  • Debate exists on whether attention weights are suitable for interpreting the model
  • Post-processing interpretability techniques have been proposed to convert weights into scores
  • Abnar and Zuidema (2020) proposed attention-flow and attention-rollout methods
  • Kobayashi et al. (2020) proposed a method incorporating the norm of transformed value vectors
  • Kobayashi et al. (2021) extended the method to the whole self-attention block
  • Brunner et al. (2020) and Pascual et al. (2021) used a gradient-based approach
  • Modarressi et al. (2022) and Ferrando et al. (2022) combined existing approaches
  • This work proposes a new context mixing score that takes into account all components of a Transformer encoder block
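
For comparison with the methods listed above, attention-rollout can be sketched in a few lines. This is a minimal illustration, assuming head-averaged attention and an equal-weight identity term for the residual connection; the function name, array layout, and random toy inputs are assumptions made for this example, not taken from the summarized paper.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers.

    attentions: array of shape (layers, heads, n, n) whose rows sum to 1.
    Returns an (n, n) matrix whose rows also sum to 1.
    """
    attentions = np.asarray(attentions, dtype=float)
    n = attentions.shape[-1]
    rollout = np.eye(n)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)             # average over heads
        a = 0.5 * a + 0.5 * np.eye(n)           # identity term models the residual path
        a = a / a.sum(axis=-1, keepdims=True)   # keep rows normalized
        rollout = a @ rollout                   # compose with the lower layers
    return rollout

# Toy input (random, for illustration only): 3 layers, 2 heads, 4 tokens.
rng = np.random.default_rng(0)
raw = rng.random((3, 2, 4, 4))
toy_attn = raw / raw.sum(axis=-1, keepdims=True)
rollout = attention_rollout(toy_attn)
```

Rollout uses only the attention weights, which is the limitation the proposed score is meant to address.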

Background and notation

  • Transformer encoder layer is composed of two sublayers: multi-head self-attention mechanism (MHA) and position-wise fully connected feed-forward network (FFN).
  • Each input vector x_i is transformed into a query q_i^h, a key k_i^h, and a value v_i^h vector for each head h via separate trainable linear transformations.
  • The context vector z_i^h for the ith token is generated as a weighted sum over the transformed value vectors, with the attention weights as coefficients.
  • Output token representations x̃_i are produced by the FFN, which applies two linear transformations with a ReLU activation in between.
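
The notation above can be made concrete with a small NumPy sketch of one attention head followed by the FFN. It is an illustration under simplifying assumptions: a single head, no bias terms in the projections, and no residual connections or layer normalization. All names, dimensions, and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One head: returns attention weights and context vectors z_i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    Z = alpha @ V                                  # weighted sum over value vectors
    return alpha, Z

def ffn(H, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between."""
    return np.maximum(H @ W1 + b1, 0.0) @ W2 + b2

# Toy dimensions and random weights (assumptions for illustration only).
rng = np.random.default_rng(0)
n, d, d_head, d_ff = 5, 8, 4, 16
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
W_o = rng.normal(size=(d_head, d))                 # projects the head back to d
alpha, Z = attention_head(X, W_q, W_k, W_v)
out = ffn(Z @ W_o, rng.normal(size=(d, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d)), np.zeros(d))
```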

Value zeroing

  • Aim to measure how much a token uses other context tokens to build its output representation
  • Treat self-attention mechanism as a fuzzy hash-table
  • Replace value vector associated with token j with a zero vector
  • Compare the alternative output representation of each token i with its original output representation to measure how much it is affected by the exclusion of the jth token
  • Collect these pairwise scores into a Value Zeroing matrix that describes the context mixing process
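
The steps above can be sketched on a toy encoder layer. This is a minimal illustration rather than the authors' implementation: it assumes a single attention head, randomly initialized weights, and cosine distance as the comparison between the original and alternative representations. `toy_encoder_layer`, `make_params`, and the parameter names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, eps=1e-5):
    return (h - h.mean(-1, keepdims=True)) / np.sqrt(h.var(-1, keepdims=True) + eps)

def toy_encoder_layer(X, p, zero_value_of=None):
    """Single-head encoder layer; optionally zero the value vector of one token."""
    Q, K, V = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    if zero_value_of is not None:
        V = V.copy()
        V[zero_value_of] = 0.0                    # token j contributes nothing
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    h = layer_norm(X + (alpha @ V) @ p["Wo"])     # attention + residual + norm
    f = np.maximum(h @ p["W1"], 0.0) @ p["W2"]    # feed-forward network
    return layer_norm(h + f)

def cosine_distance(a, b):
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def value_zeroing(X, p):
    """scores[i, j]: change in token i's output when token j's value is zeroed."""
    n = X.shape[0]
    original = toy_encoder_layer(X, p)
    scores = np.zeros((n, n))
    for j in range(n):
        altered = toy_encoder_layer(X, p, zero_value_of=j)
        scores[:, j] = cosine_distance(altered, original)
    return scores

def make_params(d, d_ff, rng):
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, d_ff), "W2": (d_ff, d)}
    return {k: rng.normal(scale=0.3, size=s) for k, s in shapes.items()}

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                      # 6 toy tokens, hidden size 16
vz = value_zeroing(X, make_params(16, 32, rng))
```

Because the comparison is made on the layer's final output, the score reflects the residual connections, layer normalization, and FFN as well as the attention weights. In a real multi-head model the value vector of token j would be zeroed in every head.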

Data


  • Used BLiMP benchmark to evaluate context mixing scores
  • The benchmark isolates linguistic phenomena so that a single word determines the true label of each sentence
  • Selected five datasets with three different linguistic phenomena
  • Expanded contractions and generated dependency trees using SpaCy
  • Unified dataset for grammatical agreement with 4,276 data points, divided into Train and Test sets

Target model

  • Experiments conducted on 3 Transformer-based language models: BERT, RoBERTa, and ELECTRA
  • MLM task performed by replacing target words with [MASK] token
  • Experiments conducted on both pre-trained and fine-tuned versions of each model
  • Fine-tuning allows model to concentrate on most helpful words for downstream task
  • Accuracy of 0.96 for pre-trained and 0.99 for fine-tuned BERT

Context mixing methods


  • Existing context mixing methods are included in experiments
  • Scores for each context token are selected from the mth row of the context mixing map, where m is the position of the [MASK] token
  • Scores are normalized to be positive values and sum to one
  • Gradient-based attribution methods are included in comparisons
  • Random scores, raw attention scores, norm-based method, and gradient-based attribution scores are considered
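
The row selection and normalization described above might look like the following sketch. The summary does not say how scores are made positive, so taking absolute values here is an assumption, as are the function name and the uniform fallback for an all-zero row.

```python
import numpy as np

def normalize_row(context_map, m):
    """Select row m of a context mixing map and rescale it to sum to one."""
    row = np.abs(np.asarray(context_map, dtype=float)[m])  # assumption: abs makes scores positive
    total = row.sum()
    if total == 0.0:
        return np.full_like(row, 1.0 / len(row))           # fall back to uniform scores
    return row / total

toy_map = [[1.0, -1.0, 2.0],
           [0.0, 0.0, 0.0]]
scores_for_mask = normalize_row(toy_map, 0)
```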

Evaluation 1: cue alignment

  • Cue words are the only indicators of true labels in dataset
  • Model performance depends on cue words to form representation of [MASK] token
  • Two ways to quantify alignment between cue vector and prediction of context mixing score: dot product and average precision
  • Raw self-attention weights (Attn) always perform worse than random scores
  • Attn-norm shows improvement when norm of transformed value vectors taken into account
  • Gradient-based scores highlight cue words only in earlier layers
  • The layer-wise probing experiment (Evaluation 2) shows that gradient-based scores are not reliable for identifying relevant context in individual layers
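
The two alignment measures can be illustrated as follows, assuming the cue vector is binary (1 for cue words, 0 otherwise) and the context mixing scores are already normalized. The function names and toy values are hypothetical.

```python
import numpy as np

def dot_product_alignment(scores, cue):
    """Total score mass assigned to cue words."""
    return float(np.dot(scores, cue))

def average_precision(scores, cue):
    """Rank tokens by score and average the precision at each cue position."""
    scores = np.asarray(scores, dtype=float)
    cue = np.asarray(cue, dtype=float)
    if cue.sum() == 0:
        raise ValueError("cue vector must mark at least one cue word")
    order = np.argsort(-scores, kind="stable")    # highest score first
    hits = cue[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

toy_scores = np.array([0.1, 0.6, 0.1, 0.2])       # made-up scores for four tokens
cue_top = np.array([0, 1, 0, 0])                  # cue is the top-ranked token
cue_second = np.array([0, 0, 0, 1])               # cue is the second-ranked token
```

With these toy values, the dot product is 0.6 and the average precision is 1.0 for `cue_top`, while `cue_second` gives an average precision of 0.5.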

Evaluation 2: context mixing versus probing

  • Cue word alignment and probing performance are related
  • Representations of masked tokens are associated with Singular or Plural labels
  • Minimum Description Length (MDL) is used to measure degree to which representations encode number agreement
  • Compression metric is used to evaluate probing performance
  • Value Zeroing is highly positively correlated with probing performance
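
As a rough illustration of the quantities involved, the sketch below computes the compression metric as the ratio between the codelength of a uniform code and the probe's description length, then rank-correlates two layer-wise series. Using Spearman correlation is an assumption, since the summary does not name the coefficient, and the simple ranking ignores ties. All numbers are made up.

```python
import numpy as np

def compression(n_examples, n_classes, codelength_bits):
    """Uniform codelength divided by the probe's description length."""
    return n_examples * np.log2(n_classes) / codelength_bits

def spearman(x, y):
    """Spearman rank correlation (no tie handling; adequate for distinct values)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Made-up layer-wise values, for illustration only.
toy_alignment = [0.1, 0.2, 0.4, 0.3]
toy_compression = [1.0, 2.0, 5.0, 3.0]
rho = spearman(toy_alignment, toy_compression)
```

A compression of 1 means the probe does no better than a uniform code; higher values mean the label is more easily extracted from the representations.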

Evaluation 3: faithfulness analysis

  • Value Zeroing score matches linguistically-informed expectations
  • However, it is not always clear whether a model's behavior matches human expectations, so plausibility alone does not establish faithfulness
  • Input ablation used to evaluate faithfulness of context mixing score
  • The influence of a token is estimated by the drop in the model's predicted probability after that token is ablated
  • A larger drop indicates that the ablated token is more influential
  • These blank-out scores are compared to the context mixing scores
  • Value Zeroing correlates most strongly with the blank-out scores, indicating that it is the most faithful of the compared methods
  • Value Zeroing mainly relies on main subject as cue word
  • Gradient-based methods highlight agreement attractor
  • Attention-based scores focus on [CLS] token
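
The input-ablation procedure can be sketched with a stand-in model. Everything here is hypothetical: `toy_predict_prob` is a hand-written function rather than a real language model, the `[UNK]` replacement token and the use of Pearson correlation are assumptions, and the scores are made up.

```python
import numpy as np

def blank_out_scores(tokens, predict_prob, skip=(), blank="[UNK]"):
    """Drop in predicted probability when each token is replaced by `blank`."""
    base = predict_prob(tokens)
    drops = np.zeros(len(tokens))
    for i in range(len(tokens)):
        if i in skip:
            continue                               # e.g. leave the [MASK] position intact
        ablated = list(tokens)
        ablated[i] = blank
        drops[i] = base - predict_prob(ablated)
    return drops

def toy_predict_prob(tokens):
    """Stand-in model: probability of predicting a plural verb."""
    p = 0.5
    if "pictures" in tokens:
        p += 0.4                                   # plural subject acts as the cue
    if "doctor" in tokens:
        p -= 0.1                                   # singular attractor pulls the other way
    return p

tokens = ["the", "pictures", "of", "the", "doctor", "[MASK]", "blurry"]
drops = blank_out_scores(tokens, toy_predict_prob, skip={5})

# Made-up context mixing scores for the same tokens.
toy_context_scores = np.array([0.05, 0.6, 0.05, 0.05, 0.15, 0.0, 0.1])
keep = [i for i in range(len(tokens)) if i != 5]
corr = float(np.corrcoef(drops[keep], toy_context_scores[keep])[0, 1])
```

A context mixing method whose scores correlate more strongly with the blank-out drops tracks the tokens the model actually relies on.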


  • Lack of standard ground truth makes evaluating explanation and analysis methods challenging
  • Evaluating context mixing scores is even more difficult
  • Several studies have used gradient-based scores as an anchor of faithfulness
  • Controlled tasks with strong prior expectations can be used to evaluate these methods
  • Kobayashi et al. (2021) raised concern that BERT tends to preserve token representations rather than mixing them at each layer


  • Propose Value Zeroing as a novel approach for quantifying information mixing in Transformers
  • Outperforms other methods in 3 different evaluation setups
  • Requires no supervision
  • Can improve model efficiency by removing token representations
  • No gold standard for interpreting Deep Learning models
  • Customized for Transformer architecture
  • Evaluations based on Text modality
  • The impact of the choice of distance metric on Value Zeroing was examined
  • Normalizing representations does not affect the scores
  • The same pattern is observed across different distance metrics
  • Value Zeroing consistently outperforms the other methods on all models