Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Language models learn a lot of information during pretraining
  • A fact can be changed by editing weights in a different location than where localization methods suggest the fact is stored
  • Past work on model editing methods relies on Causal Tracing to select which model layers to edit
  • Experiments show that editing performance is essentially unrelated to localization results from representation denoising (Causal Tracing)
  • Better understanding of language models may not always translate to insights about how to best change their behavior

Paper Content


  • Language models learn facts during pretraining
  • Recent work explores how facts are stored in model weights
  • Model editing methods can be used to inject new facts into model weights
  • The connection between localization and editing rests on the assumption that one should edit a model by first localizing a behavior to a component and then editing that component
  • Causal Tracing measures information content of hidden representations and is used to motivate ROME and MEMIT model-editing methods
  • Correlation between Causal Tracing results and edit success is near zero
  • It is better to ignore tracing results and always choose an early-to-mid-layer MLP weight for editing
  • Tracing effects explain only a small fraction of variance in editing performance
  • Localization methods focus on model components such as layers, neurons, and weight matrices
  • MLP layers are studied for their role in factual association
  • Localization is validated by editing neuron activations or layer weights
  • Successfully editing a suggested location does not show that editing there is necessary or the best option
  • Recent work has looked into editing success across layers
  • Investigating edit success at the datapoint level reveals unexpected results

Notation and background

Data notation

  • Consider facts of the form (s, r, o)
  • A prompt P = (s, r) asks the model to complete the fact (s, r, o)
  • Several variations of the data for the fact (s, r, o) are used:
  • s* is a “neighboring” entity to the subject s
  • r* is a paraphrase of the relation r
  • s_noise is a noised representation of the subject s
  • o_false is an object that incorrectly completes the tuple (s, r, •)
  • o_true is the object that correctly completes the fact (s, r, •)

Causal tracing

  • Causal Tracing is a method for localizing information in the forward pass of an autoregressive Transformer.
  • It estimates the amount of information about a fact contained in each of the representations produced by the forward pass.
  • A tracing window size of 5 is used by default, and it is applied exclusively to MLP layers.
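
The restore-and-measure idea behind Causal Tracing can be illustrated with a toy model. This is a minimal sketch under strong assumptions, not the procedure used in the paper: the "model" is a single-position residual stream with tanh MLP blocks and no attention, and all names (`forward`, `tracing_effects`, the dimensions, the noise scale) are hypothetical. It assumes the tracing effect for a window is the target probability after restoring clean MLP outputs in that window minus the target probability of the noised run.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def forward(x, mlps, unembed, patch=None):
    """Toy residual stream: h <- h + tanh(W h) per layer.

    `patch` maps a layer index to an MLP output that replaces the computed one.
    Returns (output probabilities, list of MLP outputs actually used).
    """
    patch = patch or {}
    h, mlp_outs = x, []
    for i, W in enumerate(mlps):
        out = patch.get(i, np.tanh(W @ h))
        mlp_outs.append(out)
        h = h + out
    return softmax(unembed @ h), mlp_outs


def tracing_effects(x_clean, x_noised, mlps, unembed, target, window=5):
    """Effect per window center: p(target | restored) - p(target | noised).

    Assumes an odd window; windows are clipped at the first and last layer.
    """
    _, clean_outs = forward(x_clean, mlps, unembed)
    p_noised, _ = forward(x_noised, mlps, unembed)
    n, half = len(mlps), window // 2
    effects = []
    for center in range(n):
        lo, hi = max(0, center - half), min(n, center + half + 1)
        # Restore the clean MLP outputs inside the window only.
        patch = {i: clean_outs[i] for i in range(lo, hi)}
        p_restored, _ = forward(x_noised, mlps, unembed, patch)
        effects.append(p_restored[target] - p_noised[target])
    return np.array(effects)


# Hypothetical toy setup: random weights stand in for a trained model.
rng = np.random.default_rng(0)
d, n_layers, vocab = 16, 12, 10
mlps = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
unembed = rng.normal(size=(vocab, d))
x_clean = rng.normal(size=d)
x_noised = x_clean + rng.normal(size=d)  # stands in for the noised subject
target = int(np.argmax(forward(x_clean, mlps, unembed)[0]))

effects = tracing_effects(x_clean, x_noised, mlps, unembed, target, window=5)
print("window center with largest tracing effect:", int(np.argmax(effects)))
```

In a real Transformer the same loop would also range over token positions, which is why a separate choice is needed for how to reduce per-token effects to a single number per layer.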

Model editing with ROME

  • ROME (Rank-One Model Editing), which edits a single MLP weight matrix with a rank-one update, is described
  • Additional editing methods are outlined
  • Mathematical detail can be found in Meng et al. (2022a)
  • Desirable model edit is described in Sec. 3.4
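
As a rough illustration of the rank-one update that gives ROME (Rank-One Model Editing) its name, the sketch below edits a weight matrix so that a chosen key vector maps to a new value vector. It assumes the closed form W' = W + Λ (C⁻¹k)ᵀ with Λ = (v − Wk) / ((C⁻¹k)ᵀk), where C is a covariance of previously seen keys; Meng et al. (2022a) remains the authoritative source for the details. The function name, dimensions, and random data are hypothetical, and the optimization that produces the value vector v is omitted.

```python
import numpy as np


def rank_one_edit(W, C, k, v):
    """Return W' with W' @ k == v, where W' - W has rank one.

    W: (d_out, d_in) weight matrix
    C: (d_in, d_in) key covariance, assumed invertible
    k: (d_in,) key vector for the edited fact
    v: (d_out,) value vector the key should map to
    """
    c_inv_k = np.linalg.solve(C, k)        # C^{-1} k without forming the inverse
    residual = v - W @ k                   # what the current weights get wrong
    lam = residual / (c_inv_k @ k)
    return W + np.outer(lam, c_inv_k)


# Hypothetical data standing in for MLP weights and key statistics.
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))
K = rng.normal(size=(d_in, 50))            # sample of previously seen keys
C = K @ K.T / K.shape[1]
k = rng.normal(size=d_in)
v = rng.normal(size=d_out)

W_new = rank_one_edit(W, C, k, v)
print("max error on edited key:", np.abs(W_new @ k - v).max())
```

Weighting the update direction by C⁻¹ is what keeps the change small for keys that resemble those seen before, which is the intuition behind editing one fact without disturbing others.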

Editing metrics

  • Evaluate editing methods based on ability to change model prediction, generalize to paraphrases, and avoid over-generalizing
  • Use metrics from CounterFact data and normalize scores to make them comparable
  • Rewrite score measures how much edit improves target probability
  • Paraphrase score measures target probability using syntactical paraphrases
  • Neighborhood score measures whether edits change predictions for similar prompts
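
The normalization behind the Rewrite Score can be sketched as follows. This is a minimal illustration, assuming the score is the increase in target probability divided by the largest increase that was possible, and that the Paraphrase Score averages the same quantity over paraphrased prompts. The function names and example probabilities are hypothetical; the Neighborhood Score is left out because its exact normalization is not given here.

```python
from typing import Sequence


def rewrite_score(p_before: float, p_after: float) -> float:
    """Fraction of the available probability gain on the edit target that was achieved.

    p_before: probability of the target object before the edit
    p_after:  probability of the target object after the edit
    """
    headroom = 1.0 - p_before
    if headroom <= 0.0:
        raise ValueError("target already has probability 1; score is undefined")
    return (p_after - p_before) / headroom


def paraphrase_score(p_before: Sequence[float], p_after: Sequence[float]) -> float:
    """Mean rewrite score over paraphrases of the edited prompt (assumed aggregation)."""
    if len(p_before) != len(p_after) or not p_before:
        raise ValueError("need one before/after pair per paraphrase")
    scores = [rewrite_score(b, a) for b, a in zip(p_before, p_after)]
    return sum(scores) / len(scores)


# Hypothetical probabilities for one edit and two paraphrases of its prompt.
single = rewrite_score(0.1, 0.55)
paraphrased = paraphrase_score([0.2, 0.5], [0.6, 0.75])
print(single, paraphrased)
```

Dividing by the headroom makes scores comparable across facts whose targets start at very different probabilities.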

Does edit success follow from localization?

  • A natural assumption is that localization results should inform editing methods.
  • On this view, knowing where information is stored in a model should help manipulate the model’s expression of that information.
  • Editing has in turn been used to verify the quality of localization analyses.
  • Testing this assumption for autoregressive Transformers suggests that edit success is essentially unrelated to localization results.

Experiment design

  • Goal of experiments is to determine if edit success aligns with results from Causal Tracing
  • Edit Success is measured by Rewrite Score
  • Tracing effect is taken as the max across token effects
  • Tracing window size is 5
  • Results are also tested with other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at last subject token

Model and data

  • GPT-J is a 6 billion parameter autoregressive language model
  • Experiments conducted using CounterFact dataset
  • Editing performance recorded at 8 layers
  • ROME achieves an average rewrite score of 99% at layer 6 and above 96% at all other tested layers except layer 28
  • CounterFact dataset includes datapoints consisting of a prompt, paraphrases, and neighboring points
  • 10% of CounterFact dataset used for experiments, filtered to a subset of facts correctly completed by GPT-J
  • Final sample size is 652

Experiment results

  • Tracing effects explain at most 3.2% of the variance in edit success
  • By contrast, the choice of edit layer explains 58.5% of the variance in the outcome on average
  • Tracing effects are weakly informative of editing in the Fact Forcing setting
  • For Fact Forcing, tracing effects explain 3% more of the variance in edit success
  • Otherwise, tracing effects are essentially unrelated to edit success
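
One way to quantify "variance explained" is to compare the R² of a regression on the edit layer alone with the R² after adding the tracing effect as a predictor. The sketch below does this on synthetic data. It is an illustration of the comparison, not a reproduction of the paper's analysis: the data-generating numbers, variable names, and the use of one dummy variable per layer are all assumptions.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


# Synthetic stand-in for (edit layer, tracing effect, rewrite score) triples.
rng = np.random.default_rng(0)
n_points, n_layers = 400, 8
layer = rng.integers(0, n_layers, size=n_points)
tracing_effect = rng.normal(size=n_points)
layer_means = np.linspace(1.0, 0.4, n_layers)   # assumed: later layers edit worse
rewrite = (layer_means[layer]
           + 0.02 * tracing_effect                # assumed: tiny tracing signal
           + rng.normal(scale=0.1, size=n_points))

# One dummy column per layer, dropping the first to avoid duplicating the intercept.
layer_dummies = np.eye(n_layers)[layer][:, 1:]

r2_layer = r_squared(layer_dummies, rewrite)
r2_both = r_squared(np.column_stack([layer_dummies, tracing_effect]), rewrite)
print("R^2, layer only:          ", round(r2_layer, 3))
print("R^2, layer + tracing:     ", round(r2_both, 3))
print("additional variance share:", round(r2_both - r2_layer, 3))
```

Because the second model nests the first, its R² can never be lower; the quantity of interest is how small the gain is.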

Reconciling localization and editing

  • Propose variants of the model editing problem that are more closely related to insights from tracing analysis
  • Repeat and extend analysis from previous section for all editing problems

Editing problem variants

  • Fig. 5 summarizes editing problems
  • Solutions to each problem are evaluated using Rewrite Score, Paraphrase Score, and Neighborhood Score metrics

Experiment design and additional edit methods

  • Experiment uses four different methods to edit MLP layers
  • ROME edits a single layer; MEMIT spreads its update over several layers; Constrained Finetuning is applied either to a single layer (window size 1) or to five adjacent layers (window size 5)


  • Causal Tracing is not indicative of which layer to select for model editing
  • Causal Tracing has helped reveal the role of early-to-mid-range MLP representations in factual association
  • ROME performs better on average when optimizing the last subject token representation
  • Editing MLP weights is preferable to editing other weights with large Causal Tracing effects
  • Information is gradually accumulated across layers in a Transformer forward pass
  • It is possible to “override” the information in a layer with an edit to another layer
  • Causal Tracing answers a different question than model editing does


  • Model edit success is essentially unrelated to where factual information is stored in models, as measured by Causal Tracing.
  • Four variants of the Error Injection problem are introduced using the CounterFact dataset.
  • Edit success and tracing effects correlate best in the Fact Forcing setting.
  • Tracing effects explain only a small fraction of the variance in editing performance.
  • Better understanding of how pretrained language models work may not always translate to insights about how to best change their behavior.
  • Data is filtered to a subset of facts correctly completed by GPT-J.
  • Edit methods are tuned to have high rewrite scores.
  • Tracing effects are not predictive of editing performance even when facts appear to be stored in a small number of layers.
  • Essence drift is measured by calculating the change in model perplexity over samples of text.
  • Results are similar for other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at the last subject token.