Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language models learn a lot of information during pretraining
- Facts can be stored in different locations than previously thought
- Past work on model editing methods relies on Causal Tracing to select which model layers to edit
- Experiments show that editing performance relates to localization results from representation denoising
- Better understanding of language models may not always translate to insights about how to best change their behavior
Paper Content
Introduction
- Language models learn facts during pretraining
- Recent work explores how facts are stored in model weights
- Model editing methods can be used to inject new facts into model weights
- Connection between localization and editing is based on assumption that one should edit a model by localizing a behavior and then editing that component
- Causal Tracing measures information content of hidden representations and is used to motivate ROME and MEMIT model-editing methods
- Correlation between Causal Tracing results and edit success is near zero
- Better to ignore tracing results and always choose early-to-mid-layer MLP weight for editing
- Tracing effects explain only a small fraction of variance in editing performance
Related work
- Localization methods focus on model components such as layers, neurons, and weight matrices
- MLP layers are studied for their role in factual association
- Localization is validated by editing neuron activations or layer weights
- Editing suggested locations does not show if it is necessary or the best option
- Recent work has looked into editing success across layers
- Investigating edit success at the datapoint level reveals unexpected results
Notation and background
Data notation
- Consider facts of the form (s, r, o)
- Prompt P for some fact (s, r, o)
- Variations of the data for the fact (s, r, o)
- s* is a “neighboring” entity to the subject s
- r* is a paraphrase of the relation r
- s noise is a noised representation of the subject s
- o false is an object that incorrectly completes the tuple (s, r, •)
- o true is the object that correctly completes the fact (s, r, •)
Causal tracing
- Causal Tracing is a method for localizing information in the forward pass of an autoregressive Transformer.
- It estimates the amount of information about a fact contained in each of the representations produced by the forward pass.
- A tracing window size of 5 is used by default, and it is applied exclusively to MLP layers.
Model editing with rome
- ROME editing method is described
- Additional editing methods are outlined
- Mathematical detail can be found in Meng et al. (2022a)
- Desirable model edit is described in Sec. 3.4
Editing metrics
- Evaluate editing methods based on ability to change model prediction, generalize to paraphrases, and avoid over-generalizing
- Use metrics from CounterFact data and normalize scores to make them comparable
- Rewrite score measures how much edit improves target probability
- Paraphrase score measures target probability using syntactical paraphrases
- Neighborhood score measures whether edits change predictions for similar prompts
Does edit success follow from localization?
- Localization results should inform editing methods.
- Knowing where information is stored in a model can help manipulate the model’s expression of that information.
- Editing is used to verify the quality of localization analysis.
- Investigating the validity of this assumption as it applies to autoregressive Transformers suggests that edit success appears to be unrelated to localization results.
Experiment design
- Goal of experiments is to determine if edit success aligns with results from Causal Tracing
- Edit Success is measured by Rewrite Score
- Tracing effect is taken as the max across token effects
- Tracing window size is 5
- Results are also tested with other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at last subject token
Model and data
- GPT-J is a 6 billion parameter autoregressive language model
- Experiments conducted using CounterFact dataset
- Editing performance recorded at 8 layers
- ROME achieves an average rewrite score of 99% at layer 6 and above 96% at layers besides layer 28
- CounterFact dataset includes datapoints consisting of a prompt, paraphrases, and neighboring points
- 10% of CounterFact dataset used for experiments, filtered to a subset of facts correctly completed by GPT-J
- Final sample size is 652
Experiment results
- Tracing effects explain at most 3.2% of the variance in edit success
- Tracing effects explain 58.5% of the variance in the outcome on average
- Tracing effects are weakly informative of Fact Forcing editing
- Tracing effects explain 3% more of the variance in edit success for Fact Forcing
- Tracing effects are unrelated to editing success
Reconciling localization and editing
- Propose variants of the model editing problem that are more closely related to insights from tracing analysis
- Repeat and extend analysis from previous section for all editing problems
Editing problem variants
- Fig. 5 summarizes editing problems
- Solutions to each problem are evaluated using Rewrite Score, Paraphrase Score, and Neighborhood Score metrics
Experiment design and additional edit methods
- Experiment uses 4 different methods to edit MLP layers
- ROME edits single layer, MEMIT spreads update over several layers, Constrained Finetuning (window size 1) applies to single layer, Constrained Finetuning (window size 5) applies to five adjacent layers
Discussion
- Causal Tracing is not indicative of which layer to select for model editing
- Causal Tracing has helped reveal the role of early-to-mid-range MLP representations in factual association
- ROME performs better on average when optimizing the last subject token representation
- Editing MLP weights is preferable to editing other weights with large Causal Tracing effects
- Information is gradually accumulated across layers in a Transformer forward pass
- It is possible to “override” the information in a layer with an edit to another layer
- Causal Tracing answers a different question than model editing does
Conclusion
- Model edit success is unrelated to where factual information is stored in models.
- Four variants of the Error Injection problem are introduced using the CounterFact dataset.
- Edit success and tracing effects correlate best in the Fact Forcing setting.
- Tracing effects explain only a small fraction of the variance in editing performance.
- Better understanding of how pretrained language models work may not always translate to insights about how to best change their behavior.
- Data is filtered to a subset of facts correctly completed by GPT-J.
- Edit methods are tuned to have high rewrite scores.
- Tracing effects are not predictive of editing performance even when facts appear to be stored in a small number of layers.
- Essence drift is measured by calculating the change in model perplexity over samples of text.
- Results are similar for other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at the last subject token.