Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language models learn a lot of information during pretraining
Facts can be changed by editing weights in locations other than those where localization methods suggest the facts are stored
Past work on model editing methods relies on Causal Tracing to select which model layers to edit
Experiments show that editing performance is largely unrelated to localization results from representation denoising (Causal Tracing)
Better understanding of language models may not always translate to insights about how to best change their behavior

Paper Content

Introduction

Language models learn facts during pretraining
Recent work explores how facts are stored in model weights
Model editing methods can be used to inject new facts into model weights
The connection between localization and editing rests on the assumption that one should edit a model by first localizing a behavior to a component and then editing that component
Causal Tracing measures information content of hidden representations and is used to motivate ROME and MEMIT model-editing methods
Correlation between Causal Tracing results and edit success is near zero
For editing, it is better to ignore tracing results and always choose an early-to-mid-layer MLP weight
Tracing effects explain only a small fraction of variance in editing performance

Localization methods focus on model components such as layers, neurons, and weight matrices
MLP layers are studied for their role in factual association
Localization is validated by editing neuron activations or layer weights
Successfully editing a suggested location does not show that editing there is necessary or the best option
Recent work has looked into editing success across layers
Investigating edit success at the datapoint level reveals unexpected results

Notation and background

Data notation

Consider facts of the form (s, r, o)
Prompt P for some fact (s, r, o)
Variations of the data for the fact (s, r, o)
s* is a “neighboring” entity to the subject s
r* is a paraphrase of the relation r
s_noise is a noised representation of the subject s
o_false is an object that incorrectly completes the tuple (s, r, •)
o_true is the object that correctly completes the fact (s, r, •)

Causal tracing

Causal Tracing is a method for localizing information in the forward pass of an autoregressive Transformer.
It estimates the amount of information about a fact contained in each of the representations produced by the forward pass.
A tracing window size of 5 is used by default, and it is applied exclusively to MLP layers.
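The noise-then-restore idea behind Causal Tracing can be illustrated with a toy example. The sketch below is not the paper's implementation: the scalar residual network, the names `forward`, `window_layers`, and `tracing_effect`, and the sigmoid stand-in for the probability of o_true are all assumptions made for illustration. It runs a clean pass, a pass with a corrupted input, and a corrupted pass in which clean MLP outputs are restored within a window of layers, then reports how much probability the restoration recovers.

```python
import math

def forward(x, weights, restore=None):
    """Toy residual network: h <- h + tanh(w * h) at every layer.

    `restore` maps a layer index to an MLP output that is patched in
    instead of being recomputed (hypothetical stand-in for restoring
    clean MLP representations).
    """
    restore = restore or {}
    h, mlp_outputs = x, []
    for layer, w in enumerate(weights):
        out = restore.get(layer, math.tanh(w * h))
        mlp_outputs.append(out)
        h = h + out
    prob = 1.0 / (1.0 + math.exp(-h))  # stand-in for p(o_true | prompt)
    return prob, mlp_outputs

def window_layers(center, window_size, num_layers):
    """Layers in a window centered on `center`, clipped to the model depth."""
    half = window_size // 2
    return [l for l in range(center - half, center + half + 1)
            if 0 <= l < num_layers]

def tracing_effect(x_clean, x_noised, weights, center, window_size=5):
    """Probability recovered by restoring clean MLP outputs in the window."""
    _, clean_outputs = forward(x_clean, weights)
    p_noised, _ = forward(x_noised, weights)
    patch = {l: clean_outputs[l]
             for l in window_layers(center, window_size, len(weights))}
    p_restored, _ = forward(x_noised, weights, restore=patch)
    return p_restored - p_noised

if __name__ == "__main__":
    toy_weights = [0.5] * 8  # assumed toy depth of 8 layers
    print(tracing_effect(1.0, -1.0, toy_weights, center=2))
```

A real model has one representation per token and per layer, so the same computation would be repeated for each token position; the toy collapses that to a single scalar state.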

Model editing with ROME

ROME editing method is described
Additional editing methods are outlined
Mathematical detail can be found in Meng et al. (2022a)
Desirable model edit is described in Sec. 3.4
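ROME (Rank-One Model Editing) changes a single MLP weight matrix by a rank-one update. A minimal sketch of that update, assuming the closed form reported in Meng et al. (2022a): W_new = W + lam (C^-1 k)^T, with lam = (v - W k) / ((C^-1 k)^T k). Here k is a key vector for the subject, v is the value vector that should be returned for it, and C is a covariance estimate of keys. The function names and the tiny 2x2 example are hypothetical, and how k and v are obtained is not covered here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rank_one_edit(W, C_inv, k_star, v_star):
    """Return W + lam (C^-1 k*)^T so that the edited matrix maps k* to v*.

    W      : weight matrix as a list of rows (out_dim x in_dim)
    C_inv  : inverse key covariance (in_dim x in_dim), assumed given
    k_star : key vector for the fact being edited (in_dim)
    v_star : desired value vector for that key (out_dim)
    """
    u = matvec(C_inv, k_star)                       # C^-1 k*
    denom = sum(a * b for a, b in zip(u, k_star))   # (C^-1 k*)^T k*
    residual = [v - wk for v, wk in zip(v_star, matvec(W, k_star))]
    lam = [r / denom for r in residual]
    return [[W[i][j] + lam[i] * u[j] for j in range(len(u))]
            for i in range(len(W))]

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    C_inv = [[2.0, 0.0], [0.0, 1.0]]
    k_star, v_star = [1.0, 2.0], [3.0, -1.0]
    W_new = rank_one_edit(W, C_inv, k_star, v_star)
    print(matvec(W_new, k_star))  # matches v_star up to rounding
```

The update is constructed so that the edited matrix returns the new value for the chosen key, which is the property the usage example checks.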

Editing metrics

Evaluate editing methods based on ability to change model prediction, generalize to paraphrases, and avoid over-generalizing
Use metrics from CounterFact data and normalize scores to make them comparable
Rewrite score measures how much edit improves target probability
Paraphrase score measures target probability using syntactical paraphrases
Neighborhood score measures whether edits change predictions for similar prompts
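The three metrics can be sketched as simple functions of model probabilities before and after an edit. The normalizations below are assumptions chosen so that each score lies on a comparable 0-to-1 scale (gain divided by the maximum possible gain, and change divided by the maximum possible change); the function names are hypothetical and the exact formulas should be checked against the paper.

```python
def rewrite_score(p_before, p_after):
    """Gain in target probability, as a fraction of the maximum possible gain.

    Assumed normalization: (p_after - p_before) / (1 - p_before).
    """
    if p_before >= 1.0:
        return 0.0  # no room for improvement
    return (p_after - p_before) / (1.0 - p_before)

def paraphrase_score(p_before_list, p_after_list):
    """Mean rewrite score over paraphrased prompts for the same fact."""
    scores = [rewrite_score(b, a) for b, a in zip(p_before_list, p_after_list)]
    return sum(scores) / len(scores)

def neighborhood_score(p_before, p_after):
    """1 minus the normalized change in a neighboring prompt's prediction.

    Assumed normalization: the largest change possible from p_before is
    0.5 + |p_before - 0.5|, so an unchanged prediction scores 1.
    """
    max_change = 0.5 + abs(p_before - 0.5)
    return 1.0 - abs(p_after - p_before) / max_change

if __name__ == "__main__":
    print(rewrite_score(0.2, 1.0))
    print(paraphrase_score([0.2, 0.4], [0.6, 0.7]))
    print(neighborhood_score(0.3, 0.3))
```

Under these assumptions a perfect edit scores 1 on all three metrics: the target probability reaches 1 on the prompt and its paraphrases, and predictions for neighboring prompts do not move.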

Does edit success follow from localization?

A common assumption is that localization results should inform editing methods.
Knowing where information is stored in a model can help manipulate the model’s expression of that information.
Editing is used to verify the quality of localization analysis.
Investigating the validity of this assumption as it applies to autoregressive Transformers suggests that edit success is largely unrelated to localization results.

Experiment design

Goal of experiments is to determine if edit success aligns with results from Causal Tracing
Edit Success is measured by Rewrite Score
Tracing effect is taken as the max across token effects
Tracing window size is 5
Results are also tested with other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at last subject token
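Reducing per-token tracing effects to a single number per layer can be sketched as follows. The function name, the `mode` argument, and the example values are hypothetical; the two modes correspond to the default described above (max across token effects) and the alternative check (tracing effect at the last subject token).

```python
def tracing_effect_for_layer(effects_by_token, mode="max", last_subject_index=None):
    """Collapse per-token tracing effects at one layer into a single value.

    effects_by_token   : tracing effect for each token position at this layer
    mode               : "max" (default) or "last_subject"
    last_subject_index : token position of the last subject token,
                         required when mode == "last_subject"
    """
    if mode == "max":
        return max(effects_by_token)
    if mode == "last_subject":
        if last_subject_index is None:
            raise ValueError("last_subject_index is required for this mode")
        return effects_by_token[last_subject_index]
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    effects = [0.02, 0.31, 0.12, 0.05]  # made-up values for four tokens
    print(tracing_effect_for_layer(effects))
    print(tracing_effect_for_layer(effects, "last_subject", last_subject_index=2))
```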

Model and data

GPT-J is a 6 billion parameter autoregressive language model
Experiments conducted using CounterFact dataset
Editing performance recorded at 8 layers
ROME achieves an average rewrite score of 99% at layer 6 and above 96% at layers besides layer 28
CounterFact dataset includes datapoints consisting of a prompt, paraphrases, and neighboring points
10% of CounterFact dataset used for experiments, filtered to a subset of facts correctly completed by GPT-J
Final sample size is 652

Experiment results

Tracing effects explain at most 3.2% of the variance in edit success
By comparison, the choice of edit layer explains 58.5% of the variance in the outcome on average
Tracing effects are weakly informative of Fact Forcing editing
Tracing effects explain 3% more of the variance in edit success for Fact Forcing
Tracing effects are largely unrelated to editing success
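One way to quantify "variance explained" is to compare the R² of a regression that uses only the edit layer with one that also uses the tracing effect. The sketch below assumes a linear model with the layer as a categorical variable and a single shared slope for the tracing effect; it fits that model by demeaning within each layer, which is equivalent to ordinary least squares with one intercept per layer. The function name and the example data are hypothetical, and the paper's exact regression setup may differ.

```python
from collections import defaultdict

def r2_layer_vs_layer_plus_tracing(layers, tracing, success):
    """Return (R2 with layer only, R2 with layer + tracing effect, difference)."""
    groups = defaultdict(list)
    for i, layer in enumerate(layers):
        groups[layer].append(i)

    n = len(success)
    grand_mean = sum(success) / n
    sst = sum((y - grand_mean) ** 2 for y in success)  # total sum of squares

    # Residuals after removing per-layer means (the layer-only model).
    y_res, x_res = [0.0] * n, [0.0] * n
    for idx in groups.values():
        y_mean = sum(success[i] for i in idx) / len(idx)
        x_mean = sum(tracing[i] for i in idx) / len(idx)
        for i in idx:
            y_res[i] = success[i] - y_mean
            x_res[i] = tracing[i] - x_mean

    ssr_layer = sum(r * r for r in y_res)

    # Shared slope for the tracing effect, fit on the within-layer residuals.
    sxx = sum(x * x for x in x_res)
    sxy = sum(x * y for x, y in zip(x_res, y_res))
    slope = sxy / sxx if sxx > 0 else 0.0
    ssr_both = sum((y - slope * x) ** 2 for x, y in zip(x_res, y_res))

    r2_layer = 1.0 - ssr_layer / sst
    r2_both = 1.0 - ssr_both / sst
    return r2_layer, r2_both, r2_both - r2_layer

if __name__ == "__main__":
    layers = [1, 1, 5, 5]              # edit layer per datapoint (made up)
    tracing = [0.0, 1.0, 0.0, 1.0]     # tracing effect at that layer (made up)
    success = [0.8, 1.0, 0.4, 0.6]     # rewrite score per datapoint (made up)
    print(r2_layer_vs_layer_plus_tracing(layers, tracing, success))
```

The difference between the two R² values is the additional share of variance attributable to tracing effects once the edit layer is accounted for.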

Reconciling localization and editing

Propose variants of the model editing problem that are more closely related to insights from tracing analysis
Repeat and extend analysis from previous section for all editing problems

Editing problem variants

Fig. 5 summarizes editing problems
Solutions to each problem are evaluated using Rewrite Score, Paraphrase Score, and Neighborhood Score metrics

Experiment design and additional edit methods

Experiment uses 4 different methods to edit MLP layers
ROME edits a single layer; MEMIT spreads its update over several layers; Constrained Finetuning with window size 1 edits a single layer; Constrained Finetuning with window size 5 edits five adjacent layers

Discussion

Causal Tracing is not indicative of which layer to select for model editing
Causal Tracing has helped reveal the role of early-to-mid-range MLP representations in factual association
ROME performs better on average when optimizing the last subject token representation
Editing MLP weights is preferable to editing other weights with large Causal Tracing effects
Information is gradually accumulated across layers in a Transformer forward pass
It is possible to “override” the information in a layer with an edit to another layer
Causal Tracing answers a different question than model editing does

Conclusion

Model edit success is essentially unrelated to where factual information is stored in models, as measured by Causal Tracing.
Four variants of the Error Injection problem are introduced using the CounterFact dataset.
Edit success and tracing effects correlate best in the Fact Forcing setting.
Tracing effects explain only a small fraction of the variance in editing performance.
Better understanding of how pretrained language models work may not always translate to insights about how to best change their behavior.
Data is filtered to a subset of facts correctly completed by GPT-J.
Edit methods are tuned to have high rewrite scores.
Tracing effects are not predictive of editing performance even when facts appear to be stored in a small number of layers.
Essence drift is measured by calculating the change in model perplexity over samples of text.
Results are similar for other measures of edit success, different tracing window sizes, GPT2-XL, unscaled metrics, and tracing effect at the last subject token.
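The essence drift measurement mentioned above depends on perplexity, which can be computed from per-token log-probabilities. The sketch below assumes that the "change" is the mean difference in perplexity (after minus before) over text samples; the function names and the choice of a difference rather than a ratio are assumptions, not details confirmed by the source.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def essence_drift(log_probs_before, log_probs_after):
    """Mean change in perplexity across text samples.

    Each argument is a list of samples; each sample is a list of per-token
    log-probabilities from the model before or after the edit.
    """
    diffs = [perplexity(after) - perplexity(before)
             for before, after in zip(log_probs_before, log_probs_after)]
    return sum(diffs) / len(diffs)

if __name__ == "__main__":
    before = [[math.log(0.5)] * 4]   # made-up sample: every token has p = 0.5
    after = [[math.log(0.25)] * 4]   # after the edit: every token has p = 0.25
    print(essence_drift(before, after))
```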
