Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Training data attribution (TDA) methods trace a model’s prediction back to specific influential training examples.
Existing approaches assign a scalar influence score to each training example, assuming influence is additive.
Simfluence is a new paradigm for TDA which produces a training run simulator to study non-additive interactions.
Simfluence-Linear captures non-additive interactions and predicts the spiky trajectory of individual example losses.
Experiments show Simfluence-Linear predicts loss trajectories with higher accuracy than existing TDA methods.

Paper Content

Introduction

Important advances in machine learning are often made possible by better or more data
We compare true observed loss trajectories with predicted trajectories
Many of the ups and downs in the true loss trajectories can be anticipated
Amount of change in loss on z test is called influence of z on z test
Common assumption of influence being additive does not capture non-additive aspects of real training
Redundancy between two examples is a special case of submodular interaction
Proposed Simfluence paradigm for TDA where output is a training run simulator
Simfluence-Linear models non-additive effects and outperforms existing methods

Task

Simfluence is the task of training run simulation
A training run simulator takes two inputs: the curriculum and the initial loss before training begins
The output is a predicted loss for every time step
There is a trade-off between simulation accuracy and speed
A useful simulator should be orders of magnitude faster to run

Training a simulator

Training runs are used to learn a simulator
Each run provides an (input, output) pair for training
Previous runs are recorded and used to predict future loss trajectories
Early work assumed access to only one run, more recent work uses multiple runs

Evaluating a simulator

Evaluate a simulator by comparing predicted losses with true observed losses
Measure accuracy using mean squared error
Evaluate simulators that don’t provide good absolute prediction using Spearman’s correlation
Goal of TDA is to match real observed training runs

Approach

Simulator

Models the loss trajectory of a single test example as a linear Markov process
Loss at time t is a linear function of the training examples consumed at that step
Multiplicative factor α(c t ) and additive factor β(c t ) are learned functions of c t
Multiplicative factor α(c t ) enables modeling of redundancy between training examples and the effect of training order
Simfluence-Linear, Simfluence-Additive and Simfluence-Multiplicative are introduced
Simfluence-Additive models additive influence, but not multiplicative influence

Learning the simulator

Objective is to minimize L2-regularized regression
When batch size is 1, objective reduces to univariate linear regression problem with closed form solution

Data requirements

Simfluence-Linear requires 2n data points for a unique solution.
20n to 60n training steps are sufficient for Simfluence-Linear.
Computational cost is the limiting factor for Simfluence-Linear.

Generality beyond gradient-based learning

Simfluence-Linear is a method to simulate gradient descent on a loss function.
It does not require access to model gradients or parameters.
It can be used to predict the trajectory of any arbitrary metric.
It is designed to simulate algorithms that incrementally consume examples.
It connects to research on credit assignment.

Connections to prior tda methods

TracIn (Pruthi et al., 2020) and influence functions (Koh & Liang, 2017) can be viewed as simulators under the Simfluence paradigm
TracIn-Ideal is equivalent to Simfluence-Additive, a simple simulator that only models additive influence
Expected TracIn-Ideal is encompassed by Simfluence-Additive
TracIn-CP is a computationally cheaper approximation to TracIn-Ideal
TracIn-CP makes two approximations: summing over steps where model checkpoints are saved and replacing the actual observed loss reduction with a hypothetical loss reduction
Hypothetical losses can be incorporated into Simfluence-Linear and Simfluence-Additive
TracIn-CP is equivalent to Hypothetical Simfluence-Additive and can be viewed as a purely additive simulator

Influence functions

Influence functions model the influence of example zi on example z.
Simfluence and influence functions are connected by hypothetical training steps.
Second-order gradient descent is used in the Taylor series approximation.
Influence functions are equivalent to Hypothetical Simfluence-Additive when using second-order hypothetical losses.

Experiments

Proposed approach (Simfluence-Linear) and prior TDA methods used as training run simulators
Evaluated and compared simulation accuracy using two metrics: all-steps MSE and final-step Spearman’s ρ
Simulated large language model (LLM) fine-tuning on three datasets: RTE, COPA, and Winogrande
Two LLM fine-tuning methods: standard full-model tuning and parameter-efficient tuning
32 training runs, 20 for fitting simulator parameters and 2 as validation set
TracIn-CP uses approximation L and requires computing and storing model gradients for every example at every checkpoint
Simfluence-Linear outperforms TracIn-CP, even when optimally rescaled
Simfluence-Linear outperforms Simfluence-Additive and Simfluence-Multiplicative
Simfluence-Linear best at simulating T0 IA3 fine-tuning, does a little worse on T0 full-model tuning, and the least well on T5-LMA full-model tuning

Φ(z) and Ψ(z) are neural network encoders that map examples to a low-dimensional vector
Any kind of learned model can be used to parameterize both α(c t ) and β(c t)
Simfluence-Linear cannot simulate training runs that include examples which have never been seen before
Simfluence-Linear oversimplifies the real dynamics of training
Simfluence-Additive fits a model to actual observed losses, while TracIn-CP uses hypothetical losses
Simfluence-Additive requires one loss reduction per model parameter
TDA methods seek to identify which training examples caused the biggest change in the model’s loss on a test example
Simfluence is a new paradigm for TDA
Simfluence-Additive outperforms all TracIn variants
Simfluence is closely related to other TDA methods
Simfluence is based on the full loss trajectory over the course of training
Simfluence has been applied to explain predictions and identify data artifacts in NLP tasks
Simfluence is used as a heuristic to identify training examples to remove in order to flip a target prediction

Link to paper#

Abstract#

Paper Content#

Introduction#

Task#

Training a simulator#

Evaluating a simulator#

Approach#

Simulator#

Learning the simulator#

Data requirements#

Generality beyond gradient-based learning#

Connections to prior tda methods#

Influence functions#

Experiments#

Related work#

Link to paper

Abstract

Paper Content

Introduction

Task

Training a simulator

Evaluating a simulator

Approach

Simulator

Learning the simulator

Data requirements

Generality beyond gradient-based learning

Connections to prior tda methods

Influence functions

Experiments

Related work