Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Training data attribution (TDA) methods trace a model’s prediction back to specific influential training examples.
- Existing approaches assign a scalar influence score to each training example, assuming influence is additive.
- Simfluence is a new paradigm for TDA which produces a training run simulator to study non-additive interactions.
- Simfluence-Linear captures non-additive interactions and predicts the spiky trajectory of individual example losses.
- Experiments show Simfluence-Linear predicts loss trajectories with higher accuracy than existing TDA methods.
Paper Content
Introduction
- Important advances in machine learning are often made possible by better or more data
- We compare true observed loss trajectories with predicted trajectories
- Many of the ups and downs in the true loss trajectories can be anticipated
- Amount of change in loss on z test is called influence of z on z test
- Common assumption of influence being additive does not capture non-additive aspects of real training
- Redundancy between two examples is a special case of submodular interaction
- Proposed Simfluence paradigm for TDA where output is a training run simulator
- Simfluence-Linear models non-additive effects and outperforms existing methods
Task
- Simfluence is the task of training run simulation
- A training run simulator takes two inputs: the curriculum and the initial loss before training begins
- The output is a predicted loss for every time step
- There is a trade-off between simulation accuracy and speed
- A useful simulator should be orders of magnitude faster to run
Training a simulator
- Training runs are used to learn a simulator
- Each run provides an (input, output) pair for training
- Previous runs are recorded and used to predict future loss trajectories
- Early work assumed access to only one run, more recent work uses multiple runs
Evaluating a simulator
- Evaluate a simulator by comparing predicted losses with true observed losses
- Measure accuracy using mean squared error
- Evaluate simulators that don’t provide good absolute prediction using Spearman’s correlation
- Goal of TDA is to match real observed training runs
Approach
Simulator
- Models the loss trajectory of a single test example as a linear Markov process
- Loss at time t is a linear function of the training examples consumed at that step
- Multiplicative factor α(c t ) and additive factor β(c t ) are learned functions of c t
- Multiplicative factor α(c t ) enables modeling of redundancy between training examples and the effect of training order
- Simfluence-Linear, Simfluence-Additive and Simfluence-Multiplicative are introduced
- Simfluence-Additive models additive influence, but not multiplicative influence
Learning the simulator
- Objective is to minimize L2-regularized regression
- When batch size is 1, objective reduces to univariate linear regression problem with closed form solution
Data requirements
- Simfluence-Linear requires 2n data points for a unique solution.
- 20n to 60n training steps are sufficient for Simfluence-Linear.
- Computational cost is the limiting factor for Simfluence-Linear.
Generality beyond gradient-based learning
- Simfluence-Linear is a method to simulate gradient descent on a loss function.
- It does not require access to model gradients or parameters.
- It can be used to predict the trajectory of any arbitrary metric.
- It is designed to simulate algorithms that incrementally consume examples.
- It connects to research on credit assignment.
Connections to prior tda methods
- TracIn (Pruthi et al., 2020) and influence functions (Koh & Liang, 2017) can be viewed as simulators under the Simfluence paradigm
- TracIn-Ideal is equivalent to Simfluence-Additive, a simple simulator that only models additive influence
- Expected TracIn-Ideal is encompassed by Simfluence-Additive
- TracIn-CP is a computationally cheaper approximation to TracIn-Ideal
- TracIn-CP makes two approximations: summing over steps where model checkpoints are saved and replacing the actual observed loss reduction with a hypothetical loss reduction
- Hypothetical losses can be incorporated into Simfluence-Linear and Simfluence-Additive
- TracIn-CP is equivalent to Hypothetical Simfluence-Additive and can be viewed as a purely additive simulator
Influence functions
- Influence functions model the influence of example zi on example z.
- Simfluence and influence functions are connected by hypothetical training steps.
- Second-order gradient descent is used in the Taylor series approximation.
- Influence functions are equivalent to Hypothetical Simfluence-Additive when using second-order hypothetical losses.
Experiments
- Proposed approach (Simfluence-Linear) and prior TDA methods used as training run simulators
- Evaluated and compared simulation accuracy using two metrics: all-steps MSE and final-step Spearman’s ρ
- Simulated large language model (LLM) fine-tuning on three datasets: RTE, COPA, and Winogrande
- Two LLM fine-tuning methods: standard full-model tuning and parameter-efficient tuning
- 32 training runs, 20 for fitting simulator parameters and 2 as validation set
- TracIn-CP uses approximation L and requires computing and storing model gradients for every example at every checkpoint
- Simfluence-Linear outperforms TracIn-CP, even when optimally rescaled
- Simfluence-Linear outperforms Simfluence-Additive and Simfluence-Multiplicative
- Simfluence-Linear best at simulating T0 IA3 fine-tuning, does a little worse on T0 full-model tuning, and the least well on T5-LMA full-model tuning
Related work
- Φ(z) and Ψ(z) are neural network encoders that map examples to a low-dimensional vector
- Any kind of learned model can be used to parameterize both α(c t ) and β(c t)
- Simfluence-Linear cannot simulate training runs that include examples which have never been seen before
- Simfluence-Linear oversimplifies the real dynamics of training
- Simfluence-Additive fits a model to actual observed losses, while TracIn-CP uses hypothetical losses
- Simfluence-Additive requires one loss reduction per model parameter
- TDA methods seek to identify which training examples caused the biggest change in the model’s loss on a test example
- Simfluence is a new paradigm for TDA
- Simfluence-Additive outperforms all TracIn variants
- Simfluence is closely related to other TDA methods
- Simfluence is based on the full loss trajectory over the course of training
- Simfluence has been applied to explain predictions and identify data artifacts in NLP tasks
- Simfluence is used as a heuristic to identify training examples to remove in order to flip a target prediction