Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Training data attribution (TDA) methods trace a model’s prediction back to specific influential training examples.
  • Existing approaches assign a scalar influence score to each training example, assuming influence is additive.
  • Simfluence is a new paradigm for TDA which produces a training run simulator to study non-additive interactions.
  • Simfluence-Linear captures non-additive interactions and predicts the spiky trajectory of individual example losses.
  • Experiments show Simfluence-Linear predicts loss trajectories with higher accuracy than existing TDA methods.

Paper Content

Introduction

  • Important advances in machine learning are often made possible by better or more data
  • We compare true observed loss trajectories with predicted trajectories
  • Many of the ups and downs in the true loss trajectories can be anticipated
  • Amount of change in loss on z test is called influence of z on z test
  • Common assumption of influence being additive does not capture non-additive aspects of real training
  • Redundancy between two examples is a special case of submodular interaction
  • Proposed Simfluence paradigm for TDA where output is a training run simulator
  • Simfluence-Linear models non-additive effects and outperforms existing methods

Task

  • Simfluence is the task of training run simulation
  • A training run simulator takes two inputs: the curriculum and the initial loss before training begins
  • The output is a predicted loss for every time step
  • There is a trade-off between simulation accuracy and speed
  • A useful simulator should be orders of magnitude faster to run

Training a simulator

  • Training runs are used to learn a simulator
  • Each run provides an (input, output) pair for training
  • Previous runs are recorded and used to predict future loss trajectories
  • Early work assumed access to only one run, more recent work uses multiple runs

Evaluating a simulator

  • Evaluate a simulator by comparing predicted losses with true observed losses
  • Measure accuracy using mean squared error
  • Evaluate simulators that don’t provide good absolute prediction using Spearman’s correlation
  • Goal of TDA is to match real observed training runs

Approach

Simulator

  • Models the loss trajectory of a single test example as a linear Markov process
  • Loss at time t is a linear function of the training examples consumed at that step
  • Multiplicative factor α(c t ) and additive factor β(c t ) are learned functions of c t
  • Multiplicative factor α(c t ) enables modeling of redundancy between training examples and the effect of training order
  • Simfluence-Linear, Simfluence-Additive and Simfluence-Multiplicative are introduced
  • Simfluence-Additive models additive influence, but not multiplicative influence

Learning the simulator

  • Objective is to minimize L2-regularized regression
  • When batch size is 1, objective reduces to univariate linear regression problem with closed form solution

Data requirements

  • Simfluence-Linear requires 2n data points for a unique solution.
  • 20n to 60n training steps are sufficient for Simfluence-Linear.
  • Computational cost is the limiting factor for Simfluence-Linear.

Generality beyond gradient-based learning

  • Simfluence-Linear is a method to simulate gradient descent on a loss function.
  • It does not require access to model gradients or parameters.
  • It can be used to predict the trajectory of any arbitrary metric.
  • It is designed to simulate algorithms that incrementally consume examples.
  • It connects to research on credit assignment.

Connections to prior tda methods

  • TracIn (Pruthi et al., 2020) and influence functions (Koh & Liang, 2017) can be viewed as simulators under the Simfluence paradigm
  • TracIn-Ideal is equivalent to Simfluence-Additive, a simple simulator that only models additive influence
  • Expected TracIn-Ideal is encompassed by Simfluence-Additive
  • TracIn-CP is a computationally cheaper approximation to TracIn-Ideal
  • TracIn-CP makes two approximations: summing over steps where model checkpoints are saved and replacing the actual observed loss reduction with a hypothetical loss reduction
  • Hypothetical losses can be incorporated into Simfluence-Linear and Simfluence-Additive
  • TracIn-CP is equivalent to Hypothetical Simfluence-Additive and can be viewed as a purely additive simulator

Influence functions

  • Influence functions model the influence of example zi on example z.
  • Simfluence and influence functions are connected by hypothetical training steps.
  • Second-order gradient descent is used in the Taylor series approximation.
  • Influence functions are equivalent to Hypothetical Simfluence-Additive when using second-order hypothetical losses.

Experiments

  • Proposed approach (Simfluence-Linear) and prior TDA methods used as training run simulators
  • Evaluated and compared simulation accuracy using two metrics: all-steps MSE and final-step Spearman’s ρ
  • Simulated large language model (LLM) fine-tuning on three datasets: RTE, COPA, and Winogrande
  • Two LLM fine-tuning methods: standard full-model tuning and parameter-efficient tuning
  • 32 training runs, 20 for fitting simulator parameters and 2 as validation set
  • TracIn-CP uses approximation L and requires computing and storing model gradients for every example at every checkpoint
  • Simfluence-Linear outperforms TracIn-CP, even when optimally rescaled
  • Simfluence-Linear outperforms Simfluence-Additive and Simfluence-Multiplicative
  • Simfluence-Linear best at simulating T0 IA3 fine-tuning, does a little worse on T0 full-model tuning, and the least well on T5-LMA full-model tuning
  • Φ(z) and Ψ(z) are neural network encoders that map examples to a low-dimensional vector
  • Any kind of learned model can be used to parameterize both α(c t ) and β(c t)
  • Simfluence-Linear cannot simulate training runs that include examples which have never been seen before
  • Simfluence-Linear oversimplifies the real dynamics of training
  • Simfluence-Additive fits a model to actual observed losses, while TracIn-CP uses hypothetical losses
  • Simfluence-Additive requires one loss reduction per model parameter
  • TDA methods seek to identify which training examples caused the biggest change in the model’s loss on a test example
  • Simfluence is a new paradigm for TDA
  • Simfluence-Additive outperforms all TracIn variants
  • Simfluence is closely related to other TDA methods
  • Simfluence is based on the full loss trajectory over the course of training
  • Simfluence has been applied to explain predictions and identify data artifacts in NLP tasks
  • Simfluence is used as a heuristic to identify training examples to remove in order to flip a target prediction