Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Deep learning has led to the curation of many datasets
  • Training parameter-hungry models on large datasets poses problems
  • Data distillation approaches aim to create data summaries
  • Formal framework and taxonomy of existing approaches presented
  • Data distillation approaches for images, graphs, and user-item interactions discussed
  • Current challenges and future research directions identified

Paper Content


  • Data distillation is a task that aims to create tiny, high-fidelity summaries of data.
  • The scale-is-everything viewpoint argues that bigger models and datasets are key for advancing AI.
  • Data distillation is presented as a more tractable and faster route to progress than scaling models and datasets alone.
  • Data distillation leads to cost savings, faster research iterations, and improved eco-sustainability.
  • Data distillation also democratizes the pipeline and accelerates other procedures.

The data distillation framework

  • Data distillation techniques aim to synthesize a high-fidelity data summary from a given dataset.
  • The dataset consists of input features (x) and desired labels (y).
  • The data summary is constrained by a data budget (n), its number of data points, which is much smaller than the size of the original dataset.
  • A twice-differentiable cost function (l) is used to optimize the data summary.
  • Existing data distillation techniques solve a bilevel optimization problem.
  • The data summary is optimized through gradient descent.
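
The bilevel problem mentioned in the list above can be written compactly as follows. This is a sketch in notation assumed here rather than quoted from the paper: $\Phi_\theta$ is a model with parameters $\theta$, $\mathcal{D}$ is the original dataset, $\mathcal{D}_{\mathrm{syn}}$ is the data summary of size $n$, and $l$ is the twice-differentiable cost function.

```latex
\begin{align}
\mathcal{L}_{\mathcal{D}}(\theta)
  &= \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[\, l(\Phi_\theta(x), y) \,\big] \\
\mathcal{D}_{\mathrm{syn}}^{*}
  &= \arg\min_{\mathcal{D}_{\mathrm{syn}},\; |\mathcal{D}_{\mathrm{syn}}| = n}
     \mathcal{L}_{\mathcal{D}}\big(\theta^{\mathcal{D}_{\mathrm{syn}}}\big)
  \quad \text{s.t.} \quad
  \theta^{\mathcal{D}_{\mathrm{syn}}} = \arg\min_{\theta} \mathcal{L}_{\mathcal{D}_{\mathrm{syn}}}(\theta)
\end{align}
```

The inner problem trains a model on the summary; the outer problem updates the summary so that this model performs well on the original data. Twice-differentiability of $l$ matters because the outer gradient differentiates through the inner optimization.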

Data distillation by meta-model matching

  • Meta-model matching-based data distillation approaches optimize for the transferability of models trained on the data summary to the original dataset
  • A simplifying assumption is that a perfect classifier exists and can be estimated
  • The TBPTT (truncated backpropagation through time) framework has drawbacks: it is computationally expensive, truncation introduces bias, and its loss landscapes are poorly conditioned
  • Neural Tangent Kernel (NTK) based algorithms solve the inner loop in closed form
  • KIP uses the NTK of a neural network in the inner loop of the bilevel objective
  • RFAD uses a lightweight empirical Neural Network Gaussian Process (NNGP) kernel and a classification loss for the outer loop
  • FRePO decouples the feature extractor and the linear classifier, and optimizes the data summary and the feature extractor
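
The closed-form inner loop used by the kernel-based methods above can be illustrated with kernel ridge regression. The following is a minimal sketch, not the KIP implementation: it swaps in an RBF kernel for the NTK, works on scalar inputs, and estimates the outer gradient with finite differences instead of automatic differentiation. All function names (`rbf_kernel`, `solve`, `outer_loss`, `distill`) and the toy data are assumptions made for illustration.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix between two lists of scalars (stand-in for an NTK)."""
    return [[math.exp(-gamma * (u - v) ** 2) for v in b] for u in a]

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def outer_loss(x_syn, y_syn, x_real, y_real, ridge=1e-3):
    """Inner loop: fit kernel ridge regression on the summary in closed form.
    Outer objective: mean squared error of that fit on the real data."""
    K_ss = rbf_kernel(x_syn, x_syn)
    for i in range(len(x_syn)):
        K_ss[i][i] += ridge
    alpha = solve(K_ss, y_syn)
    K_rs = rbf_kernel(x_real, x_syn)
    preds = [sum(k * a for k, a in zip(row, alpha)) for row in K_rs]
    return sum((p - y) ** 2 for p, y in zip(preds, y_real)) / len(y_real)

def distill(x_real, y_real, x_syn, y_syn, steps=200, lr=0.1, eps=1e-5):
    """Finite-difference descent on the outer loss over summary inputs and labels.
    A step is kept only if it lowers the loss; otherwise the step size is halved."""
    n = len(x_syn)
    params = list(x_syn) + list(y_syn)

    def loss(p):
        return outer_loss(p[:n], p[n:], x_real, y_real)

    current = loss(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            up, down = params[:], params[:]
            up[i] += eps
            down[i] -= eps
            grad.append((loss(up) - loss(down)) / (2 * eps))
        candidate = [p - lr * g for p, g in zip(params, grad)]
        candidate_loss = loss(candidate)
        if candidate_loss < current:
            params, current = candidate, candidate_loss
        else:
            lr *= 0.5
    return params[:n], params[n:], current

# Toy "original dataset": 41 samples of sin(x); the summary holds 3 points.
x_real = [i / 10 for i in range(-20, 21)]
y_real = [math.sin(x) for x in x_real]
initial_loss = outer_loss([-1.5, 0.0, 1.5], [0.0, 0.0, 0.0], x_real, y_real)
x_syn, y_syn, final_loss = distill(x_real, y_real, [-1.5, 0.0, 1.5], [0.0, 0.0, 0.0])
print(f"outer loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

Because the inner fit is a single linear solve, no unrolled training loop has to be differentiated, which is the property that lets the kernel-based methods avoid the TBPTT drawbacks listed above.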

Data distillation by gradient matching

  • Gradient-matching based data distillation performs one-step distance matching between a network trained on the target dataset and the same network trained on the data summary
  • Optimization is more efficient than in the meta-model matching framework
  • Data summaries optimized by gradient matching outperform heuristic data samplers, principled coreset construction techniques, and TBPTT-based data distillation
  • The underlying optimization objective is to minimize the distance between the parameters of the two networks
  • DSA improves on this by applying image augmentations to both datasets
  • DCC incorporates class-contrastive signals inside each gradient-matching step
  • IDC extends the gradient-matching framework with multi-formation and by matching gradients along the network’s training trajectory over the full dataset
  • TESLA re-parameterizes the parameter-matching loss and uses learnable soft labels
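
The gradient-matching idea above can be sketched on a linear regression model, where gradients are cheap to write down by hand. This is an illustrative toy under several assumptions: the model is y = w*x + b with a mean-squared-error loss, the distance is a squared Euclidean distance, the parameters at which gradients are compared are sampled at random rather than taken from a training trajectory, and the summary is updated with finite-difference gradients. All names and data are hypothetical.

```python
import random

def mse_gradient(data, w, b):
    """Gradient of the mean squared error of y = w*x + b with respect to (w, b)."""
    n = len(data)
    g_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    g_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return g_w, g_b

def matching_loss(real, syn, thetas):
    """Squared distance between the gradient on the real data and the gradient
    on the summary, summed over the sampled model parameters."""
    total = 0.0
    for w, b in thetas:
        g_real = mse_gradient(real, w, b)
        g_syn = mse_gradient(syn, w, b)
        total += sum((r - s) ** 2 for r, s in zip(g_real, g_syn))
    return total

def distill(real, syn, thetas, steps=300, lr=0.05, eps=1e-5):
    """Finite-difference descent on the matching loss. A step is kept only if
    it lowers the loss; otherwise the step size is halved."""
    flat = [v for point in syn for v in point]

    def loss(params):
        pairs = list(zip(params[0::2], params[1::2]))
        return matching_loss(real, pairs, thetas)

    current = loss(flat)
    for _ in range(steps):
        grad = []
        for i in range(len(flat)):
            up, down = flat[:], flat[:]
            up[i] += eps
            down[i] -= eps
            grad.append((loss(up) - loss(down)) / (2 * eps))
        candidate = [p - lr * g for p, g in zip(flat, grad)]
        candidate_loss = loss(candidate)
        if candidate_loss < current:
            flat, current = candidate, candidate_loss
        else:
            lr *= 0.5
    return list(zip(flat[0::2], flat[1::2])), current

rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
real = [(x, 3 * x + 1 + rng.gauss(0, 0.1)) for x in xs]      # toy original dataset
thetas = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(8)]
syn_init = [(-1.0, 0.0), (1.0, 0.0)]                         # two-point summary
initial = matching_loss(real, syn_init, thetas)
syn, final = distill(real, syn_init, thetas)
print(f"matching loss: {initial:.4f} -> {final:.6f}")
```

No model is trained to convergence here: each evaluation needs only one gradient on each dataset, which is why this family is cheaper than differentiating through an unrolled inner loop.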

Data distillation by distribution matching

  • Gradient-matching and trajectory-matching based data distillation techniques have been shown to synthesize high-quality data summaries, but are expensive in terms of computation time and memory.
  • Distribution-matching techniques solve a correlated proxy task which restricts the optimization to a single-level, leading to improved scalability.
  • Distribution-matching techniques match the distribution of data in D and D_syn instead of matching the quality of models trained on them.
  • DM uses parametric encoders to cast high-dimensional data into respective low-dimensional latent spaces.
  • CAFE refines the distribution-matching idea by jointly optimizing a single encoder and the data summary.
  • IT-GAN uses the distribution-matching framework to generate data that is informative for model training.
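
The single-level objective described above can be sketched as matching per-class mean embeddings under several encoders. This is a hedged illustration, not the DM or CAFE code: the encoders are random tanh feature maps on scalar inputs, and the names `make_encoder`, `mean_embedding`, and `distribution_matching_loss` as well as the toy data are assumptions.

```python
import math
import random

def make_encoder(num_features, rng):
    """Random, untrained encoder x -> [tanh(a_k * x + b_k)] (a stand-in chosen
    for this sketch in place of a parametric neural encoder)."""
    weights = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_features)]

    def encode(x):
        return [math.tanh(a * x + b) for a, b in weights]

    return encode

def mean_embedding(points, encoder):
    """Average the encoder's feature vectors over a set of scalar inputs."""
    features = [encoder(x) for x in points]
    return [sum(column) / len(features) for column in zip(*features)]

def distribution_matching_loss(real, syn, encoders):
    """Squared distance between the mean embeddings of real and synthetic
    inputs, summed over classes and encoders. No model is trained."""
    total = 0.0
    for encoder in encoders:
        for label in real:
            mu_real = mean_embedding(real[label], encoder)
            mu_syn = mean_embedding(syn[label], encoder)
            total += sum((r - s) ** 2 for r, s in zip(mu_real, mu_syn))
    return total

rng = random.Random(0)
encoders = [make_encoder(16, rng) for _ in range(4)]
# Toy original dataset: two classes centred at -2 and +2.
real = {
    0: [rng.gauss(-2.0, 0.5) for _ in range(100)],
    1: [rng.gauss(2.0, 0.5) for _ in range(100)],
}
syn_matched = {0: [-2.0], 1: [2.0]}   # one synthetic point per class, well placed
syn_swapped = {0: [2.0], 1: [-2.0]}   # same points assigned to the wrong classes
loss_matched = distribution_matching_loss(real, syn_matched, encoders)
loss_swapped = distribution_matching_loss(real, syn_swapped, encoders)
print(f"matched: {loss_matched:.4f}, swapped: {loss_swapped:.4f}")
```

In a full method the synthetic points would be updated by gradient descent on this loss; since there is no inner training loop, each update costs only a few encoder forward and backward passes.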

Data distillation by factorization

  • All of the aforementioned data distillation frameworks maintain the synthesized data summary as a large set of free parameters.
  • Factorization-based data distillation techniques parameterize the data summary using two components: bases and hallucinators.
  • LinBa assumes the bases’ vector space to be the same as the task input space and the hallucinator to be linear and conditioned on a given predictand.
  • HaBa relaxes the linear and predictand-conditional hallucinator assumption of LinBa.
  • KFS maintains a different bases’ vector space from the data domain, allowing it to store more images.
  • It is difficult to compare factorized and non-factorized data distillation techniques.
  • The TBPTT framework proposed by Wang et al. (2018) has been used to distill textual data.
  • Graph distillation has hurdles such as abstract nodes, intrinsic patterns, and quadratic size of the adjacency matrix.
  • GCond, GCDM, and DosCond are used to distill graphs.
  • Recommender systems data is available in the form of abstract and discrete tuples, has a power-law distribution, and has inherent structures.
  • Distill-CF is used to distill implicit-feedback recommender systems data.
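
The bases-and-hallucinators parameterization described in the factorization bullets of this section can be made concrete with a small sketch. It assumes element-wise affine hallucinators and hand-picked values purely for illustration; in the actual methods both components are learned, and the names `hallucinate` and `make_linear_hallucinator` are hypothetical.

```python
def make_linear_hallucinator(scale, shift):
    """Hypothetical element-wise affine hallucinator: x -> scale * x + shift."""
    def apply(base):
        return [scale * v + shift for v in base]
    return apply

def hallucinate(bases, hallucinators):
    """Combine every base with every hallucinator to produce the data summary."""
    return [h(b) for b in bases for h in hallucinators]

bases = [
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.5, 0.25, 0.5, 0.25],
]
hallucinators = [
    make_linear_hallucinator(scale, shift)
    for scale, shift in [(1.0, 0.0), (0.5, 1.0), (-1.0, 2.0)]
]
summary = hallucinate(bases, hallucinators)

# Storage cost: base entries plus one (scale, shift) pair per hallucinator.
stored = sum(len(b) for b in bases) + 2 * len(hallucinators)
generated = sum(len(x) for x in summary)
print(f"{len(summary)} examples from {len(bases)} bases x {len(hallucinators)} hallucinators")
print(f"stored parameters: {stored}, generated values: {generated}")
```

The number of generated examples grows with the product of the two component counts while storage grows with their sum, which is one reason budgets of factorized and non-factorized summaries are hard to compare directly.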

Applications

  • Data distillation can be used to accelerate model training and for other applications
  • Data distillation can be used for differential privacy
  • Data distillation can be used to distill sensitive medical data
  • Data distillation can be used for neural architecture search
  • Data distillation can be used for continual learning and federated learning

Challenges & future directions

  • Data distillation techniques have largely been restricted to image-classification settings
  • Increasing sample efficiency of training image-generation models is important
  • Developing a unified, principled data distillation framework for discrete data would be useful
  • Investigating causes and potential fixes of scaling artifacts is necessary
  • Bilevel optimization has been successfully applied in a variety of applications
  • Theoretical underpinnings of bilevel optimization need to be explored
  • The paper's results table (not reproduced here) evaluates data distillation techniques on ConvNet and kernel ridge regression (KRR); in it, the best overall non-factorized method evaluated on ConvNet is colored orange and the best overall factorized method is colored blue