Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Deep learning has led to the curation of many datasets
Training parameter-hungry models on large datasets poses problems
Data distillation approaches aim to create data summaries
Formal framework and taxonomy of existing approaches presented
Data distillation approaches for images, graphs, and user-item interactions discussed
Current challenges and future research directions identified

Data distillation is a task that aims to create tiny, high-fidelity summaries of data.
The scale-is-everything viewpoint argues that bigger models and datasets are key for advancing AI.
Data distillation is a more amenable and faster way to progress.
Data distillation leads to cost savings, faster research iterations, and improved eco-sustainability.
Data distillation also democratizes the pipeline and accelerates other procedures.

Data distillation techniques aim to synthesize a high-fidelity data summary from a given dataset.
The dataset consists of input features (x) and desired labels (y).
The data summary is defined as a data budget (n) that is a fraction of the original dataset.
A twice-differentiable cost function (l) is used to optimize the data summary.
Existing data distillation techniques solve a bilevel optimization problem.
The data summary is optimized through gradient descent.

Meta-model matching-based data distillation approaches optimize for transferability of models trained on data summary to original dataset
Simplifying assumption is that perfect classifier exists and can be estimated
TBPTT framework has drawbacks such as computationally expensive, bias, and poorly conditioned loss landscapes
Neural Tangent Kernel (NTK) based algorithms solve inner-loop in closed form
KIP uses NTK of neural network in inner-loop of equation
RFAD uses light-weight Empirical Neural Network Gaussian Process kernel and classification loss for outer-loop
FRePO decouples feature extractor and linear classifier and optimizes data summary and feature extractor

Gradient matching based data distillation performs one-step distance matching on a network trained on the target dataset and the same network trained on the data summary
Optimization is more efficient than meta-model matching framework
Data summaries optimized by gradient-matching outperform heuristic data samplers, principled coreset construction techniques, and TBPTT-based data distillation
Optimization objective is to minimize distance between parameters
DSA improves by performing image-augmentations on both datasets
DCC incorporates class contrastive signals inside each gradient-matching step
IDC extends gradient matching framework by multi-formation and matching gradients of the network’s training trajectory over the full dataset
TESLA re-parameterizes parameter-matching loss and uses learnable soft-labels

Gradient-matching and trajectory-matching based data distillation techniques have been shown to synthesize high-quality data summaries, but are expensive in terms of computation time and memory.
Distribution-matching techniques solve a correlated proxy task which restricts the optimization to a single-level, leading to improved scalability.
Distribution-matching techniques match the distribution of data in D vs. D syn instead of the quality of models.
DM uses parametric encoders to cast high-dimensional data into respective low-dimensional latent spaces.
CAFE refines the distribution-matching idea by jointly optimizing a single encoder and the data summary.
IT-GAN uses the distribution-matching framework to generate data that is informative for model training.

All of the aforementioned data distillation frameworks maintain the synthesized data summary as a large set of free parameters.
Factorization-based data distillation techniques parameterize the data summary using two components: bases and hallucinators.
LinBa assumes the bases’ vector space to be the same as the task input space and the hallucinator to be linear and conditioned on a given predictand.
HaBa relaxes the linear and predictand-conditional hallucinator assumption of LinBa.
KFS maintains a different bases’ vector space from the data domain, allowing it to store more images.
It is difficult to compare factorized and non-factorized data distillation techniques.
TBPTT framework proposed by Wang et al. (2018) is used to distill textual data.
Graph distillation has hurdles such as abstract nodes, intrinsic patterns, and quadratic size of the adjacency matrix.
GCond, GCDM, and DosCond are used to distill graphs.
Recommender systems data is available in the form of abstract and discrete tuples, has a power-law distribution, and has inherent structures.
Distill-CF is used to distill implicit-feedback recommender systems data.

Data distillation can be used to accelerate model training and for other applications
Data distillation can be used for differential privacy
Data distillation can be used to distill sensitive medical data
Data distillation can be used for neural architecture search
Data distillation can be used for continual learning and federated learning

Data distillation techniques have largely been restricted to image-classification settings
Increasing sample efficiency of training image-generation models is important
Developing unified, principled data distillation framework for discrete data is useful
Investigating causes and potential fixes of scaling artifacts is necessary
Bilevel optimization has been successfully applied in a variety of applications
Theoretical underpinnings of bilevel optimization need to be explored
Evaluating data distillation techniques on ConvNet and KRR
Best-overall non-factorized method evaluated on ConvNet is colored orange
Best-overall factorized method is colored blue