Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Deep learning has led to the curation of many datasets
- Training parameter-hungry models on large datasets poses problems
- Data distillation approaches aim to create data summaries
- Formal framework and taxonomy of existing approaches presented
- Data distillation approaches for images, graphs, and user-item interactions discussed
- Current challenges and future research directions identified
Paper Content
Introduction
- Data distillation is a task that aims to create tiny, high-fidelity summaries of data.
- The scale-is-everything viewpoint argues that bigger models and datasets are key for advancing AI.
- Data distillation is a more amenable and faster way to progress.
- Data distillation leads to cost savings, faster research iterations, and improved eco-sustainability.
- Data distillation also democratizes the pipeline and accelerates other procedures.
The data distillation framework
- Data distillation techniques aim to synthesize a high-fidelity data summary from a given dataset.
- The dataset consists of input features (x) and desired labels (y).
- The data summary is defined as a data budget (n) that is a fraction of the original dataset.
- A twice-differentiable cost function (l) is used to optimize the data summary.
- Existing data distillation techniques solve a bilevel optimization problem.
- The data summary is optimized through gradient descent.
Data distillation by meta-model matching
- Meta-model matching-based data distillation approaches optimize for transferability of models trained on data summary to original dataset
- Simplifying assumption is that perfect classifier exists and can be estimated
- TBPTT framework has drawbacks such as computationally expensive, bias, and poorly conditioned loss landscapes
- Neural Tangent Kernel (NTK) based algorithms solve inner-loop in closed form
- KIP uses NTK of neural network in inner-loop of equation
- RFAD uses light-weight Empirical Neural Network Gaussian Process kernel and classification loss for outer-loop
- FRePO decouples feature extractor and linear classifier and optimizes data summary and feature extractor
Data distillation by gradient matching
- Gradient matching based data distillation performs one-step distance matching on a network trained on the target dataset and the same network trained on the data summary
- Optimization is more efficient than meta-model matching framework
- Data summaries optimized by gradient-matching outperform heuristic data samplers, principled coreset construction techniques, and TBPTT-based data distillation
- Optimization objective is to minimize distance between parameters
- DSA improves by performing image-augmentations on both datasets
- DCC incorporates class contrastive signals inside each gradient-matching step
- IDC extends gradient matching framework by multi-formation and matching gradients of the network’s training trajectory over the full dataset
- TESLA re-parameterizes parameter-matching loss and uses learnable soft-labels
Data distillation by distribution matching
- Gradient-matching and trajectory-matching based data distillation techniques have been shown to synthesize high-quality data summaries, but are expensive in terms of computation time and memory.
- Distribution-matching techniques solve a correlated proxy task which restricts the optimization to a single-level, leading to improved scalability.
- Distribution-matching techniques match the distribution of data in D vs. D syn instead of the quality of models.
- DM uses parametric encoders to cast high-dimensional data into respective low-dimensional latent spaces.
- CAFE refines the distribution-matching idea by jointly optimizing a single encoder and the data summary.
- IT-GAN uses the distribution-matching framework to generate data that is informative for model training.
Data distillation by factorization
- All of the aforementioned data distillation frameworks maintain the synthesized data summary as a large set of free parameters.
- Factorization-based data distillation techniques parameterize the data summary using two components: bases and hallucinators.
- LinBa assumes the bases’ vector space to be the same as the task input space and the hallucinator to be linear and conditioned on a given predictand.
- HaBa relaxes the linear and predictand-conditional hallucinator assumption of LinBa.
- KFS maintains a different bases’ vector space from the data domain, allowing it to store more images.
- It is difficult to compare factorized and non-factorized data distillation techniques.
- TBPTT framework proposed by Wang et al. (2018) is used to distill textual data.
- Graph distillation has hurdles such as abstract nodes, intrinsic patterns, and quadratic size of the adjacency matrix.
- GCond, GCDM, and DosCond are used to distill graphs.
- Recommender systems data is available in the form of abstract and discrete tuples, has a power-law distribution, and has inherent structures.
- Distill-CF is used to distill implicit-feedback recommender systems data.
Applications
- Data distillation can be used to accelerate model training and for other applications
- Data distillation can be used for differential privacy
- Data distillation can be used to distill sensitive medical data
- Data distillation can be used for neural architecture search
- Data distillation can be used for continual learning and federated learning
Challenges & future directions
- Data distillation techniques have largely been restricted to image-classification settings
- Increasing sample efficiency of training image-generation models is important
- Developing unified, principled data distillation framework for discrete data is useful
- Investigating causes and potential fixes of scaling artifacts is necessary
- Bilevel optimization has been successfully applied in a variety of applications
- Theoretical underpinnings of bilevel optimization need to be explored
- Evaluating data distillation techniques on ConvNet and KRR
- Best-overall non-factorized method evaluated on ConvNet is colored orange
- Best-overall factorized method is colored blue