Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Stable Diffusion is a large-scale image generation model that can create realistic images from a simple text prompt.
This paper explores whether real images are necessary for training image prediction models.
The paper tests the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure their usefulness for training classification models.
The synthetic images are able to close the gap between models produced by synthetic images and models trained with real images.
Models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.

Paper Content

Introduction

Machine learning and deep learning have changed the landscape of computer vision research
Datasets have grown in size and complexity and have become contributions in their own right
ImageNet has had an unprecedented impact on the field
Large and generic models have been trained on less curated data
These models have been used for prediction tasks and text-conditioned image generation
Models such as DALL-E and Stable Diffusion have demonstrated impressive image generation ability
These generative models are trained on billion-scale datasets
This paper explores the need for real images when training image prediction models
Synthetic images are generated for ImageNet classes using class names as prompts
Models trained on synthetic images can achieve 83% and 51% top-5 accuracy on ImageNet-100 and ImageNet-1K respectively
The performance gap between models trained on real and synthetic images is reduced when testing for resilience to domain shifts or adversarial examples
Synthetic data has been used to create large amounts of labeled data for annotation heavy tasks
Generative models have been used to extend models to new classes or create novel views at test time

Distillation of datasets and models

Knowledge distillation is a mechanism to transfer knowledge from a pretrained model to another.
Dataset distillation is a way of compressing a training set of images into a smaller set.
Reconstructing images from model activations is another form of distillation.
Our approach uses prompting to distill knowledge into a visual encoder for a specific image classification task.

Preliminaries

Task formulation: Learning an image classification model using only synthetically generated images
Stable Diffusion: Text-to-image generation model used in paper
Model consists of an encoder and classifier
Text-to-image generation process takes a textual prompt as input
Parameters control quality and speed of text-conditioned diffusion
Task is also seen as text-guided, image-free knowledge “distillation”

Generating synthetic imagenet clones

Created clones of ImageNet dataset by synthesizing images
Referred to synthetic datasets of ImageNet classes created using Stable Diffusion as ImageNet-SD
Described different ways of creating ImageNet-SD datasets
Presented generic, class-agnostic ways for tackling issues with respect to semantics and diversity
Presented limited set of qualitative results in Fig. 2, more extensive set in Appendix

Generating datasets using class names

Generated images using class name as prompt
Semantic errors in generated images
Lack of diversity in generated images
Visual domain issues in generated images

Addressing issues with semantics and domain

Comparing real images from ImageNet to synthetic ones generated using synset names as prompt reveals that for some classes their semantics do not match.
This is due to polysemy, i.e., multiple potential semantic meanings or physical instantiations of the class names used as prompt.
To fix this, two additional sources of information from Wordnet are employed: hypernyms and definitions.
Appending this information to the prompt leads to better classification performance.

Increasing the diversity of generated images

Generating images using more expressive prompts reduces semantic errors and increases visual diversity.
Images tend to display the class instance centered and in a prominent position.
Real images feature more diversity, settings, backgrounds, and multiple instances of the same class.
Generating multiple instances and diversifying the background improves diversity.
Label noise and visual realism provide additional stochasticity during training.
Experimental validation shows diverse data augmentations can be valuable for out-of-distribution scenarios.

Experiments

Analyzing performance of image classification models learned using synthetic datasets
Most of study done on ImageNet-100 dataset
ImageNet-100 is a random subset of ImageNet-1K, spanning over 100 classes and 126,689 images
Synthetic datasets for two ImageNet subsets are ImageNet-100-SD and ImageNet-1K-SD
Generator is Stable Diffusion v1.4 model
Evaluate models on real images
Encoder and classifier learned during training used to predict labels of real images
Encoder is ResNet50
Data augmentation pipeline from DINO used

Results on imagenet

Models trained on real or synthetic images compared on ImageNet-1K and ImageNet-v2
Data augmentation strategies evaluated when learning from synthetic datasets
Gains for real images are small, but gains for synthetic images are over 14%
Synthetic images can benefit from same augmentations as real images
Different prompts used to create variants of ImageNet-100-SD
Using class name as prompt achieves over 70% Top-5 accuracy
Prompts that improve diversity give strong gains
Generating objects on diverse backgrounds gives best results
Scaling number of synthetic images gives 6-7% gains in accuracy
Model trained on synthetic ImageNet-1K-SD reaches 26.2% Top-1 accuracy

Resilience to domain shifts

Models trained on synthetic data can match the performance of models trained on real images
Models trained on synthetic data can outperform models trained on real images in some cases
Models trained on synthetic data lag behind models trained on real images when it comes to harder classification tasks

Transfer learning

Pretrained models were used in previous evaluations
Evaluations used encoders and classifiers trained on synthetic ImageNet datasets
Different protocol used in this paper: evaluated quality of representations learned by encoders alone
Ten commonly-used transfer datasets used
Reported Top-1 accuracy on test set of datasets
Compared ImageNet-100-SD and ImageNet-1K-SD visual encoders with baselines trained on ImageNet-100 and ImageNet-1K
Representations learned on synthetic images showed comparable generalization performance to representations trained on real images

Discussion

Implications of analysis proposed in paper
Process to create ImageNet-SD requires minimal assumptions and can be applied to wider set of classes
Scaling synthetic datasets important, but beyond scope of paper
Quality of classifier bounded by expressivity of generator
Data and model bias due to data and architecture
Stereotypes reinforced due to lack of diversity
Societal implications of using such models to generate synthetic datasets

Conclusions

Study to what extent ImageNet can be replaced by a dataset synthesized by a top performing text-to-image generator
Models trained on synthetic data exhibit exceptional generalization capability
Evaluate models in two ways: pretrained models and classifiers learned during pretraining
Analyze representations obtained with models trained on real and synthetic data
Sparsity, intra-class distance, feature redundancy and coding length metrics used
Spider plots for models trained on real or synthetic data for ImageNet-100 and ImageNet-1K
Qualitative results for all ImageNet-100 classes
Showcase domain and diversity issues
Semantics partially wrong for some classes
Appending hypernym or definition of each synset fixes polysemy issues
Increasing diversity correlates with more semantic errors
Non-negligible percentage of generated images are non-natural

Link to paper#

Abstract#

Paper Content#

Introduction#

Distillation of datasets and models#

Preliminaries#

Generating synthetic imagenet clones#

Generating datasets using class names#

Addressing issues with semantics and domain#

Increasing the diversity of generated images#

Experiments#

Results on imagenet#

Resilience to domain shifts#

Transfer learning#

Discussion#

Conclusions#

Link to paper

Abstract

Paper Content

Introduction

Distillation of datasets and models

Preliminaries

Generating synthetic imagenet clones

Generating datasets using class names

Addressing issues with semantics and domain

Increasing the diversity of generated images

Experiments

Results on imagenet

Resilience to domain shifts

Transfer learning

Discussion

Conclusions