Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Stable Diffusion is a large-scale image generation model that can create realistic images from a simple text prompt.
- This paper explores whether real images are necessary for training image prediction models.
- The paper tests the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure their usefulness for training classification models.
- The synthetic images are able to close the gap between models produced by synthetic images and models trained with real images.
- Models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data.
Paper Content
Introduction
- Machine learning and deep learning have changed the landscape of computer vision research
- Datasets have grown in size and complexity and have become contributions in their own right
- ImageNet has had an unprecedented impact on the field
- Large and generic models have been trained on less curated data
- These models have been used for prediction tasks and text-conditioned image generation
- Models such as DALL-E and Stable Diffusion have demonstrated impressive image generation ability
- These generative models are trained on billion-scale datasets
- This paper explores the need for real images when training image prediction models
- Synthetic images are generated for ImageNet classes using class names as prompts
- Models trained on synthetic images can achieve 83% and 51% top-5 accuracy on ImageNet-100 and ImageNet-1K respectively
- The performance gap between models trained on real and synthetic images is reduced when testing for resilience to domain shifts or adversarial examples
- Synthetic data has been used to create large amounts of labeled data for annotation heavy tasks
- Generative models have been used to extend models to new classes or create novel views at test time
Distillation of datasets and models
- Knowledge distillation is a mechanism to transfer knowledge from a pretrained model to another.
- Dataset distillation is a way of compressing a training set of images into a smaller set.
- Reconstructing images from model activations is another form of distillation.
- Our approach uses prompting to distill knowledge into a visual encoder for a specific image classification task.
Preliminaries
- Task formulation: Learning an image classification model using only synthetically generated images
- Stable Diffusion: Text-to-image generation model used in paper
- Model consists of an encoder and classifier
- Text-to-image generation process takes a textual prompt as input
- Parameters control quality and speed of text-conditioned diffusion
- Task is also seen as text-guided, image-free knowledge “distillation”
Generating synthetic imagenet clones
- Created clones of ImageNet dataset by synthesizing images
- Referred to synthetic datasets of ImageNet classes created using Stable Diffusion as ImageNet-SD
- Described different ways of creating ImageNet-SD datasets
- Presented generic, class-agnostic ways for tackling issues with respect to semantics and diversity
- Presented limited set of qualitative results in Fig. 2, more extensive set in Appendix
Generating datasets using class names
- Generated images using class name as prompt
- Semantic errors in generated images
- Lack of diversity in generated images
- Visual domain issues in generated images
Addressing issues with semantics and domain
- Comparing real images from ImageNet to synthetic ones generated using synset names as prompt reveals that for some classes their semantics do not match.
- This is due to polysemy, i.e., multiple potential semantic meanings or physical instantiations of the class names used as prompt.
- To fix this, two additional sources of information from Wordnet are employed: hypernyms and definitions.
- Appending this information to the prompt leads to better classification performance.
Increasing the diversity of generated images
- Generating images using more expressive prompts reduces semantic errors and increases visual diversity.
- Images tend to display the class instance centered and in a prominent position.
- Real images feature more diversity, settings, backgrounds, and multiple instances of the same class.
- Generating multiple instances and diversifying the background improves diversity.
- Label noise and visual realism provide additional stochasticity during training.
- Experimental validation shows diverse data augmentations can be valuable for out-of-distribution scenarios.
Experiments
- Analyzing performance of image classification models learned using synthetic datasets
- Most of study done on ImageNet-100 dataset
- ImageNet-100 is a random subset of ImageNet-1K, spanning over 100 classes and 126,689 images
- Synthetic datasets for two ImageNet subsets are ImageNet-100-SD and ImageNet-1K-SD
- Generator is Stable Diffusion v1.4 model
- Evaluate models on real images
- Encoder and classifier learned during training used to predict labels of real images
- Encoder is ResNet50
- Data augmentation pipeline from DINO used
Results on imagenet
- Models trained on real or synthetic images compared on ImageNet-1K and ImageNet-v2
- Data augmentation strategies evaluated when learning from synthetic datasets
- Gains for real images are small, but gains for synthetic images are over 14%
- Synthetic images can benefit from same augmentations as real images
- Different prompts used to create variants of ImageNet-100-SD
- Using class name as prompt achieves over 70% Top-5 accuracy
- Prompts that improve diversity give strong gains
- Generating objects on diverse backgrounds gives best results
- Scaling number of synthetic images gives 6-7% gains in accuracy
- Model trained on synthetic ImageNet-1K-SD reaches 26.2% Top-1 accuracy
Resilience to domain shifts
- Models trained on synthetic data can match the performance of models trained on real images
- Models trained on synthetic data can outperform models trained on real images in some cases
- Models trained on synthetic data lag behind models trained on real images when it comes to harder classification tasks
Transfer learning
- Pretrained models were used in previous evaluations
- Evaluations used encoders and classifiers trained on synthetic ImageNet datasets
- Different protocol used in this paper: evaluated quality of representations learned by encoders alone
- Ten commonly-used transfer datasets used
- Reported Top-1 accuracy on test set of datasets
- Compared ImageNet-100-SD and ImageNet-1K-SD visual encoders with baselines trained on ImageNet-100 and ImageNet-1K
- Representations learned on synthetic images showed comparable generalization performance to representations trained on real images
Discussion
- Implications of analysis proposed in paper
- Process to create ImageNet-SD requires minimal assumptions and can be applied to wider set of classes
- Scaling synthetic datasets important, but beyond scope of paper
- Quality of classifier bounded by expressivity of generator
- Data and model bias due to data and architecture
- Stereotypes reinforced due to lack of diversity
- Societal implications of using such models to generate synthetic datasets
Conclusions
- Study to what extent ImageNet can be replaced by a dataset synthesized by a top performing text-to-image generator
- Models trained on synthetic data exhibit exceptional generalization capability
- Evaluate models in two ways: pretrained models and classifiers learned during pretraining
- Analyze representations obtained with models trained on real and synthetic data
- Sparsity, intra-class distance, feature redundancy and coding length metrics used
- Spider plots for models trained on real or synthetic data for ImageNet-100 and ImageNet-1K
- Qualitative results for all ImageNet-100 classes
- Showcase domain and diversity issues
- Semantics partially wrong for some classes
- Appending hypernym or definition of each synset fixes polysemy issues
- Increasing diversity correlates with more semantic errors
- Non-negligible percentage of generated images are non-natural