Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Goal: Augment pre-trained text-to-image diffusion model with open-vocabulary objects grounding
- Contribution: Insert grounding module into existing diffusion model, automatic pipeline for constructing dataset
Paper Content
Introduction
- Text-to-image generative models have strong semantic correspondence between visual and language
- Models lack ability to ground objects within generated images
- Paper aims to augment existing text-to-image diffusion model with ability to generate photorealistic images and segmentation masks
- Challenges include establishing visual-language correspondence and open-vocabulary grounding
- Automatic pipeline developed to construct {image, segmentation, text prompt} triplets
- Novel architecture proposed to segment any visual entity mentioned in text prompt
- Evaluation protocols initiated to validate effectiveness of open-vocabulary grounding
Related work
- Image generation is a challenging task in computer vision
- Generative adversarial networks (GANs), variational autoencoders (VAEs), flow-based models and autoregressive models (ARMs) have made progress
- Diffusion Probabilistic Models (DMs) demonstrate state-of-the-art generation quality
- Visual grounding is used to understand natural language queries and find target objects in an image
Methodology
- Aim to introduce a knowledge induction procedure to convert an existing text-to-image diffusion model for grounded generation.
- Core idea is to exploit image-segmentation pairs as visual demonstrations to build the general visual-language correspondence.
- Architecture design to align visual and textual embedding under an open-vocabulary setting.
- Training procedure based on {image, segmentation, text prompt} triplets.
Problem scenario
- A strong text-to-image diffusion model is assumed.
- The goal is to convert it into a grounded generation model.
- The model takes noise and language description as input and generates an image with segmentation masks.
- The model should be open-vocabulary, meaning it should be able to output segmentation masks for any objects.
Preliminary on diffusion model
- Diffusion models are probabilistic generative models that learn a data distribution by denoising randomly sampled Gaussian noises.
- Text-to-image synthesis uses a dataset of image-caption pairs to predict a denoised variant of the input conditioned on the text prompt.
- Stable Diffusion is a variant of diffusion model that encodes images with a variational autoencoder and transfers the diffusion process to latent space.
- Stable Diffusion consists of three components: a text encoder, a pre-trained variational autoencoder, and a time-conditional UNet.
- Training and inference of Stable Diffusion involves iteratively denoising a latent vector conditioned on the text prompt.
Open-vocabulary grounding
Dataset construction
- Introduce idea to construct training set with {visual feature, segmentation, text prompt} triplets
- Develop automatic pipeline to construct {image, segmentation, text prompt} triplets
- Visual feature obtained from Stable Diffusion via forward inference to generate image
- Prepare vocabulary with common object categories
- Randomly select number of classes to construct text prompt for image generation
- Acquire segmentation masks with off-the-shelf object detector
- Divide vocabulary set into seen and unseen categories
Architecture
- Visual encoder takes visual representation from Stable Diffusion and concatenates features with same spatial resolution
- Text encoder takes text prompt and outputs embeddings for all visual objects
Visual encoder
- Fusion module computes interaction between visual features and text embeddings.
- Outputs segmentation masks for all visual objects.
- Uses a standard transformer decoder with three layers.
- Text embeddings are treated as Query.
- Iteratively attend the visual feature for updating.
- Converted into per-segmentation embeddings with a MLP.
- Object segmentation masks obtained by dot product visual features with per-segmentation embeddings.
Training
- Constructed dataset used to train proposed grounding module
- Sigmoid function used to predict segmentation masks
- Two sources of errors: diffusion model and off-the-shelf detector
- Two training strategies: Normal Training and Training w.o. Zero Masks
Experiments
- Train grounding module with constructed training set
- Test segmentation performance on generated images from Stable Diffusion
- Use guided diffusion model to construct synthesized semantic segmentation dataset
- Train segmentation model on synthesized dataset
- Evaluate model on existing benchmarks for zero-shot segmentation
- Conduct ablation studies on different training strategies
Protocol-i: grounded generation
- Training set consists of subset of common (seen) categories, testing set consists of both seen and unseen categories
- Two different sets of categories adopted: PASCAL VOC and MS-COCO
- Training set only consists of seen categories
- Evaluation metric is category-wise mean intersection-over-union (mIoU)
- Model outperforms unsupervised method DAAM
- Model achieves superior performance on both seen and unseen categories
- Visualization results demonstrate successful grounding of objects in terms of segmentation mask
Protocol-ii: open-vocabulary segmentation
- Constructed a synthesized image-segmentation dataset with guided Stable Diffusion
- Trained a semantic segmentation model on the synthetic dataset
- Evaluated the model on public image segmentation benchmarks
- Evaluated the effectiveness of the grounding module from the performance on segmenting unseen categories
Voc and only evaluate on its test set (1,449 images).
- Model uses Mask-Former with ResNet101 as backbone
- Trained on synthetic dataset for 40k iterations with batch size 8
- Model outperforms most existing zero-shot semantic segmentation approaches
- Model obtains accurate segmentation on both seen and unseen categories
Ablation study
- Normal Training results in unsatisfactory performance on unseen categories
- Training without Zero Masks achieves equally good performance on both seen and unseen categories
- Performance for grounding decreases as denoising steps decrease, best result is obtained at t = 5
Conclusion
- Propose a novel idea for guiding the existing Stable Diffusion towards open-vocabulary grounded generation
- Introduce a grounding module to explicitly align the visual and textual embedding space of the Stable Diffusion
- Train the grounding module with an automatically constructed dataset of {image, segmentation, text prompts} triplets
- Visual-language correspondence can be established with only training on a limited number of object categories
- Generate a synthetic semantic segmentation dataset using the guided Stable Diffusion
- Train a semantic segmentation model without finetuning
- Show competitive performance to existing zero-shot semantic segmentation approaches on PASCAL VOC dataset
- Grounding module consists of visual encoder, text encoder, transformer decoder and MLP in the fusion module
- Construct the training set by randomly selecting one or two categories from the seen ones and using the prompt template
- Construct the test set by using all categories, including seen and unseen categories
- Synthetic semantic segmentation dataset consists of 500 images per category and 71 images per co-appearing category pair
- Training on the combination of one and two object categories gives the best results overall
- Grounding module can generalise to unseen categories, even as few as five seen categories
- Construct the training set by utilising the inverse process of diffusion
- Compare the performance of training on constructed dataset and real dataset
- Provide qualitative results of generated images and segmentation masks