Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Goal: Augment a pre-trained text-to-image diffusion model with open-vocabulary object grounding
Contribution: A grounding module inserted into an existing diffusion model, plus an automatic pipeline for constructing the training dataset

Paper Content

Introduction

Text-to-image generative models have strong semantic correspondence between vision and language
Models lack ability to ground objects within generated images
Paper aims to augment existing text-to-image diffusion model with ability to generate photorealistic images and segmentation masks
Challenges include establishing visual-language correspondence and open-vocabulary grounding
Automatic pipeline developed to construct {image, segmentation, text prompt} triplets
Novel architecture proposed to segment any visual entity mentioned in text prompt
Evaluation protocols initiated to validate effectiveness of open-vocabulary grounding

Related work

Image generation is a challenging task in computer vision
Generative adversarial networks (GANs), variational autoencoders (VAEs), flow-based models and autoregressive models (ARMs) have made progress
Diffusion Probabilistic Models (DMs) demonstrate state-of-the-art generation quality
Visual grounding is used to understand natural language queries and find target objects in an image

Methodology

Aim to introduce a knowledge induction procedure to convert an existing text-to-image diffusion model for grounded generation.
Core idea is to exploit image-segmentation pairs as visual demonstrations to build the general visual-language correspondence.
Architecture design to align visual and textual embedding under an open-vocabulary setting.
Training procedure based on {image, segmentation, text prompt} triplets.

Problem scenario

A strong text-to-image diffusion model is assumed.
The goal is to convert it into a grounded generation model.
The model takes noise and language description as input and generates an image with segmentation masks.
The model should be open-vocabulary, meaning it should be able to output segmentation masks for any objects.

Preliminary on diffusion model

Diffusion models are probabilistic generative models that learn a data distribution by denoising randomly sampled Gaussian noise.
Text-to-image diffusion models are trained on a dataset of image-caption pairs to predict a denoised variant of the noisy input, conditioned on the text prompt.
Stable Diffusion is a variant of diffusion model that encodes images with a variational autoencoder and transfers the diffusion process to latent space.
Stable Diffusion consists of three components: a text encoder, a pre-trained variational autoencoder, and a time-conditional UNet.
Training and inference of Stable Diffusion involves iteratively denoising a latent vector conditioned on the text prompt.
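
The noising and denoising steps described above can be sketched as follows. This is a minimal illustration assuming the standard noise-prediction formulation of diffusion models; the function names `forward_noise` and `denoising_loss`, the flat list used as a latent, and the scalar `alpha_bar_t` are assumptions for illustration and not taken from the paper.

```python
import math
import random


def forward_noise(z0, alpha_bar_t, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(alpha_bar_t) * z_0, (1 - alpha_bar_t) * I).

    z0 is a flat list of latent values (hypothetical stand-in for a VAE latent).
    Returns the noisy latent and the noise that was added.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    zt = [scale_signal * z + scale_noise * e for z, e in zip(z0, eps)]
    return zt, eps


def denoising_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)
```

In a real model, `eps_pred` would come from the time-conditional UNet given the noisy latent, the timestep, and the text embedding; here it is left as an input so the sketch stays self-contained.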

Open-vocabulary grounding

Dataset construction

Introduce idea to construct training set with {visual feature, segmentation, text prompt} triplets
Develop automatic pipeline to construct {image, segmentation, text prompt} triplets
Visual feature obtained from Stable Diffusion via forward inference to generate image
Prepare vocabulary with common object categories
Randomly select a number of classes from the vocabulary to construct the text prompt for image generation
Acquire segmentation masks with off-the-shelf object detector
Divide vocabulary set into seen and unseen categories
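
The prompt-construction step of the pipeline above might look like the sketch below. The category lists, the template wording, and the helper name `build_prompt` are hypothetical; the summary only states that one or two categories are sampled and inserted into a prompt template. Image generation and mask extraction with an off-the-shelf detector are omitted.

```python
import random

# Hypothetical vocabulary split; the actual category lists are not given here.
SEEN = ["dog", "cat", "car", "bus", "person"]
UNSEEN = ["sofa", "boat"]


def build_prompt(categories, rng, max_objects=2):
    """Pick between 1 and max_objects distinct categories and fill a prompt template.

    The template text below is an assumption for illustration.
    """
    k = rng.randint(1, min(max_objects, len(categories)))
    chosen = rng.sample(categories, k)
    prompt = "a photograph of a " + " and a ".join(chosen)
    return chosen, prompt
```

Training prompts would draw from `SEEN` only, while test prompts would draw from `SEEN + UNSEEN`.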

Architecture

Visual encoder takes visual representation from Stable Diffusion and concatenates features with same spatial resolution
Text encoder takes text prompt and outputs embeddings for all visual objects
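
As a rough illustration of the concatenation step in the visual encoder, the sketch below groups feature maps by spatial resolution and joins them along the channel axis. Feature maps are represented as nested lists of shape H x W x C; the function name `concat_by_resolution` and this data layout are assumptions, not the paper's implementation.

```python
from collections import defaultdict


def concat_by_resolution(feature_maps):
    """Group H x W x C feature maps by (H, W) and concatenate channels within each group."""
    groups = defaultdict(list)
    for fmap in feature_maps:
        height, width = len(fmap), len(fmap[0])
        groups[(height, width)].append(fmap)

    fused = {}
    for (height, width), maps in groups.items():
        # For every pixel, append the channels of each map in the group, in order.
        fused[(height, width)] = [
            [[c for m in maps for c in m[i][j]] for j in range(width)]
            for i in range(height)
        ]
    return fused
```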

Visual encoder

Fusion module computes interaction between visual features and text embeddings.
Outputs segmentation masks for all visual objects.
Uses a standard transformer decoder with three layers.
Text embeddings are treated as the Query.
The queries iteratively attend to the visual features and are updated at each layer.
The updated queries are converted into per-segmentation embeddings with an MLP.
Object segmentation masks are obtained by taking the dot product of the visual features with the per-segmentation embeddings.
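
The final dot-product step can be sketched as below, assuming the transformer decoder and MLP have already produced one embedding per object. The name `predict_masks`, the nested-list layout (H x W x D features, N x D embeddings), and the 0.5 threshold are assumptions for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_masks(visual_features, seg_embeddings, threshold=0.5):
    """Return one binary H x W mask per segmentation embedding.

    Each pixel's logit is the dot product of its D-dim feature with the embedding.
    """
    masks = []
    for emb in seg_embeddings:
        mask = [
            [
                1 if sigmoid(sum(f * e for f, e in zip(pixel, emb))) > threshold else 0
                for pixel in row
            ]
            for row in visual_features
        ]
        masks.append(mask)
    return masks
```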

Training

Constructed dataset used to train proposed grounding module
Sigmoid function used to predict segmentation masks
Two sources of errors: diffusion model and off-the-shelf detector
Two training strategies: Normal Training and Training w.o. Zero Masks
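
One plausible reading of the two strategies is sketched below: a per-pixel loss on sigmoid outputs, with an option to skip targets whose mask is entirely zero (for example, when the diffusion model did not render the object or the detector missed it). The choice of binary cross-entropy, the function names, and the exact skipping rule are assumptions; the summary does not specify them.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a binary mask."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def grounding_loss(preds, targets, skip_zero_masks):
    """Average the per-object loss; optionally ignore objects with all-zero target masks."""
    losses = []
    for pred, target in zip(preds, targets):
        is_empty = not any(any(row) for row in target)
        if skip_zero_masks and is_empty:
            continue
        losses.append(bce(pred, target))
    return sum(losses) / len(losses) if losses else 0.0
```

With `skip_zero_masks=False` this corresponds to the "Normal Training" reading, and with `skip_zero_masks=True` to "Training w.o. Zero Masks".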

Experiments

Train grounding module with constructed training set
Test segmentation performance on generated images from Stable Diffusion
Use guided diffusion model to construct synthesized semantic segmentation dataset
Train segmentation model on synthesized dataset
Evaluate model on existing benchmarks for zero-shot segmentation
Conduct ablation studies on different training strategies

Protocol-i: grounded generation

Training set consists of subset of common (seen) categories, testing set consists of both seen and unseen categories
Two different sets of categories adopted: PASCAL VOC and MS-COCO
Training set only consists of seen categories
Evaluation metric is category-wise mean intersection-over-union (mIoU)
Model outperforms unsupervised method DAAM
Model achieves superior performance on both seen and unseen categories
Visualization results demonstrate successful grounding of objects in terms of segmentation mask
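
For reference, a category-wise mIoU can be computed roughly as follows. This sketch accumulates intersection and union per category over all samples and then averages across categories, which is a common convention; whether the paper accumulates over the dataset or averages per image is an assumption here, as is the name `category_miou`.

```python
def category_miou(samples):
    """Compute per-category IoU and its mean.

    samples: iterable of (category, pred_mask, gt_mask), where masks are
    binary H x W nested lists. Categories with an empty union are ignored.
    """
    inter, union = {}, {}
    for cat, pred, gt in samples:
        for p_row, g_row in zip(pred, gt):
            for p, g in zip(p_row, g_row):
                inter[cat] = inter.get(cat, 0) + (1 if p and g else 0)
                union[cat] = union.get(cat, 0) + (1 if p or g else 0)

    per_cat = {c: inter[c] / union[c] for c in union if union[c] > 0}
    mean = sum(per_cat.values()) / len(per_cat) if per_cat else 0.0
    return mean, per_cat
```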

Protocol-ii: open-vocabulary segmentation

Constructed a synthesized image-segmentation dataset with guided Stable Diffusion
Trained a semantic segmentation model on the synthetic dataset
Evaluated the model on public image segmentation benchmarks
Evaluated the effectiveness of the grounding module from the performance on segmenting unseen categories

Evaluation is on the PASCAL VOC test set only (1,449 images).

Model uses MaskFormer with ResNet101 as backbone
Trained on synthetic dataset for 40k iterations with batch size 8
Model outperforms most existing zero-shot semantic segmentation approaches
Model obtains accurate segmentation on both seen and unseen categories

Ablation study

Normal Training results in unsatisfactory performance on unseen categories
Training without Zero Masks achieves equally good performance on both seen and unseen categories
Performance for grounding decreases as denoising steps decrease; the best result is obtained at t = 5

Conclusion

Propose a novel idea for guiding the existing Stable Diffusion towards open-vocabulary grounded generation
Introduce a grounding module to explicitly align the visual and textual embedding space of the Stable Diffusion
Train the grounding module with an automatically constructed dataset of {image, segmentation, text prompts} triplets
Visual-language correspondence can be established with only training on a limited number of object categories
Generate a synthetic semantic segmentation dataset using the guided Stable Diffusion
Train a semantic segmentation model without finetuning
Show competitive performance to existing zero-shot semantic segmentation approaches on PASCAL VOC dataset
Grounding module consists of visual encoder, text encoder, transformer decoder and MLP in the fusion module
Construct the training set by randomly selecting one or two categories from the seen ones and using the prompt template
Construct the test set by using all categories, including seen and unseen categories
Synthetic semantic segmentation dataset consists of 500 images per category and 71 images per co-appearing category pair
Training on the combination of one and two object categories gives the best results overall
Grounding module can generalise to unseen categories, even with as few as five seen categories
Construct the training set by utilising the inverse process of diffusion
Compare the performance of training on constructed dataset and real dataset
Provide qualitative results of generated images and segmentation masks
