Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Goal: Augment a pre-trained text-to-image diffusion model with open-vocabulary object grounding
  • Contribution: A grounding module inserted into an existing diffusion model, plus an automatic pipeline for constructing the training dataset

Paper Content

Introduction

  • Text-to-image generative models have strong semantic correspondence between vision and language
  • Models lack ability to ground objects within generated images
  • Paper aims to augment an existing text-to-image diffusion model so that it generates segmentation masks together with photorealistic images
  • Challenges include establishing visual-language correspondence and open-vocabulary grounding
  • Automatic pipeline developed to construct {image, segmentation, text prompt} triplets
  • Novel architecture proposed to segment any visual entity mentioned in text prompt
  • Evaluation protocols are introduced to validate the effectiveness of open-vocabulary grounding
  • Image generation is a challenging task in computer vision
  • Generative adversarial networks (GANs), variational autoencoders (VAEs), flow-based models and autoregressive models (ARMs) have made progress
  • Diffusion Probabilistic Models (DMs) demonstrate state-of-the-art generation quality
  • Visual grounding is used to understand natural language queries and find target objects in an image

Methodology

  • Aim to introduce a knowledge induction procedure to convert an existing text-to-image diffusion model for grounded generation.
  • Core idea is to exploit image-segmentation pairs as visual demonstrations to build the general visual-language correspondence.
  • Architecture design to align visual and textual embedding under an open-vocabulary setting.
  • Training procedure based on {image, segmentation, text prompt} triplets.

Problem scenario

  • A strong text-to-image diffusion model is assumed.
  • The goal is to convert it into a grounded generation model.
  • The model takes noise and language description as input and generates an image with segmentation masks.
  • The model should be open-vocabulary, meaning it should be able to output segmentation masks for any objects.

Preliminary on diffusion model

  • Diffusion models are probabilistic generative models that learn a data distribution by gradually denoising randomly sampled Gaussian noise.
  • For text-to-image synthesis, the model is trained on a dataset of image-caption pairs to predict a denoised variant of its input, conditioned on the text prompt.
  • Stable Diffusion is a variant of diffusion model that encodes images with a variational autoencoder and transfers the diffusion process to latent space.
  • Stable Diffusion consists of three components: a text encoder, a pre-trained variational autoencoder, and a time-conditional UNet.
  • Training and inference of Stable Diffusion involves iteratively denoising a latent vector conditioned on the text prompt.
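
For reference, the noising step and training objective of a latent diffusion model are commonly written as below. The notation is the usual latent-diffusion convention rather than something taken from this summary: $\mathcal{E}$ is the VAE encoder, $\tau_\theta$ the text encoder, $\epsilon_\theta$ the time-conditional UNet, $y$ the text prompt, and $\bar\alpha_t$ the cumulative noise schedule.

```latex
\begin{align*}
z_0 &= \mathcal{E}(x), \qquad
z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon,
\quad \epsilon \sim \mathcal{N}(0, I) \\
\mathcal{L} &= \mathbb{E}_{z_0,\, y,\, \epsilon,\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, \tau_\theta(y)\right) \right\rVert_2^2 \right]
\end{align*}
```

At inference time the process runs in reverse: a latent is sampled from Gaussian noise, denoised step by step under the text condition, and decoded back to an image by the VAE decoder.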

Open-vocabulary grounding

Dataset construction

  • Introduce idea to construct training set with {visual feature, segmentation, text prompt} triplets
  • Develop automatic pipeline to construct {image, segmentation, text prompt} triplets
  • Visual feature obtained from Stable Diffusion via forward inference to generate image
  • Prepare vocabulary with common object categories
  • Randomly select number of classes to construct text prompt for image generation
  • Acquire segmentation masks with off-the-shelf object detector
  • Divide vocabulary set into seen and unseen categories
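
The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the vocabulary, the seen/unseen split, the prompt template text, and the `generate` and `detect` callables (standing in for Stable Diffusion and the off-the-shelf detector) are all hypothetical placeholders.

```python
import random

# Hypothetical vocabulary and split; the paper's actual category lists differ.
VOCABULARY = ["cat", "dog", "car", "bus", "chair", "sofa"]
UNSEEN = {"bus", "sofa"}
SEEN = [c for c in VOCABULARY if c not in UNSEEN]


def article(noun):
    """Return the indefinite article for a noun (simple vowel rule)."""
    return "an" if noun[0] in "aeiou" else "a"


def build_prompt(categories):
    """Fill an assumed template of the form 'a photograph of a X and a Y'."""
    parts = [f"{article(c)} {c}" for c in categories]
    return "a photograph of " + " and ".join(parts)


def sample_training_prompt(rng, seen=SEEN, max_objects=2):
    """Pick 1..max_objects distinct seen categories and build a prompt."""
    k = rng.randint(1, max_objects)
    categories = rng.sample(seen, k)
    return categories, build_prompt(categories)


def make_triplet(rng, generate, detect):
    """Build one {visual feature, segmentation, text prompt} triplet.

    generate(prompt) -> (image, features) and detect(image, category) -> mask
    are hypothetical stand-ins for the diffusion model and the detector.
    """
    categories, prompt = sample_training_prompt(rng)
    image, features = generate(prompt)
    masks = {c: detect(image, c) for c in categories}
    return {"features": features, "masks": masks, "prompt": prompt}
```

Because only `SEEN` is sampled here, unseen categories never appear in training prompts; a test set would be built the same way but drawing from the whole vocabulary.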

Architecture

  • Visual encoder takes visual representation from Stable Diffusion and concatenates features with same spatial resolution
  • Text encoder takes text prompt and outputs embeddings for all visual objects
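
As a rough illustration of the visual encoder's first step, the snippet below groups intermediate feature maps by spatial resolution and concatenates each group along the channel axis. The `(C, H, W)` layout and the function name `group_and_concat` are assumptions made for this sketch; which UNet layers are tapped is not specified in this summary.

```python
from collections import defaultdict

import numpy as np


def group_and_concat(feature_maps):
    """Concatenate (C, H, W) feature maps that share the same (H, W).

    Returns a dict mapping (H, W) -> array of shape (sum of C, H, W).
    """
    groups = defaultdict(list)
    for fmap in feature_maps:
        groups[fmap.shape[1:]].append(fmap)
    return {res: np.concatenate(maps, axis=0) for res, maps in groups.items()}
```

For example, two maps of shape `(4, 8, 8)` and `(6, 8, 8)` would be merged into a single `(10, 8, 8)` map, while a `(3, 16, 16)` map stays in its own group.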

Fusion module

  • Fusion module computes the interaction between visual features and text embeddings.
  • It outputs segmentation masks for all visual objects.
  • It uses a standard transformer decoder with three layers.
  • Text embeddings are treated as the queries.
  • The queries iteratively attend to the visual features and are updated at each layer.
  • The updated text embeddings are converted into per-segmentation embeddings with an MLP.
  • Object segmentation masks are obtained by taking the dot product of the visual features with the per-segmentation embeddings.
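
The steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: each decoder layer is reduced to single-head cross-attention with a residual connection (self-attention, layer normalisation, and feed-forward sublayers are omitted), the visual and text embeddings are assumed to share dimension `D`, and all weights are passed in by the caller.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, tokens, w_q, w_k, w_v):
    """Text queries (N, D) attend to visual tokens (HW, D); residual update."""
    q = queries @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, HW)
    return queries + attn @ v


def fusion_module(text_emb, visual_feat, layers, mlp):
    """Return mask logits of shape (N, H, W).

    text_emb:    (N, D) one embedding per object named in the prompt
    visual_feat: (D, H, W) visual feature map
    layers:      list of (w_q, w_k, w_v) weight triples, one per decoder layer
    mlp:         (w1, w2) weights of a two-layer MLP with ReLU
    """
    d, h, w = visual_feat.shape
    tokens = visual_feat.reshape(d, h * w).T        # (HW, D)
    queries = text_emb
    for w_q, w_k, w_v in layers:
        queries = cross_attention(queries, tokens, w_q, w_k, w_v)
    w1, w2 = mlp
    seg_emb = np.maximum(queries @ w1, 0.0) @ w2    # per-segmentation embeddings
    logits = seg_emb @ tokens.T                     # dot product with visual features
    return logits.reshape(-1, h, w)
```

Passing three weight triples in `layers` mirrors the three-layer decoder described above; applying a sigmoid to the returned logits gives per-pixel mask probabilities.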

Training

  • Constructed dataset used to train proposed grounding module
  • Sigmoid function used to predict segmentation masks
  • Two sources of errors: diffusion model and off-the-shelf detector
  • Two training strategies: Normal Training and Training w.o. Zero Masks
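
A minimal sketch of the mask loss follows, assuming a per-pixel binary cross-entropy on sigmoid outputs. The `skip_zero_masks` flag reflects one reading of "Training w.o. Zero Masks", namely that objects whose target mask is entirely zero (for instance because the detector found nothing) are excluded from the loss; that interpretation, and the function name, are assumptions rather than details confirmed by this summary.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mask_bce_loss(logits, targets, skip_zero_masks=False, eps=1e-7):
    """Mean binary cross-entropy over object masks.

    logits, targets: arrays of shape (N, H, W); targets hold 0/1 values.
    With skip_zero_masks=True, objects whose target mask is all zeros
    are left out of the average (assumed meaning of the second strategy).
    """
    probs = np.clip(sigmoid(logits), eps, 1.0 - eps)
    per_pixel = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    per_mask = per_pixel.mean(axis=(1, 2))
    if skip_zero_masks:
        keep = targets.reshape(len(targets), -1).any(axis=1)
        if not keep.any():
            return 0.0  # nothing to supervise in this batch
        per_mask = per_mask[keep]
    return float(per_mask.mean())
```

With the flag off, an empty target mask pushes every pixel of that object towards zero, so a missed detection is treated as a true negative; with the flag on, such objects contribute no gradient at all.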

Experiments

  • Train grounding module with constructed training set
  • Test segmentation performance on generated images from Stable Diffusion
  • Use guided diffusion model to construct synthesized semantic segmentation dataset
  • Train segmentation model on synthesized dataset
  • Evaluate model on existing benchmarks for zero-shot segmentation
  • Conduct ablation studies on different training strategies

Protocol-I: grounded generation

  • Training set consists only of a subset of common (seen) categories; the test set covers both seen and unseen categories
  • Two different sets of categories are adopted: PASCAL VOC and MS-COCO
  • Evaluation metric is category-wise mean intersection-over-union (mIoU)
  • Model outperforms unsupervised method DAAM
  • Model achieves superior performance on both seen and unseen categories
  • Visualization results demonstrate successful grounding of objects in terms of segmentation mask
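
The category-wise mIoU metric can be computed along the following lines. This sketch assumes intersections and unions are accumulated per category over the whole test set before dividing, and then averaged over categories; whether the paper accumulates this way or averages per image is not stated in this summary.

```python
import numpy as np


def category_miou(samples):
    """Compute per-category IoU and its mean.

    samples: iterable of (category, pred_mask, gt_mask) with binary masks
    of matching shape. Categories with an empty union are skipped.
    """
    inter, union = {}, {}
    for category, pred, gt in samples:
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter[category] = inter.get(category, 0) + int(np.logical_and(pred, gt).sum())
        union[category] = union.get(category, 0) + int(np.logical_or(pred, gt).sum())
    ious = {c: inter[c] / union[c] for c in inter if union[c] > 0}
    return ious, float(np.mean(list(ious.values())))
```

Reporting the mean separately over seen and unseen categories only requires filtering `samples` by category before calling the function.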

Protocol-II: open-vocabulary segmentation

  • Constructed a synthesized image-segmentation dataset with guided Stable Diffusion
  • Trained a semantic segmentation model on the synthetic dataset
  • Evaluated the model on public image segmentation benchmarks
  • Evaluated the effectiveness of the grounding module from the performance on segmenting unseen categories

  • Evaluation uses PASCAL VOC only, on its test set (1,449 images)

  • Model uses MaskFormer with a ResNet-101 backbone
  • Trained on synthetic dataset for 40k iterations with batch size 8
  • Model outperforms most existing zero-shot semantic segmentation approaches
  • Model obtains accurate segmentation on both seen and unseen categories

Ablation study

  • Normal Training results in unsatisfactory performance on unseen categories
  • Training without Zero Masks achieves equally good performance on both seen and unseen categories
  • Grounding performance decreases as denoising steps decrease; the best result is obtained at t = 5

Conclusion

  • Propose a novel idea for guiding the existing Stable Diffusion towards open-vocabulary grounded generation
  • Introduce a grounding module to explicitly align the visual and textual embedding space of the Stable Diffusion
  • Train the grounding module with an automatically constructed dataset of {image, segmentation, text prompts} triplets
  • Visual-language correspondence can be established with only training on a limited number of object categories
  • Generate a synthetic semantic segmentation dataset using the guided Stable Diffusion
  • Train a semantic segmentation model without finetuning
  • Show competitive performance to existing zero-shot semantic segmentation approaches on PASCAL VOC dataset
  • Grounding module consists of visual encoder, text encoder, transformer decoder and MLP in the fusion module
  • Construct the training set by randomly selecting one or two categories from the seen ones and using the prompt template
  • Construct the test set by using all categories, including seen and unseen categories
  • Synthetic semantic segmentation dataset consists of 500 images per category and 71 images per co-appearing category pair
  • Training on the combination of one and two object categories gives the best results overall
  • Grounding module can generalise to unseen categories, even with as few as five seen categories
  • Construct the training set by utilising the inverse process of diffusion
  • Compare the performance of training on constructed dataset and real dataset
  • Provide qualitative results of generated images and segmentation masks