Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Diffusion models are a type of generative model that can be used to create images from text.
  • VPD is a new framework that uses a pre-trained text-to-image diffusion model for visual perception tasks.
  • VPD uses an adapter to refine text features and cross-attention maps to provide guidance.
  • VPD achieves good results on semantic segmentation, referring image segmentation and depth estimation.

Paper Content

Introduction

  • Large text-to-image diffusion models can generate high-quality images with rich texture, diverse content and reasonable structures.
  • Text-to-image diffusion models can implicitly learn both high-level and low-level visual concepts from massive image-text pairs.
  • Text-to-image models are designed to generate high-fidelity images based on textual prompts.
  • Conventional visual pre-training methods aim to encode the input image as latent representations.
  • Text-to-image models take as input random noises and text prompts and aim to produce images through a progressive denoising process.
  • Extracting the visual knowledge learned by large diffusion models for visual perception tasks is non-trivial.
  • VPD framework proposed to adapt pre-trained diffusion models for visual perception tasks.
  • VPD framework evaluated on three visual perception tasks: semantic segmentation, referring image segmentation, and depth estimation.
  • VPD framework outperforms existing pre-training methods on visual perception tasks.
  • Diffusion models are a new family of generative models with good synthesis quality and controllability
  • Training a denoising autoencoder can be used to learn the inverse of a Markovian diffusion process
  • Sampling from a diffusion model is a progressive denoising procedure
  • Latent diffusion models reduce computational costs and allow for training text-to-image diffusion models
  • Pre-training and fine-tuning can be used to facilitate downstream visual perception tasks
  • Text-to-image generation can be used as an alternative for visual pre-training

Method

  • Presents VPD, a new framework for visual perception
  • Leverages pre-trained text-to-image diffusion model to extract high-level knowledge

Preliminaries: diffusion models

  • Diffusion models are a new family of generative models that can reconstruct the distribution of data.
  • Diffusion models are modeled as a Markov process.
  • Training objective of diffusion models can be derived as an autoencoder.
  • Recently, a text-to-image model has demonstrated remarkable performance on image synthesis controlled by natural language.

Prompting text-to-image diffusion model

  • Pre-trained diffusion model contains enough information to sample from data distribution
  • Text-to-image model has enough high-level knowledge due to weak supervision of natural language during pretraining
  • Goal is to exploit knowledge of well-trained text-conditioned model and transfer to downstream visual perception tasks
  • Prediction model rewritten as p(y|x), where y is task-specific label and x is input image
  • Build connection between task-specific label and natural language to extract learned semantic information
  • Text features extracted using CLIP text encoder and text adapter implemented as two-layer MLP
  • Prediction head implemented as Semantic FPN
  • Method not diffusion-based framework, only uses single UNet as backbone

Semantic guidance via cross-attention

  • Proposed to use cross-attention map as explicit semantic guidance
  • Cross-attention maps have good locality
  • Averaged cross-attention map aggregates semantic information of a certain category
  • Concatenate averaged cross-attention maps with original hierarchical feature maps
  • Explicit semantic guidance helps model adapt to downstream tasks

Implementation

  • Three visual perception tasks are considered: semantic segmentation, referring image segmentation, and depth estimation.
  • Similar architecture is used for all tasks, but with minor design differences.
  • Procedure to obtain conditioning inputs C differs for each task.
  • Output channels of task-specific head p φ3 (y|F) are different.
  • Training objectives for the three tasks are varied.

Experiments

  • Conducted experiments on three visual tasks
  • Tasks cover high-level and low-level visual perception
  • Results and ablation studies provided

Experiment setups

  • Common configurations of VPD are provided
  • VQGAN encoder and CLIP text encoder are fixed during training
  • Learning rate of θ is set to 1/10 of base learning rate
  • γ is set to 1e-4 for text adapter
  • Semantic Segmentation is evaluated on ADE20K
  • Referring Image Segmentation is evaluated on RefCOCO, Ref-COCO+, and G-Ref
  • Depth Estimation is evaluated on NYUv2
  • Training batch size is 16 for Semantic Segmentation, 32 for Referring Image Segmentation, and 24 for Depth Estimation
  • Learning rate is 1e-4 for Semantic Segmentation, 5e-5 for Referring Image Segmentation, and 5e-4 for Depth Estimation

Main results

  • Evaluated VPD on three downstream tasks: semantic segmentation, referring image segmentation, and depth estimation
  • Semantic segmentation: VPD outperformed pre-trained ConvNeXt-XL model with 53.7 mIoU ss and 54.6 mIoU ms
  • Referring image segmentation: VPD outperformed previous methods with large margins on RefCOCO, RefCOCO+, and G-Ref datasets
  • VPD achieved better results due to pre-trained language models that interact with the visual modality
  • VPD can quickly adapt to downstream visual perception tasks

Analysis

  • Evaluated effectiveness of components of VPD
  • Found that text prompts can build connection between visual and language domains
  • Text adapter and cross-attention maps improved performance
  • Cross-attention maps from upsampling blocks more beneficial
  • Pre-trained weights of text-to-image diffusion model affect performance
  • Computational cost of VPD is high

Conclusion

  • Proposed a new framework called VPD to transfer high-level knowledge of pre-trained text-to-image diffusion model to downstream tasks
  • Encouraged visual-language alignment and prompted pre-trained model implicitly and explicitly
  • Experiments on semantic segmentation, referring image segmentation, and depth estimation showed VPD can achieve competitive performance and faster convergence
  • Exploited low-level and high-level pre-trained knowledge and can be applied to a variety of visual perception tasks
  • Compared performance of VPD with previous models with different architectures and pre-training methods