Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Diffusion models are a type of generative model that can be used to create images from text.
VPD is a new framework that uses a pre-trained text-to-image diffusion model for visual perception tasks.
VPD uses an adapter to refine text features and cross-attention maps to provide guidance.
VPD achieves good results on semantic segmentation, referring image segmentation and depth estimation.

Paper Content

Introduction

Large text-to-image diffusion models can generate high-quality images with rich texture, diverse content and reasonable structures.
Text-to-image diffusion models can implicitly learn both high-level and low-level visual concepts from massive image-text pairs.
Text-to-image models are designed to generate high-fidelity images based on textual prompts.
Conventional visual pre-training methods aim to encode the input image as latent representations.
Text-to-image models take as input random noises and text prompts and aim to produce images through a progressive denoising process.
Extracting the visual knowledge learned by large diffusion models for visual perception tasks is non-trivial.
VPD framework proposed to adapt pre-trained diffusion models for visual perception tasks.
VPD framework evaluated on three visual perception tasks: semantic segmentation, referring image segmentation, and depth estimation.
VPD framework outperforms existing pre-training methods on visual perception tasks.

Diffusion models are a new family of generative models with good synthesis quality and controllability
Training a denoising autoencoder can be used to learn the inverse of a Markovian diffusion process
Sampling from a diffusion model is a progressive denoising procedure
Latent diffusion models reduce computational costs and allow for training text-to-image diffusion models
Pre-training and fine-tuning can be used to facilitate downstream visual perception tasks
Text-to-image generation can be used as an alternative for visual pre-training

Method

Presents VPD, a new framework for visual perception
Leverages pre-trained text-to-image diffusion model to extract high-level knowledge

Preliminaries: diffusion models

Diffusion models are a new family of generative models that can reconstruct the distribution of data.
Diffusion models are modeled as a Markov process.
Training objective of diffusion models can be derived as an autoencoder.
Recently, a text-to-image model has demonstrated remarkable performance on image synthesis controlled by natural language.

Prompting text-to-image diffusion model

Pre-trained diffusion model contains enough information to sample from data distribution
Text-to-image model has enough high-level knowledge due to weak supervision of natural language during pretraining
Goal is to exploit knowledge of well-trained text-conditioned model and transfer to downstream visual perception tasks
Prediction model rewritten as p(y|x), where y is task-specific label and x is input image
Build connection between task-specific label and natural language to extract learned semantic information
Text features extracted using CLIP text encoder and text adapter implemented as two-layer MLP
Prediction head implemented as Semantic FPN
Method not diffusion-based framework, only uses single UNet as backbone

Semantic guidance via cross-attention

Proposed to use cross-attention map as explicit semantic guidance
Cross-attention maps have good locality
Averaged cross-attention map aggregates semantic information of a certain category
Concatenate averaged cross-attention maps with original hierarchical feature maps
Explicit semantic guidance helps model adapt to downstream tasks

Implementation

Three visual perception tasks are considered: semantic segmentation, referring image segmentation, and depth estimation.
Similar architecture is used for all tasks, but with minor design differences.
Procedure to obtain conditioning inputs C differs for each task.
Output channels of task-specific head p φ3 (y|F) are different.
Training objectives for the three tasks are varied.

Experiments

Conducted experiments on three visual tasks
Tasks cover high-level and low-level visual perception
Results and ablation studies provided

Experiment setups

Common configurations of VPD are provided
VQGAN encoder and CLIP text encoder are fixed during training
Learning rate of θ is set to 1/10 of base learning rate
γ is set to 1e-4 for text adapter
Semantic Segmentation is evaluated on ADE20K
Referring Image Segmentation is evaluated on RefCOCO, Ref-COCO+, and G-Ref
Depth Estimation is evaluated on NYUv2
Training batch size is 16 for Semantic Segmentation, 32 for Referring Image Segmentation, and 24 for Depth Estimation
Learning rate is 1e-4 for Semantic Segmentation, 5e-5 for Referring Image Segmentation, and 5e-4 for Depth Estimation

Main results

Evaluated VPD on three downstream tasks: semantic segmentation, referring image segmentation, and depth estimation
Semantic segmentation: VPD outperformed pre-trained ConvNeXt-XL model with 53.7 mIoU ss and 54.6 mIoU ms
Referring image segmentation: VPD outperformed previous methods with large margins on RefCOCO, RefCOCO+, and G-Ref datasets
VPD achieved better results due to pre-trained language models that interact with the visual modality
VPD can quickly adapt to downstream visual perception tasks

Analysis

Evaluated effectiveness of components of VPD
Found that text prompts can build connection between visual and language domains
Text adapter and cross-attention maps improved performance
Cross-attention maps from upsampling blocks more beneficial
Pre-trained weights of text-to-image diffusion model affect performance
Computational cost of VPD is high

Conclusion

Proposed a new framework called VPD to transfer high-level knowledge of pre-trained text-to-image diffusion model to downstream tasks
Encouraged visual-language alignment and prompted pre-trained model implicitly and explicitly
Experiments on semantic segmentation, referring image segmentation, and depth estimation showed VPD can achieve competitive performance and faster convergence
Exploited low-level and high-level pre-trained knowledge and can be applied to a variety of visual perception tasks
Compared performance of VPD with previous models with different architectures and pre-training methods

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Preliminaries: diffusion models#

Prompting text-to-image diffusion model#

Semantic guidance via cross-attention#

Implementation#

Experiments#

Experiment setups#

Main results#

Analysis#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Preliminaries: diffusion models

Prompting text-to-image diffusion model

Semantic guidance via cross-attention

Implementation

Experiments

Experiment setups

Main results

Analysis

Conclusion