Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models are a type of generative model that can be used to create images from text.
- VPD is a new framework that uses a pre-trained text-to-image diffusion model for visual perception tasks.
- VPD uses an adapter to refine text features and cross-attention maps to provide guidance.
- VPD achieves good results on semantic segmentation, referring image segmentation and depth estimation.
Paper Content
Introduction
- Large text-to-image diffusion models can generate high-quality images with rich texture, diverse content and reasonable structures.
- Text-to-image diffusion models can implicitly learn both high-level and low-level visual concepts from massive image-text pairs.
- Text-to-image models are designed to generate high-fidelity images based on textual prompts.
- Conventional visual pre-training methods aim to encode the input image as latent representations.
- Text-to-image models take as input random noises and text prompts and aim to produce images through a progressive denoising process.
- Extracting the visual knowledge learned by large diffusion models for visual perception tasks is non-trivial.
- VPD framework proposed to adapt pre-trained diffusion models for visual perception tasks.
- VPD framework evaluated on three visual perception tasks: semantic segmentation, referring image segmentation, and depth estimation.
- VPD framework outperforms existing pre-training methods on visual perception tasks.
Related work
- Diffusion models are a new family of generative models with good synthesis quality and controllability
- Training a denoising autoencoder can be used to learn the inverse of a Markovian diffusion process
- Sampling from a diffusion model is a progressive denoising procedure
- Latent diffusion models reduce computational costs and allow for training text-to-image diffusion models
- Pre-training and fine-tuning can be used to facilitate downstream visual perception tasks
- Text-to-image generation can be used as an alternative for visual pre-training
Method
- Presents VPD, a new framework for visual perception
- Leverages pre-trained text-to-image diffusion model to extract high-level knowledge
Preliminaries: diffusion models
- Diffusion models are a new family of generative models that can reconstruct the distribution of data.
- Diffusion models are modeled as a Markov process.
- Training objective of diffusion models can be derived as an autoencoder.
- Recently, a text-to-image model has demonstrated remarkable performance on image synthesis controlled by natural language.
Prompting text-to-image diffusion model
- Pre-trained diffusion model contains enough information to sample from data distribution
- Text-to-image model has enough high-level knowledge due to weak supervision of natural language during pretraining
- Goal is to exploit knowledge of well-trained text-conditioned model and transfer to downstream visual perception tasks
- Prediction model rewritten as p(y|x), where y is task-specific label and x is input image
- Build connection between task-specific label and natural language to extract learned semantic information
- Text features extracted using CLIP text encoder and text adapter implemented as two-layer MLP
- Prediction head implemented as Semantic FPN
- Method not diffusion-based framework, only uses single UNet as backbone
Semantic guidance via cross-attention
- Proposed to use cross-attention map as explicit semantic guidance
- Cross-attention maps have good locality
- Averaged cross-attention map aggregates semantic information of a certain category
- Concatenate averaged cross-attention maps with original hierarchical feature maps
- Explicit semantic guidance helps model adapt to downstream tasks
Implementation
- Three visual perception tasks are considered: semantic segmentation, referring image segmentation, and depth estimation.
- Similar architecture is used for all tasks, but with minor design differences.
- Procedure to obtain conditioning inputs C differs for each task.
- Output channels of task-specific head p φ3 (y|F) are different.
- Training objectives for the three tasks are varied.
Experiments
- Conducted experiments on three visual tasks
- Tasks cover high-level and low-level visual perception
- Results and ablation studies provided
Experiment setups
- Common configurations of VPD are provided
- VQGAN encoder and CLIP text encoder are fixed during training
- Learning rate of θ is set to 1/10 of base learning rate
- γ is set to 1e-4 for text adapter
- Semantic Segmentation is evaluated on ADE20K
- Referring Image Segmentation is evaluated on RefCOCO, Ref-COCO+, and G-Ref
- Depth Estimation is evaluated on NYUv2
- Training batch size is 16 for Semantic Segmentation, 32 for Referring Image Segmentation, and 24 for Depth Estimation
- Learning rate is 1e-4 for Semantic Segmentation, 5e-5 for Referring Image Segmentation, and 5e-4 for Depth Estimation
Main results
- Evaluated VPD on three downstream tasks: semantic segmentation, referring image segmentation, and depth estimation
- Semantic segmentation: VPD outperformed pre-trained ConvNeXt-XL model with 53.7 mIoU ss and 54.6 mIoU ms
- Referring image segmentation: VPD outperformed previous methods with large margins on RefCOCO, RefCOCO+, and G-Ref datasets
- VPD achieved better results due to pre-trained language models that interact with the visual modality
- VPD can quickly adapt to downstream visual perception tasks
Analysis
- Evaluated effectiveness of components of VPD
- Found that text prompts can build connection between visual and language domains
- Text adapter and cross-attention maps improved performance
- Cross-attention maps from upsampling blocks more beneficial
- Pre-trained weights of text-to-image diffusion model affect performance
- Computational cost of VPD is high
Conclusion
- Proposed a new framework called VPD to transfer high-level knowledge of pre-trained text-to-image diffusion model to downstream tasks
- Encouraged visual-language alignment and prompted pre-trained model implicitly and explicitly
- Experiments on semantic segmentation, referring image segmentation, and depth estimation showed VPD can achieve competitive performance and faster convergence
- Exploited low-level and high-level pre-trained knowledge and can be applied to a variety of visual perception tasks
- Compared performance of VPD with previous models with different architectures and pre-training methods