Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent advances in diffusion models have set a milestone in many generation tasks
New approaches focus on extensions and performance rather than capacity
Versatile Diffusion (VD) is a multi-flow network that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model
VD has competitive quality, novel extensions and applications, and provides more semantic insights of the generated outputs
Code and models are open-sourced

Paper Content

Introduction

Multi-modality is a challenge for computer vision and machine learning
Deep learning has improved accuracy of traditional tasks
Multi-modal research has focused on discriminative tasks
Generative tasks of a large scope are challenging
GAN research has focused on specific domains and tasks
Diffusion models have been successful across many domains and tasks
Diffusion models have robust training objectives
Diffusion models have competitive performance
Diffusion models have disadvantages such as data hunger and high inference costs
Diffusion models have achieved smooth translation in cross-modal latent spaces
Versatile Diffusion (VD) is introduced to solve text, images, and variations in one unified model

Multi-modalities are unions of information with different forms, including vision, text, audio, etc.
Early deep learning work learned a fused representation for audio and video.
Zero-shot learning maps images on semantic space from which unseen category labels can be predicted.
Multimodal approaches increase classification accuracy via multimodal training.
Multimodal training is also used in detection and segmentation.
VQA conducts cross-modal reasoning that transfers visual concepts into linguistic answers.
Multimodal generative tasks are formalized as representation learning plus generation.
Diffusion models consolidate multiple methods including VAEs, Markov chains, and score matching models.
Recent works have improved sampling quality and efficiency for text-to-image generation.

Method

Review fundamentals of diffusion models
Present multi-flow multimodal framework
Explain Versatile Diffusion (VD)

Diffusion basics

Forward diffusion process is a Markov Chain with T steps that gradually degrade x 0 to x T with random Gaussian noises.
Backward diffusion process is used to recover signal x 0 by removing the added Gaussian noises.
Objective function to train a diffusion model is to minimize the variational bound for negative loglikelihood.
In practice, many works assume deterministic α t and β t for step t in Equation 1.

Multi-flow multimodal diffusion framework

Proposed framework is a multi-flow network with various types of data as input and context
Framework closely follows Latent Diffusion Model (LDM) and Stable Diffusion (SD)
Framework inherits merits of LDM/SD with interpretable latent space, modulized structure, and lower computation cost
Framework is designed to jointly train multiple flows, each representing a crossmodal task
Diffuser layers are grouped into global, data, and context layers

Versatile diffusion

VD is a unified diffusion model for text-to-image, image-variation, image-to-text, and text-variation
VD contains two full streams of VAEs, diffusers, and context encoders
Diffuser uses UNet with cross attentions
VAE uses Autoencoder-KL for image data and Optimus for text data
Context encoder uses CLIP text and image encoders

Experiments

Describes training data and settings for VD
Shows performance of VD on supported tasks
Introduces novel downstream applications enabled by VD

Dataset

Used Laion2B-en as VD’s training dataset
Laion2B-en is a collection of nearly two billion images with English captions
Images and captions were filtered using criteria such as CLIP similarity, safety scores, watermark probability, aspect ratios, and image area
Caption cleaning algorithm was used to train VD on image-to-text and text-variation tasks

Training

Proposed multi-flow multimodal framework
Trained VD with three settings: basic, dualcontext (DC), and official
VD-basic is an image-variation model with a single-flow
VD-DC is a two-flow model that supports text-to-image and image-variation
VD-official is a four-flow model that includes two more tasks, i.e. image-to-text and text-variation
Used pre-trained weights from SD checkpoint v1.4
Different gradient multipliers for different layers and streams
Initially trained on resolution 256 for 30 million samples and further trained on resolution 512 for 6.4 million samples

Performance

Introduced multi-flow multimodal diffusion models
Compared VD’s results with baseline models
Conducted qualitative comparisons between different VD models
Concluded that VD handles all subtasks well

Disentanglement of style and semantic

VD can enhance or reduce image styles without supervision
Exploring disentanglement between styles and semantics on images with arbitrary contents and styles
Prior works explored similar properties in GAN latent spaces, but only on well-aligned data
VD-DC and VD-official serve similar disentanglement performance, VD-basic has slightly decreased results

Dual-guided generation

Dual-guided generation is a downstream application that VD supports.
VD can generate outputs conditioned on both image and text.
Model ensembling is a simple baseline, but results are unsatisfactory.
VD can guide cross-modal conditionings on a deeper level.
Attention-level mixing on VD yields the best performance.

Editable i2t2i

VD supports image-to-text and text-to-image
Prototype experiment to edit images from text prompts
No masks needed, automatically locates and substitutes objects
Output images do not match input images pixel by pixel
First group to conduct image editing task combining image-to-text, text editing, and text-to-image

Conclusion

Proposed a novel diffusion model, Versatile Diffusion, that handles text, image, and variations all in one
Proposed a multi-flow multimodal framework that can be extended to new tasks and domains
Experiments and applications demonstrate that VD performs well on all supported tasks
Core strategy of the disentanglement is to manipulate the 257x768 CLIP image context embedding
Split vector into global vector and 256 local vectors
Major principal components of the matrix hold the style information, remaining principal components hold the semantic information
Generate image variations with style focuses from guidance of low-rank context embedding
Generate image variations with semantic focuses by removing major principal components from context embeddings
Dual-guided generation for VD is to generate images or sentences through guidance of both image context and prompt context
Mixing strategies: layer-level or context-level
Resolve conflict between contexts with attention-level mixing
Editable I2T2I application to modify latent text vectors
Limited latent space of Optimus VAE and imperfect data limit VD’s performance
Future research directions: expand scope and capacity of Optimus VAE, prepare finetuned dataset, prepare finetuned text VAE

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Method#

Diffusion basics#

Multi-flow multimodal diffusion framework#

Versatile diffusion#

Experiments#

Dataset#

Training#

Performance#

Disentanglement of style and semantic#

Dual-guided generation#

Editable i2t2i#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Method

Diffusion basics

Multi-flow multimodal diffusion framework

Versatile diffusion

Experiments

Dataset

Training

Performance

Disentanglement of style and semantic

Dual-guided generation

Editable i2t2i

Conclusion