Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent advances in diffusion models have set a milestone in many generation tasks
  • Recent approaches focus on extensions and performance rather than capacity, so separate tasks require separate models
  • Versatile Diffusion (VD) is a multi-flow network that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model
  • VD has competitive quality, enables novel extensions and applications, and provides more semantic insight into the generated outputs
  • Code and models are open-sourced

Paper Content

Introduction

  • Multi-modality is a challenge for computer vision and machine learning
  • Deep learning has improved accuracy on traditional tasks
  • Multi-modal research has focused on discriminative tasks
  • Generative tasks of a large scope are challenging
  • GAN research has focused on specific domains and tasks
  • Diffusion models have been successful across many domains and tasks
  • Diffusion models have robust training objectives
  • Diffusion models have competitive performance
  • Diffusion models have disadvantages such as data hunger and high inference costs
  • Diffusion models have achieved smooth translation in cross-modal latent spaces
  • Versatile Diffusion (VD) is introduced to solve text, images, and variations in one unified model
  • Multi-modalities are unions of information with different forms, including vision, text, audio, etc.
  • Early deep learning work learned a fused representation for audio and video.
  • Zero-shot learning maps images into a semantic space from which unseen category labels can be predicted.
  • Multimodal approaches increase classification accuracy via multimodal training.
  • Multimodal training is also used in detection and segmentation.
  • VQA conducts cross-modal reasoning that transfers visual concepts into linguistic answers.
  • Multimodal generative tasks are formalized as representation learning plus generation.
  • Diffusion models consolidate multiple methods including VAEs, Markov chains, and score matching models.
  • Recent works have improved sampling quality and efficiency for text-to-image generation.

Method

  • Review fundamentals of diffusion models
  • Present multi-flow multimodal framework
  • Explain Versatile Diffusion (VD)

Diffusion basics

  • Forward diffusion process is a Markov chain with T steps that gradually degrades x_0 to x_T by adding random Gaussian noise.
  • Backward diffusion process recovers the signal x_0 by removing the added Gaussian noise.
  • Objective function to train a diffusion model is to minimize the variational bound on the negative log-likelihood.
  • In practice, many works assume deterministic α_t and β_t for each step t in Equation 1 of the paper.
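
The paper's Equation 1 is not reproduced in this summary. As a minimal sketch of the forward process described above, the following NumPy code assumes the standard DDPM parameterization: a deterministic linear β schedule, α_t = 1 − β_t, and the closed form x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, where ᾱ_t is the cumulative product of α. The schedule endpoints, array shapes, and function names are illustrative assumptions, not values taken from the paper. The loss shown is the commonly used noise-prediction simplification of the variational bound.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Deterministic linear beta schedule (assumed endpoints, common in DDPM setups)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative product of alpha up to each step
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in closed form; t is a 0-based step index."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def simple_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8))  # stand-in for a batch of latents
xt, eps = q_sample(x0, t=999, alpha_bars=alpha_bars, rng=rng)
```

At the last step, ᾱ_T is close to zero, so x_T is almost pure Gaussian noise. That is the degraded end of the chain the backward process starts from.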

Multi-flow multimodal diffusion framework

  • Proposed framework is a multi-flow network with various types of data as input and context
  • Framework closely follows Latent Diffusion Model (LDM) and Stable Diffusion (SD)
  • Framework inherits the merits of LDM/SD: an interpretable latent space, a modularized structure, and lower computation cost
  • Framework is designed to jointly train multiple flows, each representing a cross-modal task
  • Diffuser layers are grouped into global, data, and context layers
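
The grouping into global, data, and context layers can be sketched as a routing rule. The sketch below assumes that global layers are always active, data layers are selected by the type of output being generated, and context layers are selected by the type of conditioning input. The layer names and the `Layer` class are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Each flow is identified by (context type, output data type).
FLOWS = {
    "text-to-image": ("text", "image"),
    "image-variation": ("image", "image"),
    "image-to-text": ("image", "text"),
    "text-variation": ("text", "text"),
}

@dataclass(frozen=True)
class Layer:
    name: str
    group: str          # "global", "data", or "context"
    modality: str = ""  # modality served by a data/context layer; empty for global

def active_layers(layers, flow):
    """Return the names of the layers that one flow would run through."""
    context_type, data_type = FLOWS[flow]
    selected = []
    for layer in layers:
        if layer.group == "global":
            selected.append(layer.name)  # shared by every flow
        elif layer.group == "data" and layer.modality == data_type:
            selected.append(layer.name)  # chosen by the output type
        elif layer.group == "context" and layer.modality == context_type:
            selected.append(layer.name)  # chosen by the conditioning type
    return selected

# Hypothetical layer inventory for one diffuser block.
layers = [
    Layer("time_embed", "global"),
    Layer("image_resblock", "data", "image"),
    Layer("text_resblock", "data", "text"),
    Layer("image_cross_attn", "context", "image"),
    Layer("text_cross_attn", "context", "text"),
]

print(active_layers(layers, "text-to-image"))
# ['time_embed', 'image_resblock', 'text_cross_attn']
```

Under this reading, adding a new flow means adding an entry to the flow table and, where needed, new data or context layers, while the global layers stay shared.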

Versatile diffusion

  • VD is a unified diffusion model for text-to-image, image-variation, image-to-text, and text-variation
  • VD contains two full streams of VAEs, diffusers, and context encoders
  • Diffuser uses a UNet with cross-attention
  • VAE uses Autoencoder-KL for image data and Optimus for text data
  • Context encoder uses CLIP text and image encoders

Experiments

  • Describes training data and settings for VD
  • Shows performance of VD on supported tasks
  • Introduces novel downstream applications enabled by VD

Dataset

  • Used Laion2B-en as VD’s training dataset
  • Laion2B-en is a collection of nearly two billion images with English captions
  • Images and captions were filtered using criteria such as CLIP similarity, safety scores, watermark probability, aspect ratios, and image area
  • Caption cleaning algorithm was used to train VD on image-to-text and text-variation tasks
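
The filtering criteria listed above can be read as a simple per-sample predicate. The sketch below is illustrative only: this summary does not state the thresholds, so every numeric value is a hypothetical placeholder, and the `Sample` fields and the direction of the safety score are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    caption: str
    clip_similarity: float  # image-caption CLIP similarity
    safety_score: float     # assumed convention: higher means more likely unsafe
    watermark_prob: float   # estimated probability that the image is watermarked
    width: int
    height: int

# Hypothetical thresholds, chosen only to make the sketch runnable.
MIN_CLIP_SIMILARITY = 0.3
MAX_SAFETY_SCORE = 0.3
MAX_WATERMARK_PROB = 0.3
MIN_ASPECT, MAX_ASPECT = 0.6, 1.67  # width / height
MIN_AREA = 256 * 256                # pixels

def keep(s: Sample) -> bool:
    """Return True if the sample passes every filter."""
    aspect = s.width / s.height
    return (
        s.clip_similarity >= MIN_CLIP_SIMILARITY
        and s.safety_score <= MAX_SAFETY_SCORE
        and s.watermark_prob <= MAX_WATERMARK_PROB
        and MIN_ASPECT <= aspect <= MAX_ASPECT
        and s.width * s.height >= MIN_AREA
    )

samples = [
    Sample("a red bicycle", 0.35, 0.05, 0.10, 512, 512),
    Sample("stock photo", 0.40, 0.05, 0.90, 512, 512),  # likely watermarked
    Sample("panorama", 0.33, 0.02, 0.05, 2048, 256),    # too wide
]
kept = [s.caption for s in samples if keep(s)]
```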

Training

  • Proposed multi-flow multimodal framework
  • Trained VD with three settings: basic, dual-context (DC), and official
  • VD-basic is an image-variation model with a single flow
  • VD-DC is a two-flow model that supports text-to-image and image-variation
  • VD-official is a four-flow model that includes two more tasks, i.e. image-to-text and text-variation
  • Used pre-trained weights from SD checkpoint v1.4
  • Different gradient multipliers for different layers and streams
  • Initially trained on resolution 256 for 30 million samples and further trained on resolution 512 for 6.4 million samples
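
One way to read "different gradient multipliers for different layers and streams" is that each parameter's gradient is scaled by a per-group factor before the optimizer update. The sketch below shows that idea with plain SGD on scalar parameters. The group names, parameter names, and multiplier values are hypothetical. The assumption that pre-trained layers receive the smaller multiplier is an illustration, not a statement from the paper.

```python
def scaled_sgd_step(params, grads, group_of, multipliers, lr):
    """Plain SGD where each gradient is scaled by its group's multiplier."""
    return {
        name: value - lr * multipliers[group_of[name]] * grads[name]
        for name, value in params.items()
    }

# Hypothetical parameters, groups, and multiplier values.
params = {"image_resblock.w": 1.0, "text_resblock.w": 1.0}
grads = {"image_resblock.w": 1.0, "text_resblock.w": 1.0}
group_of = {"image_resblock.w": "pretrained", "text_resblock.w": "new"}
multipliers = {"pretrained": 0.1, "new": 1.0}

updated = scaled_sgd_step(params, grads, group_of, multipliers, lr=0.5)
```

With these values, the parameter in the "pretrained" group moves one tenth as far as the one in the "new" group for the same gradient.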

Performance

  • Introduced multi-flow multimodal diffusion models
  • Compared VD’s results with baseline models
  • Conducted qualitative comparisons between different VD models
  • Concluded that VD handles all subtasks well

Disentanglement of style and semantic

  • VD can enhance or reduce image styles without supervision
  • Exploring disentanglement between styles and semantics on images with arbitrary contents and styles
  • Prior works explored similar properties in GAN latent spaces, but only on well-aligned data
  • VD-DC and VD-official achieve similar disentanglement performance, while VD-basic gives slightly worse results

Dual-guided generation

  • Dual-guided generation is a downstream application that VD supports.
  • VD can generate outputs conditioned on both image and text.
  • Model ensembling is a simple baseline, but results are unsatisfactory.
  • VD can guide cross-modal conditionings on a deeper level.
  • Attention-level mixing on VD yields the best performance.
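
As a rough illustration of attention-level mixing, the sketch below runs cross-attention once per context and blends the two outputs. It is a simplified stand-in under stated assumptions: single-head attention with no learned projections, linear interpolation as the mixing rule, and small toy dimensions. The function names and the `image_weight` parameter are hypothetical, and the paper's exact formulation is not given in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head scaled dot-product attention; the context serves as keys and values."""
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context

def attention_level_mix(query, image_ctx, text_ctx, image_weight=0.5):
    """Attend to each context separately, then interpolate the attention outputs."""
    out_image = cross_attention(query, image_ctx)
    out_text = cross_attention(query, text_ctx)
    return image_weight * out_image + (1.0 - image_weight) * out_text

rng = np.random.default_rng(0)
query = rng.standard_normal((64, 16))       # toy latent tokens
image_ctx = rng.standard_normal((257, 16))  # toy image context tokens
text_ctx = rng.standard_normal((77, 16))    # toy text context tokens
mixed = attention_level_mix(query, image_ctx, text_ctx, image_weight=0.6)
```

In contrast to ensembling two separate models, both contexts here influence the same intermediate features inside a single layer.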

Editable i2t2i

  • VD supports image-to-text and text-to-image
  • Prototype experiment to edit images from text prompts
  • No masks are needed; the model automatically locates and substitutes objects
  • Output images do not match input images pixel by pixel
  • First group to conduct an image editing task that combines image-to-text, text editing, and text-to-image

Conclusion

  • Proposed a novel diffusion model, Versatile Diffusion, that handles text, image, and variations all in one
  • Proposed a multi-flow multimodal framework that can be extended to new tasks and domains
  • Experiments and applications demonstrate that VD performs well on all supported tasks
  • Core strategy of the disentanglement is to manipulate the 257×768 CLIP image context embedding
  • Split the embedding into one global vector and 256 local vectors
  • Major principal components of the matrix hold the style information, remaining principal components hold the semantic information
  • Generate image variations with style focuses from guidance of low-rank context embedding
  • Generate image variations with semantic focuses by removing major principal components from context embeddings
  • Dual-guided generation for VD is to generate images or sentences through guidance of both image context and prompt context
  • Mixing strategies: layer-level or context-level
  • Resolve conflict between contexts with attention-level mixing
  • Editable I2T2I application to modify latent text vectors
  • Limited latent space of Optimus VAE and imperfect data limit VD’s performance
  • Future research directions: expand scope and capacity of Optimus VAE, prepare finetuned dataset, prepare finetuned text VAE
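
The style and semantic disentanglement bullets above can be sketched with a singular value decomposition. The code below is a sketch under assumptions: the decomposition is applied to the 256 local vectors only, no mean-centering is done, the global vector is passed through unchanged, and the number of major components `k` is a hypothetical choice. A random matrix stands in for a real CLIP image context embedding.

```python
import numpy as np

def split_context(ctx):
    """Split a (257, 768) context into 1 global vector and 256 local vectors."""
    return ctx[:1], ctx[1:]

def principal_parts(local, k):
    """Return (rank-k reconstruction from the top-k components, remainder)."""
    u, s, vt = np.linalg.svd(local, full_matrices=False)
    major = (u[:, :k] * s[:k]) @ vt[:k]
    return major, local - major

def style_and_semantic_contexts(ctx, k):
    """Build a low-rank (style-focused) and a residual (semantic-focused) context."""
    global_vec, local = split_context(ctx)
    major, rest = principal_parts(local, k)
    style_ctx = np.concatenate([global_vec, major], axis=0)
    semantic_ctx = np.concatenate([global_vec, rest], axis=0)
    return style_ctx, semantic_ctx

rng = np.random.default_rng(0)
ctx = rng.standard_normal((257, 768))  # stand-in for a CLIP image context embedding
style_ctx, semantic_ctx = style_and_semantic_contexts(ctx, k=10)
```

Feeding `style_ctx` as guidance corresponds to the low-rank context embedding in the bullets, and `semantic_ctx` corresponds to removing the major principal components.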