Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent advances in diffusion models have set a milestone in many generation tasks
- New approaches focus on extensions and performance rather than capacity
- Versatile Diffusion (VD) is a multi-flow network that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model
- VD has competitive quality, novel extensions and applications, and provides more semantic insights of the generated outputs
- Code and models are open-sourced
Paper Content
Introduction
- Multi-modality is a challenge for computer vision and machine learning
- Deep learning has improved accuracy of traditional tasks
- Multi-modal research has focused on discriminative tasks
- Generative tasks of a large scope are challenging
- GAN research has focused on specific domains and tasks
- Diffusion models have been successful across many domains and tasks
- Diffusion models have robust training objectives
- Diffusion models have competitive performance
- Diffusion models have disadvantages such as data hunger and high inference costs
- Diffusion models have achieved smooth translation in cross-modal latent spaces
- Versatile Diffusion (VD) is introduced to solve text, images, and variations in one unified model
Related works
- Multi-modalities are unions of information with different forms, including vision, text, audio, etc.
- Early deep learning work learned a fused representation for audio and video.
- Zero-shot learning maps images on semantic space from which unseen category labels can be predicted.
- Multimodal approaches increase classification accuracy via multimodal training.
- Multimodal training is also used in detection and segmentation.
- VQA conducts cross-modal reasoning that transfers visual concepts into linguistic answers.
- Multimodal generative tasks are formalized as representation learning plus generation.
- Diffusion models consolidate multiple methods including VAEs, Markov chains, and score matching models.
- Recent works have improved sampling quality and efficiency for text-to-image generation.
Method
- Review fundamentals of diffusion models
- Present multi-flow multimodal framework
- Explain Versatile Diffusion (VD)
Diffusion basics
- Forward diffusion process is a Markov Chain with T steps that gradually degrade x 0 to x T with random Gaussian noises.
- Backward diffusion process is used to recover signal x 0 by removing the added Gaussian noises.
- Objective function to train a diffusion model is to minimize the variational bound for negative loglikelihood.
- In practice, many works assume deterministic α t and β t for step t in Equation 1.
Multi-flow multimodal diffusion framework
- Proposed framework is a multi-flow network with various types of data as input and context
- Framework closely follows Latent Diffusion Model (LDM) and Stable Diffusion (SD)
- Framework inherits merits of LDM/SD with interpretable latent space, modulized structure, and lower computation cost
- Framework is designed to jointly train multiple flows, each representing a crossmodal task
- Diffuser layers are grouped into global, data, and context layers
Versatile diffusion
- VD is a unified diffusion model for text-to-image, image-variation, image-to-text, and text-variation
- VD contains two full streams of VAEs, diffusers, and context encoders
- Diffuser uses UNet with cross attentions
- VAE uses Autoencoder-KL for image data and Optimus for text data
- Context encoder uses CLIP text and image encoders
Experiments
- Describes training data and settings for VD
- Shows performance of VD on supported tasks
- Introduces novel downstream applications enabled by VD
Dataset
- Used Laion2B-en as VD’s training dataset
- Laion2B-en is a collection of nearly two billion images with English captions
- Images and captions were filtered using criteria such as CLIP similarity, safety scores, watermark probability, aspect ratios, and image area
- Caption cleaning algorithm was used to train VD on image-to-text and text-variation tasks
Training
- Proposed multi-flow multimodal framework
- Trained VD with three settings: basic, dualcontext (DC), and official
- VD-basic is an image-variation model with a single-flow
- VD-DC is a two-flow model that supports text-to-image and image-variation
- VD-official is a four-flow model that includes two more tasks, i.e. image-to-text and text-variation
- Used pre-trained weights from SD checkpoint v1.4
- Different gradient multipliers for different layers and streams
- Initially trained on resolution 256 for 30 million samples and further trained on resolution 512 for 6.4 million samples
Performance
- Introduced multi-flow multimodal diffusion models
- Compared VD’s results with baseline models
- Conducted qualitative comparisons between different VD models
- Concluded that VD handles all subtasks well
Disentanglement of style and semantic
- VD can enhance or reduce image styles without supervision
- Exploring disentanglement between styles and semantics on images with arbitrary contents and styles
- Prior works explored similar properties in GAN latent spaces, but only on well-aligned data
- VD-DC and VD-official serve similar disentanglement performance, VD-basic has slightly decreased results
Dual-guided generation
- Dual-guided generation is a downstream application that VD supports.
- VD can generate outputs conditioned on both image and text.
- Model ensembling is a simple baseline, but results are unsatisfactory.
- VD can guide cross-modal conditionings on a deeper level.
- Attention-level mixing on VD yields the best performance.
Editable i2t2i
- VD supports image-to-text and text-to-image
- Prototype experiment to edit images from text prompts
- No masks needed, automatically locates and substitutes objects
- Output images do not match input images pixel by pixel
- First group to conduct image editing task combining image-to-text, text editing, and text-to-image
Conclusion
- Proposed a novel diffusion model, Versatile Diffusion, that handles text, image, and variations all in one
- Proposed a multi-flow multimodal framework that can be extended to new tasks and domains
- Experiments and applications demonstrate that VD performs well on all supported tasks
- Core strategy of the disentanglement is to manipulate the 257x768 CLIP image context embedding
- Split vector into global vector and 256 local vectors
- Major principal components of the matrix hold the style information, remaining principal components hold the semantic information
- Generate image variations with style focuses from guidance of low-rank context embedding
- Generate image variations with semantic focuses by removing major principal components from context embeddings
- Dual-guided generation for VD is to generate images or sentences through guidance of both image context and prompt context
- Mixing strategies: layer-level or context-level
- Resolve conflict between contexts with attention-level mixing
- Editable I2T2I application to modify latent text vectors
- Limited latent space of Optimus VAE and imperfect data limit VD’s performance
- Future research directions: expand scope and capacity of Optimus VAE, prepare finetuned dataset, prepare finetuned text VAE