Abstract
- Proposes a unified diffusion framework to fit multi-modal data in one model
- Learns diffusion models for marginal, conditional, and joint distributions by predicting the noise in the perturbed data
- Perturbation levels can be different for different modalities
- Learns all distributions simultaneously with a minimal modification to the original diffusion model
- Implemented on large-scale paired image-text data
- Able to perform image, text, text-to-image, image-to-text, and image-text pair generation
- Produces perceptually realistic samples in all tasks
- Quantitative results are superior to existing general-purpose models and comparable to bespoke models
Paper Content
Introduction
- Content-creation revolution driven by advances in generative modeling
- Diffusion models create high-fidelity and diverse data
- Humans can generate multi-modal content simultaneously
- Unified training framework needed to cover all types of multi-modal generative tasks
- Probabilistic modeling used to fit relevant distributions
- UniDiffuser framework proposed to fit all distributions in one model
- UniDiffuser uses transformer-based backbone
- UniDiffuser able to perform image, text, text-to-image, image-to-text, and image-text pair generation
- UniDiffuser produces perceptually realistic samples in all tasks
Background
- Diffusion models perturb data by injecting noise
- The noise-injection process is formalized as a Markov chain
- Data can be generated by reversing the process
- Optimal mean is estimated by a noise prediction network
- Classifier-free guidance improves sample quality of a conditional diffusion model
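
The perturbation step described above can be sketched as follows. This is a minimal illustration, assuming a DDPM-style variance-preserving process with a linear beta schedule; the schedule, its constants, and the names `make_alpha_bars` and `perturb` are assumptions made for the example and are not taken from the summary.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def perturb(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps (t is 1-indexed)."""
    abar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
alpha_bars = make_alpha_bars()
x0 = [1.0, -0.5, 0.25]
xt, eps = perturb(x0, t=500, alpha_bars=alpha_bars, rng=rng)
# A noise prediction network is trained so that its output for (xt, t) approximates eps.
```

Because `xt` is a known linear mix of `x0` and `eps`, an accurate noise estimate is enough to recover an estimate of the clean data, which is what the reverse process relies on.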
Method
- UniDiffuser is a single diffusion model to capture marginal, conditional, and joint distributions determined by multi-modal data
- UniDiffuser can be extended to more modalities
- UniDiffuser is able to capture all relevant distributions determined by two modalities of data
- Learning these distributions is equivalent to estimating conditional expectations over the noise
- UniDiffuser employs a joint noise prediction network to predict the noise injected to two modalities
- UniDiffuser uses a transformer-based network
- UniDiffuser can perform unconditional, conditional, and joint sampling
- UniDiffuser is more efficient than learning a single joint distribution
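
A minimal sketch of the joint noise prediction objective, assuming each modality draws its own timestep independently and the network regresses the concatenated noise. The toy schedule, the constant `T`, and the names `joint_loss`, `timesteps_for`, and `zero_net` are hypothetical. The timestep table for sampling reflects our reading of the approach (a clean modality uses timestep 0, an ignored modality uses timestep `T`) and is not stated in the bullets above.

```python
import math
import random

T = 1000  # assumed number of diffusion steps

def perturb(z0, t, rng):
    """Toy schedule (assumption): abar_t = 1 - t / T, so t = T is pure noise."""
    abar = 1.0 - t / T
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(abar) * z + math.sqrt(1.0 - abar) * e for z, e in zip(z0, eps)]
    return zt, eps

def joint_loss(eps_net, x0, y0, rng):
    """Mean squared error between predicted and injected noise for both modalities."""
    t_x = rng.randint(1, T)  # perturbation level of the image
    t_y = rng.randint(1, T)  # independent perturbation level of the text
    x_t, eps_x = perturb(x0, t_x, rng)
    y_t, eps_y = perturb(y0, t_y, rng)
    pred_x, pred_y = eps_net(x_t, y_t, t_x, t_y)
    target = eps_x + eps_y               # list concatenation
    pred = list(pred_x) + list(pred_y)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def timesteps_for(task, t):
    """(t_x, t_y) fed to the network at sampling step t, per task (our reading)."""
    return {
        "joint": (t, t),
        "x_given_y": (t, 0),   # text is clean conditioning
        "y_given_x": (0, t),   # image is clean conditioning
        "x_marginal": (t, T),  # text is pure noise and carries no information
        "y_marginal": (T, t),
    }[task]

def zero_net(x_t, y_t, t_x, t_y):
    """Hypothetical stand-in for the trained network: always predicts zero noise."""
    return [0.0] * len(x_t), [0.0] * len(y_t)

rng = random.Random(0)
loss = joint_loss(zero_net, x0=[0.5, -1.0], y0=[2.0], rng=rng)
```

Under this reading, one set of weights serves every task; only the pair of timesteps passed to the network changes.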
Classifier-free guidance for free
- CFG combines a conditional and an unconditional model linearly during sampling.
- CFG improves sample quality and image-text alignment in diffusion models.
- CFG is applicable to UniDiffuser without modifying the training process.
- CFG is applicable to joint sampling.
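
The linear combination used by CFG can be sketched for text-to-image sampling as below. It assumes the common form `(1 + s) * conditional - s * unconditional`, and it assumes that the unconditional estimate comes from the same network with the text replaced by Gaussian noise at the maximum timestep. The names `cfg_combine`, `guided_image_noise`, and `toy_net` are hypothetical.

```python
import random

T = 1000  # assumed number of diffusion steps

def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: (1 + s) * conditional - s * unconditional."""
    return [(1.0 + scale) * c - scale * u for c, u in zip(eps_cond, eps_uncond)]

def guided_image_noise(eps_net, x_t, y0, t, scale, rng):
    """Guided noise estimate for the image, using one network for both terms."""
    eps_x_cond, _ = eps_net(x_t, y0, t, 0)           # clean text: conditional model
    y_noise = [rng.gauss(0.0, 1.0) for _ in y0]
    eps_x_uncond, _ = eps_net(x_t, y_noise, t, T)    # fully noised text: marginal model
    return cfg_combine(eps_x_cond, eps_x_uncond, scale)

def toy_net(x_t, y_t, t_x, t_y):
    """Hypothetical stand-in: clean text (t_y == 0) shifts the image noise estimate."""
    shift = 1.0 if t_y == 0 else 0.0
    return [shift for _ in x_t], [0.0 for _ in y_t]

guided = guided_image_noise(toy_net, [0.0, 0.0], [0.3], t=500, scale=2.0,
                            rng=random.Random(0))
print(guided)  # [3.0, 3.0]
```

With `scale=0.0` the result reduces to the plain conditional estimate, so the guidance strength can be tuned at sampling time without retraining.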
Encoding images and texts into latent space
- Image encoder-decoder consists of an image autoencoder and image CLIP
- Text encoder-decoder consists of text CLIP and GPT-2
- Text CLIP outputs 77 vectors of 768 dimensions each; each vector is reduced to 64 dimensions to form the final text embedding
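
The shapes involved in the text embedding can be illustrated with a short sketch. It assumes the reduction is a single linear map applied to every token vector; the random weights below are a stand-in for a learned layer, and `make_linear` and `project` are hypothetical names.

```python
import random

NUM_TOKENS, CLIP_DIM, LATENT_DIM = 77, 768, 64

def make_linear(in_dim, out_dim, rng):
    """Random weights as out_dim rows of length in_dim (stand-in for a learned layer)."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]

def project(tokens, weight):
    """Apply the same linear map to every token vector."""
    return [[sum(w * v for w, v in zip(row, tok)) for row in weight] for tok in tokens]

rng = random.Random(0)
# Placeholder for the text CLIP output: 77 token vectors of 768 dimensions.
clip_out = [[rng.gauss(0.0, 1.0) for _ in range(CLIP_DIM)] for _ in range(NUM_TOKENS)]
weight = make_linear(CLIP_DIM, LATENT_DIM, rng)
text_latent = project(clip_out, weight)
print(len(text_latent), len(text_latent[0]))  # 77 64
```

The sequence length stays at 77 while the per-token width shrinks from 768 to 64, which is the representation the diffusion model operates on for text.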
Transformer as joint noise prediction network
- Train a joint noise prediction network on embeddings
- Employ transformer-based backbone (U-ViT)
- Treat data and timesteps as tokens
- Use post-layer normalization and add layer normalization after long skip connection
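
The token layout and normalization placement can be sketched as follows. This is a simplified illustration, not the U-ViT implementation: the sublayer acts on each token separately (real attention mixes tokens), layer normalization has no learned scale or bias, and `build_sequence`, `post_ln_block`, and `long_skip` are hypothetical names.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def build_sequence(t_x_tok, t_y_tok, image_toks, text_toks):
    """Timesteps and data are all treated as tokens in one sequence."""
    return [t_x_tok, t_y_tok] + list(image_toks) + list(text_toks)

def post_ln_block(tokens, sublayer):
    """Post-LN residual: normalize after adding the sublayer output."""
    return [layer_norm([a + b for a, b in zip(tok, sublayer(tok))]) for tok in tokens]

def long_skip(deep, shallow, merge):
    """Concatenate deep and shallow features per token, merge, then layer-normalize."""
    return [layer_norm(merge(d + s)) for d, s in zip(deep, shallow)]

image_toks = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 1.5, -1.5]]
text_toks = [[2.0, 0.0, -2.0, 1.0]]
seq = build_sequence([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1], image_toks, text_toks)

toy_mlp = lambda tok: [0.5 * v for v in tok]                     # stand-in sublayer
toy_merge = lambda v: [a + b for a, b in zip(v[:4], v[4:])]      # stand-in linear merge
hidden = post_ln_block(seq, toy_mlp)
merged = long_skip(hidden, seq, toy_merge)
```

In a pre-LN block the normalization would be applied to the sublayer input instead, leaving the residual stream unnormalized; the post-LN variant shown here normalizes the stream itself after every block.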
Related work
Experiments
- UniDiffuser can perform multiple generation tasks
- UniDiffuser is compared to existing large models
- UniDiffuser supports data variation, blocked Gibbs sampling, and interpolation between images
Setup
- Used three subsets of LAION-5B dataset
- Fine-tuned model with 200K steps at 512x512 resolution
- Trained 220K steps at 512x512 resolution
- Used AdamW optimizer with learning rate of 2e-4 and weight decay of 0.03
- Compared to Versatile Diffusion (VD)
- Reported FID and CLIP score on MS-COCO validation set for text-to-image generation
- Reported CLIP score on randomly drawn 10K images for image-to-text generation
Main results
- UniDiffuser outperforms Versatile Diffusion (VD) in both text-to-image and image-to-text generation
- UniDiffuser is simpler, more efficient and more general than VD
- UniDiffuser is comparable to bespoke diffusion models for text-to-image generation and outperforms well-known diffusion models
- UniDiffuser is capable of joint, conditional, and unconditional generation
Data variation and Gibbs sampling
- UniDiffuser supports applications such as image and text variation
- Examples of image and text variation are presented in Figure 1 (f-g)
- Blocked Gibbs sampling can be used to see how images and texts are translated to each other, examples in Figure 1 (h)
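
Blocked Gibbs sampling alternates between the two conditional samplers, starting from an image. A minimal sketch of the loop, where the two sampler callables are hypothetical placeholders for image-to-text and text-to-image generation:

```python
def blocked_gibbs(x_init, sample_text_given_image, sample_image_given_text, num_rounds):
    """Alternate y ~ p(y | x) and x ~ p(x | y), recording the whole chain."""
    chain = [("image", x_init)]
    x = x_init
    for _ in range(num_rounds):
        y = sample_text_given_image(x)   # image-to-text step
        chain.append(("text", y))
        x = sample_image_given_text(y)   # text-to-image step
        chain.append(("image", x))
    return chain

# Toy deterministic stand-ins so the loop can be run without a model.
caption = lambda image: "caption(" + image + ")"
render = lambda text: "image(" + text + ")"

chain = blocked_gibbs("img0", caption, render, num_rounds=2)
for kind, value in chain:
    print(kind, value)
```

Inspecting the recorded chain shows how an image and its text drift as they are translated back and forth.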
Interpolation between two images in the wild
- UniDiffuser can interpolate between two images.
- Image-to-text generation is used to obtain latent text embeddings of the two images.
- DPM-Solver is used with the same Gaussian noise as the initial state for both images.
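
The bullets above do not say which interpolation rule is applied to the latents. Spherical linear interpolation (slerp) is a common choice for Gaussian-like latents and is assumed in this sketch; the function name and the fallback to linear interpolation for nearly collinear inputs are choices made for the example.

```python
import math

def slerp(a, b, weight):
    """Spherical linear interpolation between vectors a and b, weight in [0, 1]."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly collinear vectors: fall back to plain linear interpolation.
        return [(1.0 - weight) * p + weight * q for p, q in zip(a, b)]
    coef_a = math.sin((1.0 - weight) * omega) / sin_omega
    coef_b = math.sin(weight * omega) / sin_omega
    return [coef_a * p + coef_b * q for p, q in zip(a, b)]

# Hypothetical two-dimensional latents standing in for the two embeddings.
latent_a, latent_b = [1.0, 0.0], [0.0, 1.0]
midpoint = slerp(latent_a, latent_b, 0.5)
```

Sweeping `weight` from 0 to 1 and decoding each interpolated latent would give the intermediate images; unlike linear interpolation, slerp keeps the norm of unit-norm inputs constant along the path.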
Conclusion
- Proposed UniDiffuser, a general-purpose multi-modal probabilistic framework
- Able to perform various generation tasks with minimal modification
- Empirical results show effectiveness compared to existing models
- Enables semi-supervised learning and learning on more modalities
- Generated text is not always fluent because the training data is noisy
- Can advance real-world applications with generated content
- Generated images are watermarked and a protocol is provided to mitigate the deepfake problem
- Algorithms 1-5 present training and sampling procedures
- Finetune GPT-2 text decoder on LAION-2B-en dataset
- With an embedding dimension of 64 for y0, the BLEU-1 score is 0.969 and the BLEU-4 score is 0.894
- Examples of joint generation, text-to-image, image-to-text, unconditional image and text generation, image and text variation, blocked Gibbs sampling and interpolation
- When results obtained with the same CFG scale are connected, UniDiffuser outperforms VD