Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposes a unified diffusion framework to fit multi-modal data in one model
  • Learns diffusion models for marginal, conditional, and joint distributions by predicting the noise in the perturbed data
  • Perturbation levels can be different for different modalities
  • Learns all distributions simultaneously with a minimal modification to the original diffusion model
  • Implemented on large-scale paired image-text data
  • Able to perform image, text, text-to-image, image-to-text, and image-text pair generation
  • Produces perceptually realistic samples in all tasks
  • Quantitative results are superior to existing general-purpose models and comparable to bespoke models

Paper Content

Introduction

  • Content-creation revolution driven by advances in generative modeling
  • Diffusion models create high-fidelity and diverse data
  • Humans can generate multi-modal content simultaneously
  • Unified training framework needed to cover all types of multi-modal generative tasks
  • Probabilistic modeling used to fit relevant distributions
  • UniDiffuser framework proposed to fit all distributions in one model
  • UniDiffuser uses transformer-based backbone
  • UniDiffuser able to perform image, text, text-to-image, image-to-text, and image-text pair generation
  • UniDiffuser produces perceptually realistic samples in all tasks

Background

  • Diffusion models perturb data by injecting noise
  • The noise-injection process is formalized as a Markov chain
  • Data can be generated by reversing the process
  • The optimal mean of each reverse step is estimated by a noise prediction network
  • Classifier-free guidance improves sample quality of a conditional diffusion model
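
As a minimal sketch of the perturbation and noise-prediction bullets above, the snippet below perturbs a toy data vector with Gaussian noise and scores a predictor by mean squared error against the injected noise. The closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps is the standard DDPM parameterization. The linear beta schedule, the step count, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_s) for s = 1..t (assumed linear schedule)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def perturb(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form; also return the injected noise."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * xi + math.sqrt(1.0 - a) * ei for xi, ei in zip(x0, eps)]
    return xt, eps

def noise_mse(eps_pred, eps):
    """Mean squared error between predicted and injected noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
xt, eps = perturb(x0, t=500, rng=rng)
# A predictor that returned exactly the injected noise would reach zero loss.
perfect_loss = noise_mse(eps, eps)
```

In a real model the predictor is a neural network that receives `xt` and `t`; here the true noise is passed back in only to show what the loss measures.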

Method

  • UniDiffuser is a single diffusion model to capture marginal, conditional, and joint distributions determined by multi-modal data
  • UniDiffuser can be extended to more modalities
  • UniDiffuser is able to capture all relevant distributions determined by two modalities of data
  • Fitting all of these distributions is equivalent to estimating a conditional expectation over the noise
  • UniDiffuser employs a joint noise prediction network to predict the noise injected to two modalities
  • UniDiffuser uses a transformer-based network
  • UniDiffuser can perform unconditional, conditional, and joint sampling
  • UniDiffuser is more efficient than learning a single joint distribution
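
The bullets above can be condensed into one training step: draw an independent timestep for each modality, perturb both, and regress the noise injected into both. The sketch below uses toy vectors and a placeholder predictor. `joint_training_loss`, `toy_predictor`, `timesteps_for`, the value of `T`, and the cosine-shaped `alpha_bar` are hypothetical names and choices for illustration. The timestep table reflects the usual reading of the method: timestep 0 means a modality is given clean as the condition, the maximum timestep means it is pure noise and therefore marginalized out, and equal timesteps give the joint distribution.

```python
import math
import random

T = 1000  # assumed number of diffusion steps

def alpha_bar(t):
    """Assumed signal level: 1 at t = 0 (clean data), about 0 at t = T (pure noise)."""
    return math.cos(0.5 * math.pi * t / T) ** 2

def perturb(v0, t, rng):
    """Return the noisy vector at timestep t together with the injected noise."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in v0]
    vt = [math.sqrt(a) * v + math.sqrt(1.0 - a) * e for v, e in zip(v0, eps)]
    return vt, eps

def joint_training_loss(predictor, x0, y0, rng):
    """Noise-prediction loss with an independent timestep per modality."""
    tx = rng.randint(1, T)  # perturbation level for the image latent
    ty = rng.randint(1, T)  # perturbation level for the text latent
    xt, eps_x = perturb(x0, tx, rng)
    yt, eps_y = perturb(y0, ty, rng)
    pred_x, pred_y = predictor(xt, yt, tx, ty)
    target = eps_x + eps_y  # list concatenation: noise of both modalities
    pred = pred_x + pred_y
    return sum((p - e) ** 2 for p, e in zip(pred, target)) / len(target)

def toy_predictor(xt, yt, tx, ty):
    """Placeholder for the joint noise prediction network: always predicts zero."""
    return [0.0] * len(xt), [0.0] * len(yt)

def timesteps_for(task, t):
    """Timestep pair (t_x, t_y) fed to the network for each sampling task."""
    return {
        "joint": (t, t),          # both modalities denoised together
        "text_to_image": (t, 0),  # clean text acts as the condition
        "image_to_text": (0, t),  # clean image acts as the condition
        "image_only": (t, T),     # text input is pure noise, so it is marginalized
        "text_only": (T, t),      # image input is pure noise, so it is marginalized
    }[task]

loss = joint_training_loss(toy_predictor, [0.1, 0.2, 0.3], [1.0, -1.0], random.Random(0))
```

Because a single network sees every combination of timesteps during training, no separate model or extra loss term is needed for each task in this sketch.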

Classifier-free guidance for free

  • CFG combines a conditional and an unconditional model linearly during sampling.
  • CFG improves sample quality and image-text alignment in diffusion models.
  • CFG is applicable to UniDiffuser without modifying the training process.
  • CFG is applicable to joint sampling.
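
To illustrate why guidance comes "for free", the sketch below builds a guided text-to-image noise estimate from two calls to the same joint predictor: one with clean text (text timestep 0) and one with the text replaced by Gaussian noise (text timestep T). The `(1 + scale)` and `scale` weights follow one common convention for classifier-free guidance. The `predictor` signature, `toy_predictor`, and the value of `T` are assumptions made for this example.

```python
import random

T = 1000  # assumed maximum timestep, at which an input is treated as pure noise

def cfg_text_to_image(predictor, xt, y0, t, scale, rng):
    """Guided image-noise estimate for text-to-image sampling.

    predictor(x, y, t_x, t_y) -> (eps_x, eps_y) stands in for the joint
    noise prediction network.
    """
    # Conditional branch: clean text (t_y = 0) conditions the image noise.
    cond_x, _ = predictor(xt, y0, t, 0)
    # Unconditional branch: text replaced by Gaussian noise (t_y = T).
    noise_y = [rng.gauss(0.0, 1.0) for _ in y0]
    uncond_x, _ = predictor(xt, noise_y, t, T)
    # Linear combination of the conditional and unconditional estimates.
    return [(1.0 + scale) * c - scale * u for c, u in zip(cond_x, uncond_x)]

def toy_predictor(x, y, tx, ty):
    """Hypothetical predictor that uses the text only when it is clean (t_y = 0)."""
    shift = sum(y) / len(y) if ty == 0 else 0.0
    return [xi + shift for xi in x], list(y)

rng = random.Random(0)
guided = cfg_text_to_image(toy_predictor, [1.0, 2.0], [0.5, 1.5], t=400, scale=2.0, rng=rng)
unguided = cfg_text_to_image(toy_predictor, [1.0, 2.0], [0.5, 1.5], t=400, scale=0.0, rng=rng)
```

With `scale=0.0` the result reduces to the plain conditional estimate; larger scales push the estimate further away from the unconditional one.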

Encoding images and texts into latent space

  • Image encoder-decoder consists of an image autoencoder and image CLIP
  • Text encoder-decoder consists of text CLIP and GPT-2
  • Text CLIP outputs 77 vectors of 768 dimensions each; each vector is reduced to 64 dimensions for the final text embedding

Transformer as joint noise prediction network

  • Train a joint noise prediction network on embeddings
  • Employ transformer-based backbone (U-ViT)
  • Treat data and timesteps as tokens
  • Use post-layer normalization and add layer normalization after long skip connection

Experiments

  • UniDiffuser can perform multiple generation tasks
  • UniDiffuser is compared to existing large models
  • UniDiffuser supports data variation, blocked Gibbs sampling, and interpolation between images

Setup

  • Used three subsets of LAION-5B dataset
  • Fine-tuned the model for 200K steps at 512x512 resolution
  • Then trained for a further 220K steps at 512x512 resolution
  • Used AdamW optimizer with learning rate of 2e-4 and weight decay of 0.03
  • Compared to Versatile Diffusion (VD)
  • Reported FID and CLIP score on MS-COCO validation set for text-to-image generation
  • Reported CLIP score on randomly drawn 10K images for image-to-text generation
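
For readers unfamiliar with the optimizer setting listed above, the sketch below shows a single AdamW update for one scalar parameter, using the stated learning rate (2e-4) and weight decay (0.03). The beta and epsilon values are common defaults assumed here, not values taken from the summary, and `adamw_step` is a hypothetical helper rather than any library's API.

```python
import math

def adamw_step(param, grad, m, v, step, lr=2e-4, weight_decay=0.03,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter, with decoupled weight decay."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages (step counts from 1).
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # Decoupled weight decay: shrink the weight directly, not through the gradient.
    param = param - lr * weight_decay * param
    # Adaptive gradient step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, only the weight decay term changes the parameter.
decayed, _, _ = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, step=1)
# With a unit gradient on a zero weight, the first step has magnitude close to lr.
stepped, _, _ = adamw_step(param=0.0, grad=1.0, m=0.0, v=0.0, step=1)
```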

Main results

  • UniDiffuser outperforms Versatile Diffusion (VD) in both text-to-image and image-to-text generation
  • UniDiffuser is simpler, more efficient and more general than VD
  • UniDiffuser is comparable to bespoke diffusion models for text-to-image generation and outperforms some well-known diffusion models
  • UniDiffuser is capable of joint, conditional, and unconditional generation

Data variation and Gibbs sampling

  • UniDiffuser supports applications such as image and text variation
  • Examples of image and text variation are presented in Figure 1 (f-g)
  • Blocked Gibbs sampling can be used to see how images and texts are translated to each other, examples in Figure 1 (h)
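
Blocked Gibbs sampling, as mentioned above, alternates between the two conditional samplers so that each image is translated to a text and back again. The loop below is a minimal sketch: `blocked_gibbs` and the two toy samplers are hypothetical stand-ins, with plain floats playing the role of image and text latents.

```python
import random

def blocked_gibbs(sample_text_given_image, sample_image_given_text, image, num_rounds):
    """Alternate image-to-text and text-to-image sampling, recording the chain."""
    chain = [("image", image)]
    for _ in range(num_rounds):
        text = sample_text_given_image(image)   # draw text given the current image
        chain.append(("text", text))
        image = sample_image_given_text(text)   # draw image given the new text
        chain.append(("image", image))
    return chain

rng = random.Random(0)
# Toy stand-ins: a float plays the role of each latent, and "sampling" adds noise.
toy_image_to_text = lambda image: image + rng.gauss(0.0, 0.1)
toy_text_to_image = lambda text: text + rng.gauss(0.0, 0.1)

chain = blocked_gibbs(toy_image_to_text, toy_text_to_image, image=0.0, num_rounds=3)
```

In an actual pipeline each stand-in would be a full conditional sampling run of the model, and the recorded chain is what gets visualized.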

Interpolation between two images in the wild

  • UniDiffuser can interpolate between two images.
  • Image-to-text generation is used to obtain latent text embeddings of the two images.
  • DPM-Solver is used with the same Gaussian noise as the initial state for both images.
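
The bullets above do not spell out the interpolation rule itself. A common choice for Gaussian-like latents is spherical linear interpolation (slerp), sketched below as an assumption; the `slerp` helper and the toy vectors are illustrative only. Each interpolated latent would then be decoded starting from the same initial Gaussian noise, so that differences between the outputs come from the interpolated latents rather than from the noise.

```python
import math

def slerp(a, b, w):
    """Spherical linear interpolation between vectors a and b, with w in [0, 1].

    Exactly antiparallel inputs are not handled in this sketch.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return [(1.0 - w) * x + w * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa = math.sin((1.0 - w) * theta) / s
    wb = math.sin(w * theta) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Toy latents standing in for the embeddings of the two images.
latent_a = [1.0, 0.0]
latent_b = [0.0, 1.0]
midpoint = slerp(latent_a, latent_b, 0.5)
path = [slerp(latent_a, latent_b, i / 4) for i in range(5)]
```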

Conclusion

  • Proposed UniDiffuser, a general-purpose multi-modal probabilistic framework
  • Able to perform various generation tasks with minimal modification
  • Empirical results show effectiveness compared to existing models
  • Enables semi-supervised learning and learning on more modalities
  • Generated text is not always smooth, mainly because the text data is noisy
  • Can advance real-world applications with generated content
  • Generated images are watermarked and a protocol is to be provided to mitigate the deepfake problem
  • Algorithms 1-5 present training and sampling procedures
  • Finetune GPT-2 text decoder on LAION-2B-en dataset
  • With an embedding dimension of 64 for y0, the reported BLEU-1 score is 0.969 and the BLEU-4 score is 0.894
  • Examples of joint generation, text-to-image, image-to-text, unconditional image and text generation, image and text variation, blocked Gibbs sampling and interpolation
  • When results obtained at the same CFG scale are connected, UniDiffuser outperforms VD