Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Language-guided image generation has been successful using diffusion models.
  • Texts can be too vague to accurately describe specific subjects.
  • UMM-Diffusion takes joint texts and images as input and generates customized images.
  • Input images are projected to a pseudo word embedding and combined with the text to guide image generation.
  • A sampling technique for diffusion models is used to eliminate irrelevant parts of the input images.
  • Leverages a pre-trained text-to-image generator and image encoder to generate high-quality images with complex semantics.

Paper Content

Introduction

  • Synthesizing images with customized subjects is difficult
  • Input data format is text and images
  • Existing text-to-image models cannot achieve this goal
  • Recent methods use a special text token to represent the input subject
  • These methods require considerable time and computing resources, as well as high-resolution images
  • Proposed method takes joint texts and images in one sequence as input
  • Encodes them into one unified multi-modal latent space
  • Novel sampling technique of diffusion models to mitigate overfitting problem
  • Results of multi-modal guidance and pure text guidance fused in one denoising step

Compositing subjects into scenes

  • Extracting a subject from an image and compositing it into a new scene is a common scenario in image editing.
  • Traditional methods use statistical features to align the subjects and the new scenes.
  • Deep learning methods use deep features to achieve better performance.

Text conditional image generation

  • Generating semantically aligned images conditioned by texts has drawn attention
  • Early methods used GANs and were trained on limited data
  • Recent methods use pre-trained text encoders and large-scale generators
  • Diffusion models have developed rapidly and generate high quality and diverse images
  • Diffusion-based text-to-image generation methods have achieved impressive results
  • These methods require large datasets and great computing resources
  • However, they are still limited in synthesizing personalized results

Customized image generation

  • Several methods have been proposed to provide more control of the image generation process.
  • IC-GAN uses the representation of certain instances as conditions to generate new images.
  • TI and DreamBooth can take both texts and subjects provided by images as conditions to create new images.
  • TI and DreamBooth require a lot of time and computing resources.

Unified multi-modal latent diffusion

  • Aim to generate high-quality images semantically aligned with input texts.
  • Encode texts and images into a unified multi-modal latent embedding.
  • Challenges include: different modalities, irrelevant information, lack of data.
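
The idea of encoding text and images into one sequence can be sketched as follows. This is a minimal illustration, not the paper's encoder: the placeholder token `<img>`, the plain linear projection, and all function names are assumptions made for the example. Each placeholder in the token sequence is replaced by an image feature projected into the word-embedding space (a "pseudo word").

```python
def project(image_feature, weight):
    """Linear map from image-feature space to word-embedding space.

    `weight` is a list of rows, one row per output dimension (hypothetical).
    """
    return [sum(w * f for w, f in zip(row, image_feature)) for row in weight]


def build_multimodal_sequence(tokens, word_embeddings, image_features, weight,
                              placeholder="<img>"):
    """Replace each placeholder token with a projected image feature.

    Ordinary tokens are looked up in `word_embeddings`; the i-th placeholder
    consumes the i-th entry of `image_features`.
    """
    images = iter(image_features)
    sequence = []
    for token in tokens:
        if token == placeholder:
            sequence.append(project(next(images), weight))
        else:
            sequence.append(word_embeddings[token])
    return sequence
```

The resulting sequence has one embedding per token, so it can be passed to a text encoder in place of ordinary word embeddings.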

Preliminaries

  • Diffusion models map arbitrary data distribution to Gaussian distribution by adding noise
  • Reverse process uses a noise-prediction model to recover the noise in the noisy image at step t
  • Additional condition (e.g. captions) can be provided to noise-prediction model
  • Text encoding is vital to image quality and semantic alignment
  • CLIP is a widely used encoder for extracting semantic representation from texts and images
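
The forward noising step and the noise-prediction objective can be illustrated with a small sketch. It follows the standard DDPM formulation; the linear beta schedule and its endpoints are assumptions for illustration, and it operates on a plain list of numbers rather than the image latents used in practice.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise.

    Returns the noisy vector and the noise that was added.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


def noise_prediction_loss(predicted, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

A conditional model would receive the noisy sample, the step `t`, and the condition (for example a caption embedding), and be trained to minimize this loss.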

Fusing sampling technique

  • TIUE takes the whole image as input, including both the subject and the irrelevant background.
  • Overfitting is a challenge that needs to be addressed.
  • Text input is used to help solve the overfitting problem.
  • A fusing sampling technique is proposed to combine the multimodal guidance and pure text guidance.
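
One way to picture the fusing step is the sketch below. It assumes the two noise predictions are linearly interpolated with the fuse ratio and that the result is then used in classifier-free guidance; the exact formula in the paper may differ, and the direction of `alpha` (here it weights the multimodal branch) is also an assumption. Function names are hypothetical.

```python
def fuse_predictions(eps_multimodal, eps_text, alpha=0.5):
    """Interpolate between multimodal-guided and text-only noise predictions.

    Assumption: alpha = 1 uses only the multimodal branch, alpha = 0 only text.
    """
    return [alpha * m + (1.0 - alpha) * t
            for m, t in zip(eps_multimodal, eps_text)]


def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Standard classifier-free guidance: push away from the unconditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def fused_guided_noise(eps_uncond, eps_multimodal, eps_text,
                       alpha=0.5, scale=7.5):
    """Fuse the two conditional predictions, then apply guidance in one denoising step."""
    fused = fuse_predictions(eps_multimodal, eps_text, alpha)
    return classifier_free_guidance(eps_uncond, fused, scale)
```

Under this sketch, a lower `alpha` leans on the text-only prediction, which is what would counteract overfitting to the background of the input image.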

Dataset building and model initialization

  • Train model using LAION-400M dataset
  • Automatically crop sub-images from images in dataset
  • Retrieve labels of sub-images from captions
  • Apply rule-based filtering, reducing the dataset to 1.8M training sets
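
The data pipeline above might look roughly like the sketch below. The summary does not state the actual filtering rules, so the two rules used here (the label must appear in the caption, and the crop must cover a bounded fraction of the image) and their thresholds are purely hypothetical. Detections are assumed to come from some external object detector as `(label, box)` pairs.

```python
def box_area_ratio(box, image_size):
    """Fraction of the image covered by a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    width, height = image_size
    return max(0, right - left) * max(0, bottom - top) / (width * height)


def build_training_sets(caption, image_size, detections,
                        min_ratio=0.05, max_ratio=0.8):
    """Keep crops whose label occurs in the caption and whose size is reasonable.

    Both rules and both thresholds are illustrative assumptions.
    """
    words = set(caption.lower().split())
    kept = []
    for label, box in detections:
        if label.lower() not in words:
            continue  # label cannot be retrieved from the caption
        if not (min_ratio <= box_area_ratio(box, image_size) <= max_ratio):
            continue  # crop is too small or covers nearly the whole image
        kept.append({"caption": caption, "label": label, "box": box})
    return kept
```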

Experiments

Implementation details

  • Trained TIUE for 200k iterations and then trained the whole model for another 200k iterations
  • Batch size of 192 and learning rate of 1e-5 for both phases
  • Used EMA rate of 0.9999 to stabilize training process
  • Fuse ratio α set to 0.5 during inference
  • First to design unified encoder that takes joint subjects from images and texts for image generation
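
The two-phase schedule and the EMA of the weights can be sketched as below. The module names `tiue` and `noise_predictor` are placeholders for illustration; the summary only says that TIUE is trained first and the whole model afterwards. Parameters are represented as a flat list of floats.

```python
def trainable_modules(iteration, phase_length=200_000):
    """Return which (hypothetically named) modules are updated at this iteration."""
    if iteration < phase_length:
        return ("tiue",)                    # phase 1: encoder only
    return ("tiue", "noise_predictor")      # phase 2: whole model


def ema_update(ema_params, params, rate=0.9999):
    """Exponential moving average of parameters, applied after each optimizer step."""
    return [rate * e + (1.0 - rate) * p for e, p in zip(ema_params, params)]
```

With a rate of 0.9999, each step moves the averaged weights only 0.01% of the way toward the current weights, which smooths out step-to-step fluctuations.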

Applications

  • Generates images described by input text and containing objects from input images
  • Can generate diverse novel views of target subject
  • Style of result can be assigned by input text or image
  • Model can disentangle information of text and image
  • Allows users to provide multiple images in one input set

Comparisons

  • Proposed unified encoder to achieve multi-modality image generation
  • Designed text-to-image baselines to compare
  • Compared to state-of-the-art few-shot finetuning-based method
  • No online training cost
  • Results show better quality without a time-consuming finetuning process
  • Varying alpha in the fusing sampling technique shows the trade-off it controls
  • Alpha of 0.5 leads to better trade-off between input texts and images
  • Drawbacks of method: multi-object decomposition and rare/highly-factitious subjects
  • UMM-Diffusion framework for joint subject and text conditional image generation
  • CLIP Encoders used to extract semantics of texts and images
  • Fusing sampling technique used as guidance of generation process
  • Model initialized by pre-trained text-to-image generation model
  • Experiments demonstrate efficiency of method
  • Training on 32 NVIDIA V100 32G GPUs for 7 days
  • Adam optimizer used during training
  • Scale factor of classifier-free guidance set to 7.5
  • CLIP encoders are ViT-L/14 based
  • Noise-prediction model architecture is the same as Stable Diffusion v1-5
  • Results show tolerance to low-quality input images
  • Results show ability to synthesize images with customized style
  • Visualization of pseudo word embeddings shows that the model extracts similar embeddings from images of one class and different embeddings from images of different classes