Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Language-guided image generation has been successful using diffusion models.
Texts can be too vague to accurately describe specific subjects.
UMM-Diffusion takes joint texts and images as input and generates customized images.
Input images are projected to pseudo word embeddings and combined with the text to guide image generation.
A sampling technique for diffusion models is used to eliminate irrelevant parts of the input images.
The method leverages a pre-trained text-to-image generator and image encoder to generate high-quality images with complex semantics.

Paper Content

Introduction

Synthesizing images with customized subjects is difficult
Input data format is text and images
Existing text-to-image models cannot achieve this goal
Recent methods use a special text token to represent the input subject
These methods require substantial time and computing resources, as well as high-resolution images
Proposed method takes joint texts and images in one sequence as input
Encodes them into one unified multi-modal latent space
Novel sampling technique of diffusion models to mitigate overfitting problem
Results of multi-modal guidance and pure text guidance fused in one denoising step

Related work

Compositing subjects into scenes

Extracting a subject from an image and compositing it into a new scene is a common scenario in image editing.
Traditional methods use statistical features to align the subjects and the new scenes.
Deep learning methods use deep features to achieve better performance.

Text conditional image generation

Generating semantically aligned images conditioned on text has drawn attention
Early methods used GANs and were trained on limited data
Recent methods use pre-trained text encoders and large-scale generators
Diffusion models have developed rapidly and generate high-quality, diverse images
Diffusion-based text-to-image generation methods have achieved impressive results
These methods require large datasets and great computing resources
However, they are still limited in synthesizing personalized results

Customized image generation

Several methods have been proposed to provide more control of the image generation process.
IC-GAN uses the representation of certain instances as conditions to generate new images.
Textual Inversion (TI) and DreamBooth can take both texts and subjects provided by images as conditions to create new images.
TI and DreamBooth require a lot of time and computing resources.

Unified multi-modal latent diffusion

Aim to generate high-quality images semantically aligned with input texts.
Encode texts and images into a unified multi-modal latent embedding.
Challenges include: different modalities, irrelevant information, lack of data.
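
The idea of encoding texts and images into one sequence can be illustrated with a toy sketch. It assumes a single linear projection from an image feature to the word-embedding space and a placeholder position in the prompt; the toy dimensions, the random weights, and the names `project_image_to_pseudo_word` and `build_unified_sequence` are hypothetical and are not taken from the paper.

```python
import random

def project_image_to_pseudo_word(image_feature, weight, bias):
    """Map an image feature to the word-embedding space (assumed linear layer)."""
    return [
        sum(f * w for f, w in zip(image_feature, column)) + b
        for column, b in zip(weight, bias)
    ]

def build_unified_sequence(token_embeddings, pseudo_word, placeholder_index):
    """Return a copy of the text embeddings with one slot replaced by the pseudo word."""
    unified = [list(row) for row in token_embeddings]
    unified[placeholder_index] = list(pseudo_word)
    return unified

random.seed(0)
image_dim, word_dim, seq_len = 6, 4, 5  # toy sizes, chosen only for illustration

# Stand-ins for an image-encoder output and for text token embeddings.
image_feature = [random.gauss(0, 1) for _ in range(image_dim)]
token_embeddings = [[random.gauss(0, 1) for _ in range(word_dim)] for _ in range(seq_len)]

# One weight column per output dimension of the (assumed) projection.
weight = [[random.gauss(0, 0.1) for _ in range(image_dim)] for _ in range(word_dim)]
bias = [0.0] * word_dim

pseudo_word = project_image_to_pseudo_word(image_feature, weight, bias)
unified = build_unified_sequence(token_embeddings, pseudo_word, placeholder_index=2)
```

In this sketch the resulting `unified` sequence has the same shape as the text embeddings, so it could be passed to a text-conditioned generator without changing its interface.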

Preliminaries

Diffusion models map an arbitrary data distribution to a Gaussian distribution by gradually adding noise
The reverse process uses a noise-prediction model to estimate the noise in the noisy image at step t
An additional condition (e.g. captions) can be provided to the noise-prediction model
Text encoding is vital to image quality and semantic alignment
CLIP is a widely used encoder for extracting semantic representation from texts and images
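
The forward noising step and the role of noise prediction can be sketched as follows. This uses the standard DDPM closed form for the noisy sample at step t; the linear beta schedule and its endpoints are common defaults assumed here for illustration, not values stated in this summary, and all function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule (common DDPM defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Cumulative product of (1 - beta), one value per step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, noise, alpha_bar_t):
    """Closed-form forward process: sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]

def recover_x0(x_t, predicted_noise, alpha_bar_t):
    """Invert the forward process given a noise estimate."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * n) / a for x, n in zip(x_t, predicted_noise)]

random.seed(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.2, 0.9]                      # toy "image" with three values
noise = [random.gauss(0, 1) for _ in x0]
t = 500
x_t = add_noise(x0, noise, alpha_bars[t])
# With a perfect noise prediction the clean sample is recovered exactly.
x0_hat = recover_x0(x_t, noise, alpha_bars[t])
```

A real noise-prediction model only approximates `noise`, which is why sampling proceeds over many small denoising steps rather than a single inversion.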

Fusing sampling technique

TIUE, the text-and-image unified encoder, takes the whole image as input, including both the subject and the irrelevant background.
As a result, the model can overfit to irrelevant parts of the input image.
The text input is used to help solve this overfitting problem.
A fusing sampling technique is proposed to combine the multi-modal guidance and the pure text guidance.
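
One plausible reading of the fusing step is a convex combination of the two noise predictions inside a single denoising step. The sketch below assumes that form; which term the fuse ratio weights, and the name `fuse_noise_predictions`, are assumptions made for illustration and should be checked against the paper.

```python
def fuse_noise_predictions(eps_multimodal, eps_text, alpha):
    """Blend two noise predictions for the same noisy sample and step.

    alpha = 1.0 keeps only the multi-modal (text + image) prediction,
    alpha = 0.0 keeps only the pure text prediction (assumed convention).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * m + (1.0 - alpha) * t for m, t in zip(eps_multimodal, eps_text)]

# Toy predictions standing in for two forward passes of the noise-prediction model.
eps_multimodal = [1.0, 2.0]
eps_text = [3.0, 4.0]
fused = fuse_noise_predictions(eps_multimodal, eps_text, alpha=0.5)
```

Under this reading, a smaller ratio pulls the result toward the text-only prediction, which is how the text input can counteract overfitting to the image background.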

Dataset building and model initialization

Train model using LAION-400M dataset
Automatically crop sub-images from images in dataset
Retrieve labels of sub-images from captions
Filter the dataset with rules, leaving 1.8M training samples
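
The steps above can be pictured with a small filtering sketch. The summary does not say how sub-images are cropped or which rules are applied, so everything here is hypothetical: the `Crop` record, the assumption that each crop comes with a candidate single-word label, and the minimum-size rule are illustrative choices only.

```python
import re
from dataclasses import dataclass

@dataclass
class Crop:
    label: str   # candidate class name for the cropped sub-image (assumed available)
    width: int
    height: int

def label_in_caption(label, caption):
    """Check whether a single-word label appears as a word in the caption."""
    words = re.findall(r"[a-z]+", caption.lower())
    return label.lower() in words

def keep_crop(crop, caption, min_side=64):
    """Hypothetical rule: keep crops that are large enough and named in the caption."""
    large_enough = min(crop.width, crop.height) >= min_side
    return large_enough and label_in_caption(crop.label, caption)

caption = "A corgi sitting next to a red backpack on the beach"
crops = [
    Crop("corgi", 180, 200),     # named in the caption and large enough
    Crop("backpack", 40, 50),    # named in the caption but too small
    Crop("umbrella", 300, 300),  # large enough but not in the caption
]
kept = [crop for crop in crops if keep_crop(crop, caption)]
```

Each kept crop paired with its caption would form one training sample in this toy pipeline.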

Experiments

Implementation details

Trained TIUE for 200k iterations and then trained the whole model for another 200k iterations
Batch size of 192 and learning rate of 1e-5 for both phases
Used EMA rate of 0.9999 to stabilize training process
Fuse ratio α set to 0.5 during inference
First to design unified encoder that takes joint subjects from images and texts for image generation
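
The hyperparameters listed above can be gathered in a small configuration, together with a sketch of the exponential moving average (EMA) update that the EMA rate refers to. The dictionary keys and the `ema_update` helper are hypothetical names; the update rule is the usual EMA form and is assumed rather than quoted from the paper.

```python
TRAINING_CONFIG = {
    "phase_1_iterations": 200_000,  # TIUE only
    "phase_2_iterations": 200_000,  # whole model
    "batch_size": 192,
    "learning_rate": 1e-5,
    "ema_rate": 0.9999,
    "fuse_ratio_alpha": 0.5,        # used at inference
}

def ema_update(ema_params, params, decay):
    """Move each averaged parameter a small step toward its current value."""
    return {
        name: decay * ema_params[name] + (1.0 - decay) * params[name]
        for name in params
    }

# Toy example with one scalar "weight": the average starts at 1.0
# while the trained weight stays at 0.0 for three steps.
ema = {"w": 1.0}
for _ in range(3):
    ema = ema_update(ema, {"w": 0.0}, TRAINING_CONFIG["ema_rate"])
```

With a rate this close to 1, the averaged weights change very slowly, which is what smooths out step-to-step noise in training.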

Applications

Generates images described by input text and containing objects from input images
Can generate diverse novel views of target subject
Style of result can be assigned by input text or image
Model can disentangle information of text and image
Allows users to provide multiple images in one input set

Comparisons

Proposed unified encoder to achieve multi-modality image generation
Designed text-to-image baselines to compare
Compared to state-of-the-art few-shot finetuning-based method
No online training cost
Achieves better results without a time-consuming finetuning process
Fusing sampling technique evaluated to show the trade-off in the choice of α
α of 0.5 leads to a better trade-off between input texts and images
Drawbacks of method: multi-object decomposition and rare/highly-factitious subjects
UMM-Diffusion framework for joint subject and text conditional image generation
CLIP Encoders used to extract semantics of texts and images
Fusing sampling technique used as guidance of generation process
Model initialized by pre-trained text-to-image generation model
Experiments demonstrate efficiency of method
Training on 32 NVIDIA V100 32G GPUs for 7 days
Adam optimizer used during training
Scale factor of classifier-free guidance set to 7.5
CLIP encoders are ViT-L/14 based
Noise-prediction model architecture is the same as Stable Diffusion v1-5
Results show tolerance to low-quality input images
Results show ability to synthesize images with customized style
Visualization of pseudo word embeddings shows that images of one class yield similar embeddings, while images of different classes yield different embeddings
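
The classifier-free guidance scale of 7.5 listed above can be read through the standard guidance formula, sketched below. The function name is hypothetical, and how this step is combined with the fusing sampling technique is not stated in this summary, so the sketch shows guidance on its own.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions from an unconditional and a conditional forward pass.
eps_uncond = [0.0, 0.2, -0.1]
eps_cond = [1.0, 0.2, 0.1]
guided = classifier_free_guidance(eps_uncond, eps_cond)
```

A scale of 1 returns the conditional prediction unchanged, and larger values push the sample further toward the condition, usually at some cost in diversity.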
