Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Language-guided image generation has been successful using diffusion models.
- Texts can be too vague to accurately describe specific subjects.
- UMM-Diffusion takes joint texts and images as input and generates customized images.
- Input images are projected to pseudo word embedding and combined with text to guide image generation.
- Sampling technique of diffusion models used to eliminate irrelevant parts of input images.
- Leveraging pre-trained text-to-image generator and image encoder to generate high-quality images with complex semantics.
Paper Content
Introduction
- Synthesizing images with customized subjects is difficult
- Input data format is text and images
- Existing text-to-image models cannot achieve this goal
- Recent methods use a special text token to represent the input subject
- These methods require time and computing resources and high-resolution images
- Proposed method takes joint texts and images in one sequence as input
- Encodes them into one unified multi-modal latent space
- Novel sampling technique of diffusion models to mitigate overfitting problem
- Results of multi-modal guidance and pure text guidance fused in one denoising step
Related work
Compositing subjects into scenes
- Extracting a subject from an image and compositing it into a new scene is a common scenario in image editing.
- Traditional methods use statistic features to align the subjects and the new scenes.
- Deep learning methods use deep features to achieve better performance.
Text conditional image generation
- Generating semantically aligned images conditioned by texts has drawn attention
- Early methods used GANs and were trained on limited data
- Recent methods use pre-trained text encoders and large-scale generators
- Diffusion models have developed rapidly and generate high quality and diverse images
- Diffusion-based text-to-image generation methods have achieved impressive results
- These methods require large datasets and great computing resources
- However, they are still limited in synthesizing personalized results
Customized image generation
- Several methods have been proposed to provide more control of the image generation process.
- IC-GAN uses the representation of certain instances as conditions to generate new images.
- TI and DreamBooth can take both texts and subjects provided by images as conditions to create new images.
- TI and DreamBooth require a lot of time and computing resources.
Unified multi-modal latent diffusion
- Aim to generate high-quality images semantically aligned with input texts.
- Encode texts and images into a unified multi-modal latent embedding.
- Challenges include: different modalities, irrelevant information, lack of data.
Preliminaries
- Diffusion models map arbitrary data distribution to Gaussian distribution by adding noise
- Reverse process uses noise-prediction model to recover noise of t-step noisy image
- Additional condition (e.g. captions) can be provided to noise-prediction model
- Text encoding is vital to image quality and semantic alignment
- CLIP is a widely used encoder for extracting semantic representation from texts and images
Fusing sampling technique
- TIUE takes the whole image as input, including both the subject and the irrelevant background.
- Overfitting is a challenge that needs to be addressed.
- Text input is used to help solve the overfitting problem.
- A fusing sampling technique is proposed to combine the multimodal guidance and pure text guidance.
Dataset building and model initialization
- Train model using LAION-400M dataset
- Automatically crop sub-images from images in dataset
- Retrieve labels of sub-images from captions
- Filter dataset to 1.8M training sets with rules
Experiments
Implementation details
- Trained TIUE for 200k iterations and then trained the whole model for another 200k iterations
- Batch size of 192 and learning rate of 1e-5 for both phases
- Used EMA rate of 0.9999 to stabilize training process
- Fuse ratio ฮฑ set to 0.5 during inference
- First to design unified encoder that takes joint subjects from images and texts for image generation
Applications
- Generates images described by input text and containing objects from input images
- Can generate diverse novel views of target subject
- Style of result can be assigned by input text or image
- Model can disentangle information of text and image
- Allows users to provide multiple images in one input set
Comparisons
- Proposed unified encoder to achieve multi-modality image generation
- Designed text-to-image baselines to compare
- Compared to state-of-the-art few-shot finetuning-based method
- No online training cost
- Results prove better results without time-costing finetuning process
- Fusing sampling technique proposed to show trade-off on choice of alpha
- Alpha of 0.5 leads to better trade-off between input texts and images
- Drawbacks of method: multi-object decomposition and rare/highly-factitious subjects
- UMM-Diffusion framework for joint subject and text conditional image generation
- CLIP Encoders used to extract semantics of texts and images
- Fusing sampling technique used as guidance of generation process
- Model initialized by pre-trained text-to-image generation model
- Experiments demonstrate efficiency of method
- Training on 32 NVIDIA V100 32G GPUs for 7 days
- Adam optimizer used during training
- Scale factor of classiferfree guidance set to 7.5
- CLIP Encoders ViT-L/14 based
- Noise-prediction model architecture same to Stable Diffusion v1-5
- Results show tolerance to low-quality input images
- Results show ability to synthesize images with customized style
- Visualization of pseudo word embeddings to prove ability to extract similar embeddings from images of one class and different embeddings from images of different classes