Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large-scale text-to-image diffusion models have made advances
- Existing models use text input alone, which can impede controllability
- GLIGEN is a novel approach that builds upon and extends existing models
- GLIGEN preserves concept knowledge of pre-trained model and injects grounding information into new trainable layers
- GLIGEN achieves open-world grounded text2img generation with caption and bounding box condition inputs
- GLIGEN outperforms existing supervised layout-to-image baselines by a large margin
Paper Content
Introduction
- Image generation research has seen advances in recent years
- GANs and text conditional autoregressive and diffusion models have been used
- These models have practical use cases and can generate high quality images
- Existing models cannot be conditioned on other input modalities apart from text
- Propose a method to provide new grounding conditional inputs to pretrained text-to-image diffusion models
- Model can generalize to unseen objects
- Model’s zero-shot performance on layout2img tasks outperforms prior state-of-the-art
- Propose a method to build upon large pretrained generative models for downstream tasks
Related work
- Autoregressive and diffusion models are state-of-the-art for text-to-image generation
- DALL-E and Parti demonstrate zero-shot and scaling up abilities
- Diffusion models have shown promising results
- Masked modeling can achieve SoTA-level generation performance
- Make-A-Scene incorporates semantic maps into text-to-image generation
- Layout2Im generates images from bounding boxes
- Existing layout2image methods are closed-set
- GANs and diffusion models have been explored for various conditioning information
- Our work investigates how to build upon existing models to enable open-set grounded image generation
Preliminaries on latent diffusion models
- Diffusion-based methods are effective for text2image tasks
- Latent Diffusion Model (LDM) and Stable Diffusion are powerful models
- LDM has two stages: mapping network to obtain latent representation and diffusion model on latent
- Training objective is to denoise latent representations of image
- LDM can generate impressive language-to-image results with pretraining on internet-scale data
Open-set grounded image generation
Grounding instruction input
- Grounding: entities described through text or example image, spatial configuration described with bounding box or keypoints
- Caption and grounding entities are processed as input tokens to the diffusion model
- Existing lay-out2img works only deal with closed-set setting
- Training data requires both text and grounding entity
- Three types of data: grounding, detection, detection and caption
- Image prompt: entity described using an image instead of language
- Keypoints: richer spatial configurations than bounding boxes
Continual learning for grounded generation
- Goal is to endow new spatial grounding capabilities to existing large language-to-image generation models
- Models pre-trained on web-scale imagetext to gain knowledge for synthesizing realistic images
- Original model weights retained while expanding new capability
- New gated self-attention layer added to enable spatial grounding ability
- Attention performed over concatenation of visual and grounding tokens
- Original denoising objective used for model continual learning
- Model learns to use additional localization information while retaining pre-trained concept knowledge
- Versatile interface allows user to ground entities that exist in caption input or add objects freely
- Scheduled sampling scheme used in inference to improve visual quality and extend model to other domains
Experiments
- Evaluated model’s grounded text2img generation in closed-set and open-set settings
- Ablated components of model
- Showed extensions to image prompt and keypoint grounded generation
- Conducted quantitative experiments using pretrained LDM on LAION
Closed-set grounded text2img generation
- Evaluated generation quality and grounding accuracy of model in closed-set setting
- Trained and evaluated on COCO2014 dataset
- Used 3 types of grounding instructions
- Compared to baseline models
- Used FID and YOLO score to evaluate
- Model trained with detection annotation instructions had best performance
- Combining data from all grounding instructions can lead to complementary benefits
- Used gated self-attention to absorb grounding instruction
- Ablated on null caption and gated cross-attention
- Achieved state-of-the-art performance for image quality and grounding accuracy
- Pretrained model on larger dataset and evaluated zero-shot and finetuned results
Open-set grounded text2img generation
- GLIGEN can generate grounded entities beyond the COCO categories
- GLIGEN learns to re-position the visual features corresponding to the grounding entities
- Model is evaluated on LVIS and outperforms supervised baseline
- Performance increases as training data is scaled up
- Model gains grounding ability compared to vanilla Stable Diffusion
Inpainting comparison
- GLIGEN can be used for inpainting tasks
- An experiment was conducted on the COCO dataset to inpaint randomly masked objects of different sizes
- Results show that GLIGEN inpainted objects more tightly occupy the missing region compared to baselines
Keypoints grounding
- Model uses bounding boxes and human keypoints as grounding conditions for generation
- Model compared to pix2pixHD
- Model trained with and without captions
- Model generates better image quality than pix2pixHD
- Model can be used to specify scene and person’s gender for image creation
Image grounding
- Image grounded generation uses a reference image to represent a grounded entity.
- Text and image grounded generation combines both text and image representations for more creative generation.
- Image grounded inpainting uses a reference image to fill in missing regions.
Scheduled sampling
- Scheduled inference time sampling can be used to improve image quality.
- Scheduled sampling can be used to extend a model trained with human keypoints to generate other objects with a human-like shape.
- Evaluating the GLIP score of the generated images on the LVIS dataset shows that scheduled sampling can improve performance.
Conclusion
- Proposed GLIGEN for expanding pretrained text2img diffusion models with grounding ability
- Demonstrated open-world generalization using bounding boxes as the grounding condition
- Method is simple and effective, and can be easily extended to other conditions
- Limitation is that generated style or aesthetic distribution can shift after adding new gated self-attention layers
- Believed adding images from more diverse style distributions or further finetuning the model with highly aesthetic images could help
- Used CLIP image encoder to get an image embedding
- Projected image features into the text feature space
- Learned N person token embedding vectors to semantically link keypoints belonging to the same person
- Inserted self-attention layer is the same as the original diffusion model self-attention layer
- Used learning rate of 5e-5 with Adam
- Randomly dropped caption and grounding tokens with 10% probability
- Finetuned model on LVIS and COCO2017 val-set
- Compared to layout2img baselines
- Grounded text2img results with bounding boxes and images or keypoints
- Struggled to generate graphics style images when τ is set to 1
- Keypoints are less generalizable than bounding boxes