Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large-scale text-to-image diffusion models have made advances
Existing models use text input alone, which can impede controllability
GLIGEN is a novel approach that builds upon and extends existing models
GLIGEN preserves the concept knowledge of the pre-trained model and injects grounding information into new trainable layers
GLIGEN achieves open-world grounded text2img generation with caption and bounding box condition inputs
GLIGEN outperforms existing supervised layout-to-image baselines by a large margin

Paper Content

Introduction

Image generation research has seen advances in recent years
GANs and text conditional autoregressive and diffusion models have been used
These models have practical use cases and can generate high quality images
Existing models cannot be conditioned on other input modalities apart from text
Propose a method to provide new grounding conditional inputs to pretrained text-to-image diffusion models
Model can generalize to unseen objects
Model’s zero-shot performance on layout2img tasks outperforms prior state-of-the-art
Propose a method to build upon large pretrained generative models for downstream tasks

Related work

Autoregressive and diffusion models are state-of-the-art for text-to-image generation
DALL-E and Parti demonstrate zero-shot and scaling up abilities
Diffusion models have shown promising results
Masked modeling can achieve SoTA-level generation performance
Make-A-Scene incorporates semantic maps into text-to-image generation
Layout2Im generates images from bounding boxes
Existing layout2image methods are closed-set
GANs and diffusion models have been explored for various conditioning information
Our work investigates how to build upon existing models to enable open-set grounded image generation

Preliminaries on latent diffusion models

Diffusion-based methods are effective for text2image tasks
Latent Diffusion Model (LDM) and Stable Diffusion are powerful models
LDM has two stages: a mapping network that obtains a latent representation of the image, and a diffusion model trained on that latent
Training objective is to denoise latent representations of image
LDM can generate impressive language-to-image results with pretraining on internet-scale data
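
The denoising objective above can be illustrated with a minimal sketch. It assumes the common noise-prediction formulation, in which a clean latent is mixed with Gaussian noise and the model is scored on how well it recovers that noise. The names `add_noise`, `denoising_loss`, and `alpha_bar_t` are illustrative and do not come from the paper summary; no real denoiser is trained here.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Forward diffusion: mix the clean latent z0 with Gaussian noise eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def denoising_loss(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for an encoded image latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # noise the model must recover
z_t = add_noise(z0, eps, alpha_bar_t=0.5)      # noisy latent fed to the denoiser

# A hypothetical perfect denoiser returns eps exactly, giving zero loss.
perfect_loss = denoising_loss(eps, eps)
# A denoiser that always predicts zero noise is penalised.
zero_loss = denoising_loss(eps, [0.0] * len(eps))
```

In a real LDM the prediction would come from a network that also receives the timestep and the caption condition.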

Open-set grounded image generation

Grounding instruction input

Grounding: entities described through text or example image, spatial configuration described with bounding box or keypoints
Caption and grounding entities are processed as input tokens to the diffusion model
Existing layout2img works only handle the closed-set setting
Training data requires both text and grounding entity
Three types of data: grounding, detection, detection and caption
Image prompt: entity described using an image instead of language
Keypoints: richer spatial configurations than bounding boxes
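
One way to turn a text-described entity and its bounding box into a grounding token is sketched below. It assumes the box is embedded with sine and cosine features (a Fourier embedding) and concatenated with the entity's text feature. The functions `fourier_embed` and `grounding_token`, the number of frequencies, and the toy text feature are all assumptions for illustration; the learned projection that would follow is omitted.

```python
import math

def fourier_embed(box, num_freqs=8):
    """Embed normalised box coordinates with sin/cos at geometric frequencies."""
    features = []
    for coord in box:  # (x_min, y_min, x_max, y_max), each in [0, 1]
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

def grounding_token(text_feature, box):
    """Concatenate an entity's text feature with its box embedding.

    A trained model would pass this vector through a learned MLP to reach
    the token width; that projection is left out of this sketch.
    """
    return list(text_feature) + fourier_embed(box)

# Toy 3-dimensional text feature standing in for a text-encoder output.
token = grounding_token([0.1, -0.3, 0.7], (0.25, 0.25, 0.75, 0.9))
```

Each coordinate contributes `2 * num_freqs` values, so a four-coordinate box yields 64 features with the default setting.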

Continual learning for grounded generation

Goal is to endow existing large language-to-image generation models with new spatial grounding capabilities
Models are pre-trained on web-scale image-text data to gain the knowledge needed to synthesize realistic images
Original model weights retained while expanding new capability
New gated self-attention layer added to enable spatial grounding ability
Attention performed over concatenation of visual and grounding tokens
Original denoising objective used for model continual learning
Model learns to use additional localization information while retaining pre-trained concept knowledge
Versatile interface allows user to ground entities that exist in caption input or add objects freely
Scheduled sampling scheme used in inference to improve visual quality and extend model to other domains
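
The gated self-attention idea can be sketched in plain Python. The sketch assumes a single attention head with identity query, key, and value projections (a simplification), and a gate of the form `tanh(gamma)` with `gamma` starting at zero so that the layer initially leaves the pretrained features untouched. The gate form and all function names are assumptions for illustration, not a statement of the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (simplified)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def gated_self_attention(visual, grounding, gamma):
    """Attend over [visual; grounding], keep visual positions, add a gated residual."""
    attended = self_attention(visual + grounding)
    kept = attended[:len(visual)]  # discard outputs at grounding positions
    gate = math.tanh(gamma)        # gamma = 0 means the new layer is a no-op
    return [[v_i + gate * a_i for v_i, a_i in zip(v, a)]
            for v, a in zip(visual, kept)]

visual = [[1.0, 0.0], [0.0, 1.0]]  # toy visual tokens
grounding = [[1.0, 1.0]]           # toy grounding token
at_init = gated_self_attention(visual, grounding, gamma=0.0)
after_training = gated_self_attention(visual, grounding, gamma=1.0)
```

With `gamma=0.0` the output equals the input visual tokens, which is one way to retain the original model's behavior at the start of continual learning.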

Experiments

Evaluated model’s grounded text2img generation in closed-set and open-set settings
Ablated components of model
Showed extensions to image prompt and keypoint grounded generation
Conducted quantitative experiments using pretrained LDM on LAION

Closed-set grounded text2img generation

Evaluated generation quality and grounding accuracy of model in closed-set setting
Trained and evaluated on COCO2014 dataset
Used 3 types of grounding instructions
Compared to baseline models
Used FID and YOLO score to evaluate
Model trained with detection annotation instructions had best performance
Combining data from all grounding instructions can lead to complementary benefits
Used gated self-attention to absorb grounding instruction
Ablated on null caption and gated cross-attention
Achieved state-of-the-art performance for image quality and grounding accuracy
Pretrained model on larger dataset and evaluated zero-shot and finetuned results
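
Detection-based grounding metrics such as the YOLO score depend on comparing boxes found in a generated image with the boxes given as input. The basic ingredient of such a comparison is intersection over union (IoU). The sketch below shows only that overlap computation, assuming boxes in `(x_min, y_min, x_max, y_max)` form; it is not the full evaluation protocol, and `box_iou` is an illustrative name.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

same = box_iou((0, 0, 2, 2), (0, 0, 2, 2))      # identical boxes
apart = box_iou((0, 0, 1, 1), (2, 2, 3, 3))     # no overlap
partial = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union area 7
```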

Open-set grounded text2img generation

GLIGEN can generate grounded entities beyond the COCO categories
GLIGEN learns to re-position the visual features corresponding to the grounding entities
Model is evaluated on LVIS and outperforms supervised baseline
Performance increases as training data is scaled up
Model gains grounding ability compared to vanilla Stable Diffusion

Inpainting comparison

GLIGEN can be used for inpainting tasks
An experiment was conducted on the COCO dataset to inpaint randomly masked objects of different sizes
Results show that GLIGEN-inpainted objects occupy the missing region more tightly than those produced by baselines

Keypoints grounding

Model uses bounding boxes and human keypoints as grounding conditions for generation
Model compared to pix2pixHD
Model trained with and without captions
Model generates better image quality than pix2pixHD
Model can be used to specify scene and person’s gender for image creation

Image grounding

Image grounded generation uses a reference image to represent a grounded entity.
Text and image grounded generation combines both text and image representations for more creative generation.
Image grounded inpainting uses a reference image to fill in missing regions.

Scheduled sampling

Scheduled inference time sampling can be used to improve image quality.
Scheduled sampling can be used to extend a model trained with human keypoints to generate other objects with a human-like shape.
Evaluating the GLIP score of the generated images on the LVIS dataset shows that scheduled sampling can improve performance.
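
Scheduled sampling can be sketched as a per-step switch on the grounding layers. The sketch assumes a fraction `tau` of the sampling steps, counted from the start of sampling, runs with grounding enabled (weight 1) and the rest runs as the original model (weight 0). The function name and the step-indexing convention are assumptions for illustration.

```python
def beta_schedule(num_steps, tau):
    """Return one grounding weight per sampling step.

    Steps before tau * num_steps use the grounding layers (1.0);
    later steps fall back to the original model (0.0).
    """
    cutoff = tau * num_steps
    return [1.0 if step < cutoff else 0.0 for step in range(num_steps)]

half = beta_schedule(4, 0.5)        # grounded for the first two steps only
always = beta_schedule(4, 1.0)      # grounding used throughout
never = beta_schedule(4, 0.0)       # plain pretrained model
```

Setting `tau` to 1 keeps grounding on at every step, while smaller values hand the later steps back to the pretrained model.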

Conclusion

Proposed GLIGEN for expanding pretrained text2img diffusion models with grounding ability
Demonstrated open-world generalization using bounding boxes as the grounding condition
Method is simple and effective, and can be easily extended to other conditions
Limitation is that generated style or aesthetic distribution can shift after adding new gated self-attention layers
Adding images from more diverse style distributions, or further finetuning the model on highly aesthetic images, could help with this limitation
Used CLIP image encoder to get an image embedding
Projected image features into the text feature space
Learned N person token embedding vectors to semantically link keypoints belonging to the same person
Inserted self-attention layer is the same as the original diffusion model self-attention layer
Used learning rate of 5e-5 with Adam
Randomly dropped caption and grounding tokens with 10% probability
Finetuned model on LVIS and COCO2017 val-set
Compared to layout2img baselines
Grounded text2img results with bounding boxes and images or keypoints
Struggled to generate graphics style images when τ is set to 1
Keypoints are less generalizable than bounding boxes
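
The random dropping of caption and grounding tokens with 10% probability, noted above, can be sketched as follows. The sketch assumes the two inputs are dropped independently and that a dropped input is replaced by an empty value; the summary does not state either detail, and `maybe_drop` is an illustrative name.

```python
import random

def maybe_drop(caption, grounding_tokens, p=0.1, rng=random):
    """Independently replace each conditioning input with an empty one with probability p."""
    if rng.random() < p:
        caption = ""              # assumed stand-in for a null caption
    if rng.random() < p:
        grounding_tokens = []     # assumed stand-in for null grounding
    return caption, grounding_tokens

rng = random.Random(0)
kept = maybe_drop("a dog on a beach", [[0.1, 0.2]], p=0.0, rng=rng)
dropped = maybe_drop("a dog on a beach", [[0.1, 0.2]], p=1.0, rng=rng)
```

Training with occasional empty conditions is what lets a model be sampled later both with and without each input.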
