Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large-scale text-to-image diffusion models have made advances
  • Existing models use text input alone, which can impede controllability
  • GLIGEN is a novel approach that builds upon and extends existing models
  • GLIGEN preserves concept knowledge of pre-trained model and injects grounding information into new trainable layers
  • GLIGEN achieves open-world grounded text2img generation with caption and bounding box condition inputs
  • GLIGEN outperforms existing supervised layout-to-image baselines by a large margin

Paper Content

Introduction

  • Image generation research has seen advances in recent years
  • GANs and text conditional autoregressive and diffusion models have been used
  • These models have practical use cases and can generate high quality images
  • Existing models cannot be conditioned on other input modalities apart from text
  • Propose a method to provide new grounding conditional inputs to pretrained text-to-image diffusion models
  • Model can generalize to unseen objects
  • Model’s zero-shot performance on layout2img tasks outperforms prior state-of-the-art
  • Propose a method to build upon large pretrained generative models for downstream tasks
  • Autoregressive and diffusion models are state-of-the-art for text-to-image generation
  • DALL-E and Parti demonstrate zero-shot and scaling up abilities
  • Diffusion models have shown promising results
  • Masked modeling can achieve SoTA-level generation performance
  • Make-A-Scene incorporates semantic maps into text-to-image generation
  • Layout2Im generates images from bounding boxes
  • Existing layout2image methods are closed-set
  • GANs and diffusion models have been explored for various conditioning information
  • Our work investigates how to build upon existing models to enable open-set grounded image generation

Preliminaries on latent diffusion models

  • Diffusion-based methods are effective for text2image tasks
  • Latent Diffusion Model (LDM) and Stable Diffusion are powerful models
  • LDM has two stages: mapping network to obtain latent representation and diffusion model on latent
  • Training objective is to denoise latent representations of image
  • LDM can generate impressive language-to-image results with pretraining on internet-scale data

Open-set grounded image generation

Grounding instruction input

  • Grounding: entities described through text or example image, spatial configuration described with bounding box or keypoints
  • Caption and grounding entities are processed as input tokens to the diffusion model
  • Existing lay-out2img works only deal with closed-set setting
  • Training data requires both text and grounding entity
  • Three types of data: grounding, detection, detection and caption
  • Image prompt: entity described using an image instead of language
  • Keypoints: richer spatial configurations than bounding boxes

Continual learning for grounded generation

  • Goal is to endow new spatial grounding capabilities to existing large language-to-image generation models
  • Models pre-trained on web-scale imagetext to gain knowledge for synthesizing realistic images
  • Original model weights retained while expanding new capability
  • New gated self-attention layer added to enable spatial grounding ability
  • Attention performed over concatenation of visual and grounding tokens
  • Original denoising objective used for model continual learning
  • Model learns to use additional localization information while retaining pre-trained concept knowledge
  • Versatile interface allows user to ground entities that exist in caption input or add objects freely
  • Scheduled sampling scheme used in inference to improve visual quality and extend model to other domains

Experiments

  • Evaluated model’s grounded text2img generation in closed-set and open-set settings
  • Ablated components of model
  • Showed extensions to image prompt and keypoint grounded generation
  • Conducted quantitative experiments using pretrained LDM on LAION

Closed-set grounded text2img generation

  • Evaluated generation quality and grounding accuracy of model in closed-set setting
  • Trained and evaluated on COCO2014 dataset
  • Used 3 types of grounding instructions
  • Compared to baseline models
  • Used FID and YOLO score to evaluate
  • Model trained with detection annotation instructions had best performance
  • Combining data from all grounding instructions can lead to complementary benefits
  • Used gated self-attention to absorb grounding instruction
  • Ablated on null caption and gated cross-attention
  • Achieved state-of-the-art performance for image quality and grounding accuracy
  • Pretrained model on larger dataset and evaluated zero-shot and finetuned results

Open-set grounded text2img generation

  • GLIGEN can generate grounded entities beyond the COCO categories
  • GLIGEN learns to re-position the visual features corresponding to the grounding entities
  • Model is evaluated on LVIS and outperforms supervised baseline
  • Performance increases as training data is scaled up
  • Model gains grounding ability compared to vanilla Stable Diffusion

Inpainting comparison

  • GLIGEN can be used for inpainting tasks
  • An experiment was conducted on the COCO dataset to inpaint randomly masked objects of different sizes
  • Results show that GLIGEN inpainted objects more tightly occupy the missing region compared to baselines

Keypoints grounding

  • Model uses bounding boxes and human keypoints as grounding conditions for generation
  • Model compared to pix2pixHD
  • Model trained with and without captions
  • Model generates better image quality than pix2pixHD
  • Model can be used to specify scene and person’s gender for image creation

Image grounding

  • Image grounded generation uses a reference image to represent a grounded entity.
  • Text and image grounded generation combines both text and image representations for more creative generation.
  • Image grounded inpainting uses a reference image to fill in missing regions.

Scheduled sampling

  • Scheduled inference time sampling can be used to improve image quality.
  • Scheduled sampling can be used to extend a model trained with human keypoints to generate other objects with a human-like shape.
  • Evaluating the GLIP score of the generated images on the LVIS dataset shows that scheduled sampling can improve performance.

Conclusion

  • Proposed GLIGEN for expanding pretrained text2img diffusion models with grounding ability
  • Demonstrated open-world generalization using bounding boxes as the grounding condition
  • Method is simple and effective, and can be easily extended to other conditions
  • Limitation is that generated style or aesthetic distribution can shift after adding new gated self-attention layers
  • Believed adding images from more diverse style distributions or further finetuning the model with highly aesthetic images could help
  • Used CLIP image encoder to get an image embedding
  • Projected image features into the text feature space
  • Learned N person token embedding vectors to semantically link keypoints belonging to the same person
  • Inserted self-attention layer is the same as the original diffusion model self-attention layer
  • Used learning rate of 5e-5 with Adam
  • Randomly dropped caption and grounding tokens with 10% probability
  • Finetuned model on LVIS and COCO2017 val-set
  • Compared to layout2img baselines
  • Grounded text2img results with bounding boxes and images or keypoints
  • Struggled to generate graphics style images when τ is set to 1
  • Keypoints are less generalizable than bounding boxes