Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-to-image synthesis has seen progress due to large pretrained language models, large-scale training data, and scalable model families.
  • Iterative evaluation is needed to generate a single sample with the best-performing models.
  • Generative adversarial networks (GANs) only need a single forward pass and are faster, but remain far behind the state-of-the-art.
  • This paper aims to identify the necessary steps to regain competitiveness.
  • StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
  • StyleGAN-T improves over previous GANs and outperforms distilled diffusion models.

Paper Content

Introduction

  • Text-to-image synthesis generates novel images based on text prompts
  • Recent advances in this task are due to two ideas: using a large pretrained language model as an encoder and using large-scale training data
  • Training datasets are increasing in size and coverage, so models must be scalable
  • Recent successes in text-to-image generation have been driven by diffusion models and autoregressive models
  • Generative adversarial networks (GANs) have not been successful in this task
  • Goal is to show GANs can regain competitiveness
  • GANs offer inference speed and control of the synthesized result via latent space manipulations
  • StyleGAN has a thoroughly studied latent space
  • StyleGAN-T achieves better zero-shot MS COCO FID than current state-of-the-art diffusion models
  • Key benefits of StyleGAN-T include fast inference speed and smooth latent space interpolation

Stylegan-xl

  • Architecture design based on StyleGAN-XL
  • Mapping network processes input latent code
  • Weight demodulation technique used
  • Synthesis network uses alias-free primitive operations
  • Discriminator design with multiple heads
  • Feature projections from two pretrained networks
  • Synthesis network trained progressively
  • Discriminator structure does not change
  • Class-conditional synthesis uses projection discriminator

Stylegan-t

  • We chose StyleGAN-XL as our baseline architecture for class-conditional ImageNet synthesis
  • We modified the baseline piece by piece, focusing on the generator, discriminator, and variation vs. text alignment tradeoff mechanisms
  • We measured the effect of our changes using zero-shot MS COCO
  • We computed the CLIP score using a ViT-g-14 model trained on LAION-2B
  • We changed the class conditioning to text conditioning by embedding the text prompts and removing the training-time classifier guidance
  • This baseline reached a zero-shot FID of 51.88 and CLIP score of 5.58 in our lightweight training configuration

Redesigning the generator

  • StyleGAN-XL uses StyleGAN3 layers to achieve translational equivariance
  • Equivariance is not necessary for text-to-image synthesis
  • Equivariance adds computational cost and poses limitations to training data
  • StyleGAN2 backbone used for synthesis layers
  • Residual convolutions and GroupNorm/Layer Scale used to increase model capacity
  • Text embeddings bypass mapping network and are split into three vectors for 2nd order polynomial network

Redesigning the discriminator

  • Redesigned discriminator retains key ideas of StyleGAN-XL
  • Feature network is ViT-S trained with self-supervised DINO objective
  • Discriminator architecture uses 5 heads spaced between transformer layers
  • Heads use 1D convolutions on token sequence
  • Differentiable data augmentation applied before feature network
  • Improves FID and CLIP score by ∼40%
  • 2.5x faster than StyleGAN-XL discriminator

Variation vs. text alignment tradeoffs

  • Guidance is an essential component of current text-to-image diffusion models
  • Guidance improves results significantly
  • CLIP image encoder is used instead of a classifier to provide additional gradients during training
  • CLIP guidance improves FID and CLIP scores
  • Generator is frozen and text encoder is trainable in secondary phase
  • Truncation is used to trade variation for higher fidelity

Experiments

  • Model size increased to 1 billion parameters without instabilities
  • Trained on 250M textimage pairs
  • Training time was 4 weeks on 64 A100 GPUs
  • Hyperparameters and dataset details in Appendix A
  • Total compute budget is about a quarter of Stable Diffusion’s

Quantitative comparison to state-of-the-art

  • We compare the performance of our model to the state-of-the-art quantitatively at 64x64 pixel output resolution.
  • GANs can match or even beat current DMs in large-scale text-to-image synthesis at low resolution.
  • A powerful superresolution model is crucial, as FID almost doubles in StyleGAN-T when moving from 64x64 to 256x256.

Evaluating variation vs. text alignment

  • FID-CLIP score curves are reported in Fig. 5
  • StyleGAN-T is compared to a strong and fast DM baseline
  • FID-CLIP score curves are evaluated in Fig. 6
  • Text encoder is fine-tuned to improve CLIP score without compromising FID
  • StyleGAN-T can generate a wide variety of styles as shown in Fig. 8
  • Subjects tend to be aligned for a fixed latent z

Limitations and future work

  • DALL•E 2 and StyleGAN-T struggle to bind attributes to objects and produce coherent text in images
  • CLIP loss is important for good text alignment, but too much guidance strength can cause image artifacts
  • Truncation improves text alignment, but alternative methods might further improve results
  • Future work could include improved super-resolution stages and personalizing GANs

A. configuration details

  • Two training configurations used: lightweight and full
  • Lightweight configuration uses CC12M dataset at 64x64 resolution without progressive growing
  • Full configuration uses union of several datasets, 250M text-image pairs, and progressive growing
  • Training budget spent on resolutions up to 64x64
  • Training schedules listed in Table 5

B. truncation grids

  • Quality vs. speed in large-scale text-to-image synthesis
  • Generates samples at 10 FPS on NVIDIA A100
  • Allows for smooth interpolations between prompts
  • Generator architecture is related to StyleGAN2
  • Discriminator processes intermediate tokens of a DINO-trained vision transformer
  • Text prompt is embedded using CLIP
  • Truncation improves text alignment
  • Compares FID-CLIP score curves of StyleGAN-T, distilled Stable Diffusion, and eDiff-I
  • Latent manipulation and styles can be generated