Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text-to-image synthesis has seen progress due to large pretrained language models, large-scale training data, and scalable model families.
Iterative evaluation is needed to generate a single sample with the best-performing models.
Generative adversarial networks (GANs) only need a single forward pass and are faster, but remain far behind the state-of-the-art.
This paper aims to identify the necessary steps to regain competitiveness.
StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
StyleGAN-T improves over previous GANs and outperforms distilled diffusion models.

Paper Content

Introduction

Text-to-image synthesis generates novel images based on text prompts
Recent advances in this task are due to two ideas: using a large pretrained language model as an encoder and using large-scale training data
Training datasets are increasing in size and coverage, so models must be scalable
Recent successes in text-to-image generation have been driven by diffusion models and autoregressive models
Generative adversarial networks (GANs) have not been successful in this task
Goal is to show GANs can regain competitiveness
GANs offer inference speed and control of the synthesized result via latent space manipulations
StyleGAN has a thoroughly studied latent space
StyleGAN-T achieves better zero-shot MS COCO FID than current state-of-the-art diffusion models at 64x64 resolution
Key benefits of StyleGAN-T include fast inference speed and smooth latent space interpolation

StyleGAN-XL

Architecture design based on StyleGAN-XL
Mapping network processes input latent code
Weight demodulation technique used
Synthesis network uses alias-free primitive operations
Discriminator design with multiple heads
Feature projections from two pretrained networks
Synthesis network trained progressively
Discriminator structure does not change
Class-conditional synthesis uses projection discriminator

StyleGAN-T

We chose StyleGAN-XL as our baseline architecture because of its strong performance in class-conditional ImageNet synthesis
We modified the baseline piece by piece, focusing on the generator, discriminator, and variation vs. text alignment tradeoff mechanisms
We measured the effect of our changes using zero-shot MS COCO FID and CLIP score
We computed the CLIP score using a ViT-g-14 model trained on LAION-2B
We changed the class conditioning to text conditioning by embedding the text prompts and removing the training-time classifier guidance
This baseline reached a zero-shot FID of 51.88 and CLIP score of 5.58 in our lightweight training configuration

Redesigning the generator

StyleGAN-XL uses StyleGAN3 layers to achieve translational equivariance
Equivariance is not necessary for text-to-image synthesis
Equivariance adds computational cost and poses limitations to training data
StyleGAN2 backbone used for synthesis layers
Residual convolutions and GroupNorm/Layer Scale used to increase model capacity
Text embeddings bypass the mapping network; each layer's affine-transformed style is split into three vectors that are combined as a 2nd order polynomial network
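
The second-order polynomial conditioning noted above can be sketched as follows. This is a minimal illustration, assuming the three vectors have equal size and are combined element-wise as s = s1 * s2 + s3. The function name `polynomial_style` and the use of plain Python lists are hypothetical; in a real model the input would come from a learned affine layer, which is not modeled here.

```python
def polynomial_style(affine_out):
    """Split a flat vector into three equal parts and return s1 * s2 + s3.

    Hypothetical helper: `affine_out` stands in for the output of a learned
    affine layer, which this sketch does not model.
    """
    if len(affine_out) % 3 != 0:
        raise ValueError("input length must be divisible by 3")
    n = len(affine_out) // 3
    s1 = affine_out[:n]
    s2 = affine_out[n:2 * n]
    s3 = affine_out[2 * n:]
    # Element-wise second-order term (s1 * s2) plus a first-order term (s3).
    return [a * b + c for a, b, c in zip(s1, s2, s3)]


style = polynomial_style([1.0, 2.0, 3.0, 4.0, 0.5, -1.0])
print(style)
```

Compared with a plain affine style, the product term lets one part of the vector scale another, which is one plausible reading of why the conditioning becomes more expressive.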

Redesigning the discriminator

Redesigned discriminator retains key ideas of StyleGAN-XL
Feature network is ViT-S trained with self-supervised DINO objective
Discriminator architecture uses 5 heads spaced between transformer layers
Heads use 1D convolutions on token sequence
Differentiable data augmentation applied before feature network
Improves FID and CLIP score by ∼40%
2.5x faster than StyleGAN-XL discriminator

Variation vs. text alignment tradeoffs

Guidance is an essential component of current text-to-image diffusion models
Guidance improves results significantly
CLIP image encoder is used instead of a classifier to provide additional gradients during training
CLIP guidance improves FID and CLIP scores
Generator is frozen and text encoder is trainable in secondary phase
Truncation is used to trade variation for higher fidelity
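
To make the CLIP guidance idea more concrete, here is a small sketch of a guidance term. It assumes the term is a squared spherical distance between the normalized CLIP image embedding of a generated image and the normalized text embedding; the summary above does not spell out the exact form, so treat that choice, the function names, and the `weight` parameter as assumptions for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def clip_guidance_loss(image_emb, text_emb):
    """Squared angle (in radians) between two embeddings (assumed loss form)."""
    a, b = normalize(image_emb), normalize(text_emb)
    cos = sum(x * y for x, y in zip(a, b))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.acos(cos) ** 2


def guided_generator_loss(gan_loss, clip_loss, weight):
    """Hypothetical combination: adversarial loss plus weighted guidance term."""
    return gan_loss + weight * clip_loss
```

A larger `weight` pulls generated images more strongly toward the text embedding, which is the knob that trades variation for text alignment.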

Experiments

Model size increased to 1 billion parameters without instabilities
Trained on 250M text-image pairs
Training time was 4 weeks on 64 A100 GPUs
Hyperparameters and dataset details in Appendix A
Total compute budget is about a quarter of Stable Diffusion’s
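
As a quick worked example of the training cost stated above, the snippet below converts "4 weeks on 64 A100 GPUs" into GPU-days and GPU-hours. It assumes all 64 GPUs were in use for the full four weeks, which the summary does not confirm; the variable names are illustrative.

```python
weeks = 4
gpus = 64

# Assumes every GPU ran for the whole period.
gpu_days = weeks * 7 * gpus
gpu_hours = gpu_days * 24

print(gpu_days)   # 1792
print(gpu_hours)  # 43008
```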

Quantitative comparison to state-of-the-art

We compare the performance of our model to the state-of-the-art quantitatively at 64x64 pixel output resolution.
GANs can match or even beat current DMs in large-scale text-to-image synthesis at low resolution.
A powerful superresolution model is crucial, as FID almost doubles in StyleGAN-T when moving from 64x64 to 256x256.

Evaluating variation vs. text alignment

FID-CLIP score curves are reported in Fig. 5
StyleGAN-T is compared to a strong and fast DM baseline
FID-CLIP score curves are evaluated in Fig. 6
Text encoder is fine-tuned to improve CLIP score without compromising FID
StyleGAN-T can generate a wide variety of styles as shown in Fig. 8
Subjects tend to be aligned for a fixed latent z

Limitations and future work

DALL·E 2 and StyleGAN-T struggle to bind attributes to objects and produce coherent text in images
CLIP loss is important for good text alignment, but too much guidance strength can cause image artifacts
Truncation improves text alignment, but alternative methods might further improve results
Future work could include improved super-resolution stages and personalizing GANs

A. Configuration details

Two training configurations used: lightweight and full
Lightweight configuration uses CC12M dataset at 64x64 resolution without progressive growing
Full configuration uses union of several datasets, 250M text-image pairs, and progressive growing
Training budget spent on resolutions up to 64x64
Training schedules listed in Table 5

B. Truncation grids

Quality vs. speed in large-scale text-to-image synthesis
Generates samples at 10 FPS on NVIDIA A100
Allows for smooth interpolations between prompts
Generator architecture is related to StyleGAN2
Discriminator processes intermediate tokens of a DINO-trained vision transformer
Text prompt is embedded using CLIP
Truncation improves text alignment
Compares FID-CLIP score curves of StyleGAN-T, distilled Stable Diffusion, and eDiff-I
Latent manipulation and styles can be generated
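
Truncation, as referred to in this section, can be sketched as a linear interpolation of a latent vector toward a mean latent. This is a minimal sketch following the common StyleGAN convention, where `psi = 1.0` leaves the latent unchanged and `psi = 0.0` collapses it to the mean. The function name, the list representation, and how the mean is obtained are assumptions, not details taken from the summary.

```python
def truncate(w, w_mean, psi):
    """Move latent `w` toward `w_mean`; smaller `psi` means stronger truncation."""
    if not 0.0 <= psi <= 1.0:
        raise ValueError("psi must be in [0, 1]")
    if len(w) != len(w_mean):
        raise ValueError("w and w_mean must have the same length")
    return [m + psi * (x - m) for x, m in zip(w, w_mean)]


w = [2.0, 4.0]
w_mean = [0.0, 0.0]
print(truncate(w, w_mean, 0.5))  # [1.0, 2.0]
```

Lower `psi` values reduce variation between samples, which matches the tradeoff between variation and fidelity or text alignment described in the text.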
