Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-to-image synthesis has seen progress due to large pretrained language models, large-scale training data, and scalable model families.
- The best-performing models require iterative evaluation to generate a single sample.
- Generative adversarial networks (GANs) only need a single forward pass and are faster, but remain far behind the state-of-the-art.
- This paper aims to identify the necessary steps to regain competitiveness.
- StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
- StyleGAN-T improves over previous GANs and outperforms distilled diffusion models.
Paper Content
Introduction
- Text-to-image synthesis generates novel images based on text prompts
- Recent advances in this task are due to two ideas: using a large pretrained language model as an encoder and using large-scale training data
- Training datasets are increasing in size and coverage, so models must be scalable
- Recent successes in text-to-image generation have been driven by diffusion models and autoregressive models
- Generative adversarial networks (GANs) have not been successful in this task
- Goal is to show GANs can regain competitiveness
- GANs offer inference speed and control of the synthesized result via latent space manipulations
- StyleGAN has a thoroughly studied latent space
- StyleGAN-T achieves better zero-shot MS COCO FID than current state-of-the-art diffusion models at 64x64 resolution
- Key benefits of StyleGAN-T include fast inference speed and smooth latent space interpolation
StyleGAN-XL
- Architecture design based on StyleGAN-XL
- Mapping network processes input latent code
- Weight demodulation technique used
- Synthesis network uses alias-free primitive operations
- Discriminator design with multiple heads
- Feature projections from two pretrained networks
- Synthesis network trained progressively
- Discriminator structure does not change
- Class-conditional synthesis uses projection discriminator
StyleGAN-T
- We chose StyleGAN-XL as our baseline architecture for class-conditional ImageNet synthesis
- We modified the baseline piece by piece, focusing on the generator, discriminator, and variation vs. text alignment tradeoff mechanisms
- We measured the effect of our changes using zero-shot MS COCO FID and CLIP score
- We computed the CLIP score using a ViT-g-14 model trained on LAION-2B
- We changed the class conditioning to text conditioning by embedding the text prompts and removing the training-time classifier guidance
- This baseline reached a zero-shot FID of 51.88 and CLIP score of 5.58 in our lightweight training configuration
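As a rough illustration of the CLIP score mentioned above, here is a minimal sketch. It assumes the score is the cosine similarity between an image embedding and its caption embedding, averaged over pairs. The function names are hypothetical, the embeddings are made-up toy vectors rather than ViT-g-14 outputs, and any scaling applied to the reported numbers is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(image_embeddings, text_embeddings):
    """Mean cosine similarity over paired image/text embeddings (assumed definition)."""
    if len(image_embeddings) != len(text_embeddings):
        raise ValueError("need one caption embedding per image embedding")
    sims = [cosine_similarity(i, t) for i, t in zip(image_embeddings, text_embeddings)]
    return sum(sims) / len(sims)


# Toy data: the first pair is perfectly aligned, the second is orthogonal.
toy_images = [[1.0, 0.0], [0.0, 2.0]]
toy_texts = [[1.0, 0.0], [1.0, 0.0]]
score = clip_score(toy_images, toy_texts)
```

In practice the embeddings would come from the CLIP image and text encoders; only the similarity step is shown here.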
Redesigning the generator
- StyleGAN-XL uses StyleGAN3 layers to achieve translational equivariance
- Equivariance is not necessary for text-to-image synthesis
- Equivariance adds computational cost and places limitations on the training data
- StyleGAN2 backbone used for synthesis layers
- Residual convolutions and GroupNorm/Layer Scale used to increase model capacity
- Text embeddings bypass the mapping network; the resulting per-layer style is split into three vectors that are combined as a 2nd order polynomial network
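The split into three vectors can be sketched as follows. This is a minimal sketch, assuming the three parts are combined element-wise as s = s1 * s2 + s3, which is one common form of a second-order polynomial network; the exact formulation should be checked against the paper. The function names and the toy vector are hypothetical.

```python
def split_in_three(style):
    """Split a style vector into three equal-length parts."""
    if len(style) % 3 != 0:
        raise ValueError("style length must be divisible by 3")
    n = len(style) // 3
    return style[:n], style[n:2 * n], style[2 * n:]


def polynomial_style(style):
    """Combine the three parts as s1 * s2 + s3, element-wise (assumed form)."""
    s1, s2, s3 = split_in_three(style)
    return [a * b + c for a, b, c in zip(s1, s2, s3)]


# Toy style vector of length 6, so each part has length 2.
raw_style = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
final_style = polynomial_style(raw_style)
```

The multiplicative term lets one part of the conditioning scale another, which a purely affine modulation cannot express.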
Redesigning the discriminator
- Redesigned discriminator retains key ideas of StyleGAN-XL
- Feature network is ViT-S trained with self-supervised DINO objective
- Discriminator architecture uses 5 heads spaced between transformer layers
- Heads use 1D convolutions on token sequence
- Differentiable data augmentation applied before feature network
- Improves FID and CLIP score by ∼40%
- 2.5x faster than StyleGAN-XL discriminator
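To make the idea of heads applying 1D convolutions on a token sequence concrete, here is a toy sketch in plain Python. It is not the authors' architecture: the kernel size, the absence of normalization and conditioning, and the way per-head logits are averaged and summed are all assumptions, and the names are hypothetical.

```python
def conv1d_over_tokens(tokens, kernel, bias=0.0):
    """Valid 1D cross-correlation along the token axis.

    tokens: list of T feature vectors, each of dimension D
    kernel: list of K weight vectors, each of dimension D
    Returns T - K + 1 scalar logits, one per window of K tokens.
    """
    k = len(kernel)
    if len(tokens) < k:
        raise ValueError("need at least as many tokens as kernel taps")
    logits = []
    for start in range(len(tokens) - k + 1):
        acc = bias
        for offset in range(k):
            weights = kernel[offset]
            features = tokens[start + offset]
            acc += sum(w * x for w, x in zip(weights, features))
        logits.append(acc)
    return logits


def discriminator_logit(per_layer_tokens, per_head_kernels):
    """Average each head's logits, then sum over heads (assumed aggregation)."""
    total = 0.0
    for tokens, kernel in zip(per_layer_tokens, per_head_kernels):
        head_logits = conv1d_over_tokens(tokens, kernel)
        total += sum(head_logits) / len(head_logits)
    return total


# Toy token sequence (3 tokens, 2 features) and a kernel with 2 taps.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 1.0], [0.5, -0.5]]
head_logits = conv1d_over_tokens(tokens, kernel)

# Five heads, here all fed the same toy tokens for simplicity.
total_logit = discriminator_logit([tokens] * 5, [kernel] * 5)
```

In the real model each head would read tokens from a different depth of the frozen feature network, so the five inputs would differ.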
Variation vs. text alignment tradeoffs
- Guidance is an essential component of current text-to-image diffusion models
- Guidance improves results significantly
- CLIP image encoder is used instead of a classifier to provide additional gradients during training
- CLIP guidance improves FID and CLIP scores
- Generator is frozen and text encoder is trainable in secondary phase
- Truncation is used to trade variation for higher fidelity
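The CLIP guidance term can be sketched as an extra loss on the generator. This is a minimal sketch, assuming the guidance term is the squared spherical distance between the CLIP embedding of the generated image and the embedding of the prompt, added to the adversarial loss with a weight. The function names, the weight, and the toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def spherical_distance_sq(a, b):
    """Squared angle (in radians) between two vectors on the unit sphere."""
    a, b = normalize(a), normalize(b)
    cos = sum(x * y for x, y in zip(a, b))
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.acos(cos) ** 2


def generator_loss(adversarial_loss, image_emb, text_emb, clip_weight):
    """Adversarial loss plus weighted CLIP guidance (assumed combination)."""
    return adversarial_loss + clip_weight * spherical_distance_sq(image_emb, text_emb)


# Orthogonal embeddings give an angle of pi/2; aligned embeddings give 0.
orthogonal = spherical_distance_sq([1.0, 0.0], [0.0, 3.0])
aligned = spherical_distance_sq([2.0, 0.0], [5.0, 0.0])
loss = generator_loss(1.0, [2.0, 0.0], [5.0, 0.0], clip_weight=0.2)
```

The weight plays the role of a guidance strength: a larger value pulls images toward the prompt embedding more strongly.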
Experiments
- Model size increased to 1 billion parameters without instabilities
- Trained on 250M text-image pairs
- Training time was 4 weeks on 64 A100 GPUs
- Hyperparameters and dataset details in Appendix A
- Total compute budget is about a quarter of Stable Diffusion’s
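The compute figures above reduce to simple arithmetic. The sketch below uses only the numbers quoted in this section; the implied Stable Diffusion budget is a back-of-the-envelope inference from "about a quarter", not a reported figure.

```python
GPUS = 64   # A100 GPUs, as quoted above
WEEKS = 4   # training time, as quoted above

gpu_days = GPUS * WEEKS * 7   # total GPU-days
gpu_hours = gpu_days * 24     # total GPU-hours

# "About a quarter" of Stable Diffusion's budget implies roughly 4x as much.
# This is an inference from the ratio, not a number stated in this summary.
implied_sd_gpu_days = gpu_days * 4

print(f"StyleGAN-T: {gpu_days} GPU-days ({gpu_hours} GPU-hours)")
print(f"Implied Stable Diffusion budget: roughly {implied_sd_gpu_days} GPU-days")
```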
Quantitative comparison to state-of-the-art
- We compare the performance of our model to the state-of-the-art quantitatively at 64x64 pixel output resolution.
- GANs can match or even beat current DMs in large-scale text-to-image synthesis at low resolution.
- A powerful superresolution model is crucial, as FID almost doubles in StyleGAN-T when moving from 64x64 to 256x256.
Evaluating variation vs. text alignment
- FID-CLIP score curves are reported in Fig. 5
- StyleGAN-T is compared to a strong and fast DM baseline
- FID-CLIP score curves are evaluated in Fig. 6
- Text encoder is fine-tuned to improve CLIP score without compromising FID
- StyleGAN-T can generate a wide variety of styles as shown in Fig. 8
- Subjects tend to be aligned for a fixed latent z
Limitations and future work
- DALL·E 2 and StyleGAN-T struggle to bind attributes to objects and produce coherent text in images
- CLIP loss is important for good text alignment, but too much guidance strength can cause image artifacts
- Truncation improves text alignment, but alternative methods might further improve results
- Future work could include improved super-resolution stages and personalizing GANs
A. Configuration details
- Two training configurations used: lightweight and full
- Lightweight configuration uses CC12M dataset at 64x64 resolution without progressive growing
- Full configuration uses union of several datasets, 250M text-image pairs, and progressive growing
- Training budget spent on resolutions up to 64x64
- Training schedules listed in Table 5
B. Truncation grids
- Quality vs. speed in large-scale text-to-image synthesis
- Generates samples at 10 FPS on NVIDIA A100
- Allows for smooth interpolations between prompts
- Generator architecture is related to StyleGAN2
- Discriminator processes intermediate tokens of a DINO-trained vision transformer
- Text prompt is embedded using CLIP
- Truncation improves text alignment
- Compares FID-CLIP score curves of StyleGAN-T, distilled Stable Diffusion, and eDiff-I
- Latent manipulation and styles can be generated
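Truncation and interpolation between prompts can both be expressed as linear interpolation in latent space. This is a minimal sketch of the generic StyleGAN-style truncation trick, assuming the latent is pulled toward some mean latent with a strength psi. Which mean is used (for example, one conditioned on the prompt) is an assumption here, and the function names and toy vectors are hypothetical.

```python
def lerp(a, b, t):
    """Linear interpolation: returns a at t=0 and b at t=1."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def truncate(w, w_mean, psi):
    """Pull latent w toward w_mean; psi=1 keeps w, psi=0 collapses to the mean."""
    return lerp(w_mean, w, psi)


def interpolate_latents(w_a, w_b, steps):
    """Evenly spaced latents from w_a to w_b, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [lerp(w_a, w_b, i / (steps - 1)) for i in range(steps)]


# Toy latents: truncating halfway toward a zero mean halves the vector.
w = [2.0, 4.0]
w_mean = [0.0, 0.0]
w_truncated = truncate(w, w_mean, psi=0.5)

# Three-step path between two toy latents, e.g. for two different prompts.
path = interpolate_latents([0.0, 0.0], [2.0, 4.0], steps=3)
```

Sweeping psi from 1 toward 0 trades variation for fidelity, which is the axis a truncation grid lays out.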