Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Text-to-image synthesis has been successful and captured public imagination
  • GANs used to be the favored architecture for generative image models
  • Auto-regressive and diffusion models have become the new standard
  • Can GANs be scaled up to benefit from large datasets?
  • GigaGAN is a new GAN architecture that is faster and can synthesize high-resolution images
  • GigaGAN supports latent space editing applications

Paper Content

Introduction

  • Recently released models have achieved high levels of image quality and model flexibility.
  • Iterative methods enable stable training but are computationally expensive.
  • GANs generate images through a single forward pass and are inherently efficient.
  • Scaling GANs requires careful tuning and training considerations due to instabilities.
  • StyleGAN2 is scaled up and several key issues are identified.
  • Techniques are proposed to stabilize training while increasing model capacity.
  • Multi-scale training improves image-text alignment and low-frequency details.
  • GigaGAN is 36x larger than StyleGAN2 and 6x larger than StyleGAN-XL.
  • GigaGAN is orders of magnitude faster and can generate ultra high-res images.
  • GigaGAN has a controllable latent vector space for controllable image synthesis.
  • GigaGAN is the first GAN-based method to successfully train a billion-scale model.
  • Text-to-image synthesis is a challenging task
  • Earlier works used text-conditional GANs on specific domains and datasets
  • Recent works have shown improvement on an open-world of arbitrary text descriptions
  • Sampling high-quality images requires time-consuming iterative processes
  • GANs have been used for various computer vision and graphics applications
  • GANs have been deployed to text-to-image synthesis
  • Existing GAN-based text-to-image synthesis models are trained on relatively small datasets
  • Super-resolution is used to reduce memory and running time for large-scale text-to-image models
  • Traditional super-resolution techniques aim to faithfully reproduce low-resolution inputs
  • Our upsamplers for large-scale models need to perform larger upsampling factors while leveraging the input text prompt

Method

  • We train a generator to predict an image given a latent code and text-conditioning signal.
  • We use a discriminator to judge the realism of the generated image.
  • Current limitation of GANs stems from reliance on convolutional layers.
  • We seek to inject more expressivity into our parameterization by dynamically selecting convolution filters and capturing long-range dependence via attention mechanism.
  • We introduce a new GAN-based upsampler model to improve inference quality and speed.

Modeling complex contextual interaction

  • Baseline StyleGAN generator composed of two networks
  • Mapping network maps inputs into a “style” vector
  • Synthesis network uses style vector to map constant tensor to output image
  • Sample-adaptive kernel selection creates convolution kernels on-the-fly based on text conditioning
  • Interleaving attention with convolution to incorporate long-range relationships
  • L2-distance used instead of dot product for attention logits to promote Lipschitz continuity
  • Cross-attention mechanism to attend to individual word embeddings

Generator design

  • Text and latent-code conditioning is used to extract text embedding from the prompt.
  • Text embedding is tokenized and processed with additional attention layers.
  • Synthesis network consists of upsampling convolutional layers with adaptive kernel selection and attention layers.
  • Generator outputs a multi-scale image pyramid with 5 levels.
  • Training details are included in Appendix A.1.

Discriminator design

  • Discriminator consists of two branches for processing text and images
  • Introduce a new way of making predictions on multiple scales
  • Text branch processes text similar to generator
  • Image branch receives an image pyramid and makes independent predictions for each image scale
  • Predictions are made at all subsequent scales of the downsampling layers
  • Text descriptor extracted from text c
  • Early, low-resolution layers of the generator become inactive
  • Model architecture redesigned to provide training signals across multiple scales
  • Discriminator produces L(L-1) 2 predictions
  • Text and image features compared using function ψ
  • CLIP and Vision-Aided GAN losses used to improve stability

Gan-based upsampler

  • GigaGAN framework can be extended to train a text-conditioned superresolution model
  • Model is rearranged to an asymmetric U-Net architecture
  • Model is trained with same losses as base model, plus LPIPS Perceptual Loss
  • Gaussian noise augmentation is applied during training and inference
  • GigaGAN is more effective for superresolution task than diffusion-based models

Experiments

  • Systematic, controlled evaluation of large-scale text-to-image synthesis tasks is difficult
  • Comparing model to recent text-to-image models
  • Evaluating model on ImageNet class-conditional generation
  • Using Fréchet Inception Distance (FID) and CLIP score for quantitative evaluation
  • Five different experiments

Training and evaluation details

  • Implemented GigaGAN using StudioGAN Py-Torch library
  • Followed standard FID evaluation protocol with anti-aliasing bicubic resize function
  • Trained models on union of LAION2B-en and COYO-700M datasets
  • Preprocessed image-text pairs based on CLIP score, image resolution, and aesthetic score
  • Used CLIP ViT-L/14 for pre-trained text encoder and OpenCLIP ViT-G/14 for CLIP score calculation
  • Generated four outputs using prompts “a X on tabletop”
  • Re-computed text embeddings and style codes using new prompts
  • Applied them to second half layers of generator for layout-preserving fine style control
  • Cross-attention mechanism localized style to object of interest
  • Trained and evaluated models on A100 GPUs

Effectiveness of proposed components

  • Baseline set up by adding text-conditioning to StyleGAN2 and tuning configuration based on StyleGAN-XL
  • Increasing model size does not improve FID and CLIP scores
  • Adding components one by one improves performance
  • Final formulation is more scalable, higher-capacity version achieves better performance

Text-to-image synthesis

  • Trained a larger model with increased capacity
  • Compared performance to various text-to-image generative models
  • Achieved lower FID than other models
  • Generated promising images from arbitrary text prompts

Comparison with distilled diffusion models

  • GigaGAN is 20 times faster than other diffusion models
  • SDdistilled is an effort to improve inference speed
  • GigaGAN is faster than SDdistilled and has better FID and CLIP scores
  • FID and CLIP scores are reported on COCO2017 dataset with images resized to 512px

Super-resolution for large-scale image synthesis

  • GigaGAN is evaluated in two parts
  • GigaGAN is compared to several commonly-used upsamplers
  • GigaGAN outperforms other upsamplers in realism, text alignment, and closeness to ground truth
  • GigaGAN achieves best IS and FID scores with single feedforward pass

Controllable image synthesis

  • StyleGANs have a linear latent space called W-space for image manipulation
  • GigaGAN also has a disentangled W-space
  • GigaGAN has another latent space of text embedding t
  • Text embedding t and style code w can be used to control style manipulation

Discussion and limitations

  • GANs can scale up to model sizes that enable text-to-image synthesis
  • Visual quality of results is not yet comparable to production-grade models
  • GigaGAN architecture opens up a new design space for large-scale generative models
  • Performance expected to improve with larger models

C. text-to-image synthesis results

  • GAN model can use truncation trick to trade diversity for fidelity
  • Truncation trick is straightforward to apply for unconditional case, less clear for text-conditional image generation
  • Interpolating latent vector towards mean of entire distribution and mean of w conditioned on text prompt produces desirable results
  • Truncation has similar effect to guidance technique of diffusion models