Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Text-to-image synthesis has been successful and captured public imagination
GANs used to be the favored architecture for generative image models
Auto-regressive and diffusion models have become the new standard
Can GANs be scaled up to benefit from large datasets?
GigaGAN is a new GAN architecture that is faster and can synthesize high-resolution images
GigaGAN supports latent space editing applications

Paper Content

Introduction

Recently released models have achieved high levels of image quality and model flexibility.
Iterative methods enable stable training but are computationally expensive.
GANs generate images through a single forward pass and are inherently efficient.
Scaling GANs requires careful tuning and training considerations due to instabilities.
StyleGAN2 is scaled up and several key issues are identified.
Techniques are proposed to stabilize training while increasing model capacity.
Multi-scale training improves image-text alignment and low-frequency details.
GigaGAN is 36x larger than StyleGAN2 and 6x larger than StyleGAN-XL.
GigaGAN is orders of magnitude faster and can generate ultra high-res images.
GigaGAN has a controllable latent vector space for controllable image synthesis.
GigaGAN is the first GAN-based method to successfully train a billion-scale model.

Related works

Text-to-image synthesis is a challenging task
Earlier works used text-conditional GANs on specific domains and datasets
Recent works have shown improvement on an open-world of arbitrary text descriptions
Sampling high-quality images requires time-consuming iterative processes
GANs have been used for various computer vision and graphics applications
GANs have been deployed to text-to-image synthesis
Existing GAN-based text-to-image synthesis models are trained on relatively small datasets
Super-resolution is used to reduce memory and running time for large-scale text-to-image models
Traditional super-resolution techniques aim to faithfully reproduce low-resolution inputs
Our upsamplers for large-scale models need to perform larger upsampling factors while leveraging the input text prompt

Method

We train a generator to predict an image given a latent code and text-conditioning signal.
We use a discriminator to judge the realism of the generated image.
A current limitation of GANs stems from their reliance on convolutional layers.
We seek to inject more expressivity into our parameterization by dynamically selecting convolution filters and capturing long-range dependence via an attention mechanism.
We introduce a new GAN-based upsampler model to improve inference quality and speed.
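
The idea of dynamically selecting convolution filters can be illustrated with a small sketch. It assumes a bank of N candidate kernels and a conditioning vector that is mapped to N softmax weights; the blended kernel is the weighted sum of the bank. The names `select_kernel`, `w_affine`, and `b_affine`, and all shapes, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def select_kernel(bank, cond, w_affine, b_affine):
    """Blend a bank of N kernels into one kernel using weights predicted from cond.

    bank:     (N, C_out, C_in, k, k) filter bank (learnable in a real model)
    cond:     (D,) conditioning vector, e.g. derived from text and latent code
    w_affine: (D, N) and b_affine: (N,) hypothetical affine layer producing logits
    """
    weights = softmax(cond @ w_affine + b_affine)   # (N,), sums to 1
    kernel = np.tensordot(weights, bank, axes=1)    # (C_out, C_in, k, k)
    return kernel, weights

# Toy example with random values (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
bank = rng.normal(size=(4, 8, 3, 3, 3))
cond = rng.normal(size=16)
w_affine = rng.normal(size=(16, 4))
b_affine = np.zeros(4)
kernel, weights = select_kernel(bank, cond, w_affine, b_affine)
```

Because the weights depend on the conditioning vector, two different prompts would yield two different convolution kernels from the same bank.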

Modeling complex contextual interaction

Baseline StyleGAN generator composed of two networks
Mapping network maps inputs into a “style” vector
Synthesis network uses style vector to map constant tensor to output image
Sample-adaptive kernel selection creates convolution kernels on-the-fly based on text conditioning
Interleaving attention with convolution to incorporate long-range relationships
L2-distance used instead of dot product for attention logits to promote Lipschitz continuity
Cross-attention mechanism to attend to individual word embeddings
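
A minimal sketch of attention with L2-distance logits, as mentioned in the list above, might look as follows. The logits are the negative squared distances between queries and keys; the scaling by the square root of the feature dimension and the function name `l2_attention` are assumptions made for illustration.

```python
import numpy as np

def l2_attention(q, k, v):
    """Attention whose logits are negative squared L2 distances.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    """
    d = q.shape[-1]
    # Squared distance ||q_i - k_j||^2 for every query/key pair: (n_q, n_k)
    sq_dist = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dist / np.sqrt(d)                  # assumed scaling
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)        # rows sum to 1
    return attn @ v, attn

# Self-attention on random tokens with queries and keys tied to the same input.
x = np.random.default_rng(0).normal(size=(5, 8))
out, attn = l2_attention(x, x, x)
```

With tied queries and keys, each token is at distance zero from itself, so it always receives the largest weight in its own row.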

Generator design

For text and latent-code conditioning, a pre-trained text encoder extracts a text embedding from the prompt.
The prompt is tokenized, embedded, and then processed with additional attention layers.
Synthesis network consists of upsampling convolutional layers with adaptive kernel selection and attention layers.
Generator outputs a multi-scale image pyramid with 5 levels.
Training details are included in Appendix A.1.
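
To make the shape of a 5-level image pyramid concrete, the sketch below builds one from a single image by repeated 2x average pooling. In the generator the levels are produced by the network itself; this only shows the kind of pyramid such a generator could be compared against, for example one built from a real image. The 64-pixel base resolution and the factor of 2 between levels are assumptions.

```python
import numpy as np

def downsample2x(img):
    """Halve height and width by averaging each 2x2 block. img: (H, W, C)."""
    h, w, c = img.shape
    return img.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def build_pyramid(img, levels=5):
    """Return [full-res, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

image = np.random.default_rng(0).random((64, 64, 3))  # placeholder image
pyramid = build_pyramid(image)
sizes = [level.shape[0] for level in pyramid]
```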

Discriminator design

Discriminator consists of two branches for processing text and images
Introduce a new way of making predictions on multiple scales
Text branch processes text similar to generator
Image branch receives an image pyramid and makes independent predictions for each image scale
Predictions are made at all subsequent scales of the downsampling layers
Text descriptor extracted from text c
Early, low-resolution layers of the generator become inactive
Model architecture redesigned to provide training signals across multiple scales
Discriminator produces L(L-1)/2 predictions
Text and image features compared using function ψ
CLIP and Vision-Aided GAN losses used to improve stability

GAN-based upsampler

GigaGAN framework can be extended to train a text-conditioned super-resolution model
Model is rearranged to an asymmetric U-Net architecture
Model is trained with same losses as base model, plus LPIPS Perceptual Loss
Gaussian noise augmentation is applied during training and inference
GigaGAN is more effective on the super-resolution task than diffusion-based models
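
The Gaussian noise augmentation mentioned above can be sketched in a few lines. This assumes the noise is added to the low-resolution input of the upsampler and that the same function is used at training and inference time; the noise level `sigma` is a hypothetical value, since none is given here.

```python
import numpy as np

def add_gaussian_noise(low_res, sigma, rng):
    """Return low_res plus zero-mean Gaussian noise with standard deviation sigma."""
    return low_res + rng.normal(0.0, sigma, size=low_res.shape)

rng = np.random.default_rng(0)
low_res = rng.random((64, 64, 3))                    # placeholder low-resolution image
noisy = add_gaussian_noise(low_res, sigma=0.1, rng=rng)
unchanged = add_gaussian_noise(low_res, sigma=0.0, rng=rng)
```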

Experiments

Systematic, controlled evaluation of large-scale text-to-image synthesis tasks is difficult
Comparing model to recent text-to-image models
Evaluating model on ImageNet class-conditional generation
Using Fréchet Inception Distance (FID) and CLIP score for quantitative evaluation
Five different experiments
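
At its core, a CLIP-style score compares an image embedding with a text embedding by cosine similarity. The sketch below shows only that comparison on placeholder vectors; in practice the embeddings would come from a CLIP image encoder and text encoder, and the exact scoring convention used in the paper is not stated here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

# Placeholder embeddings standing in for encoder outputs.
image_emb = np.array([1.0, 2.0, 3.0])
text_emb = np.array([2.0, 4.0, 6.0])      # same direction as image_emb
other_emb = np.array([3.0, 0.0, -1.0])    # orthogonal to image_emb

aligned = cosine_similarity(image_emb, text_emb)
unrelated = cosine_similarity(image_emb, other_emb)
```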

Training and evaluation details

Implemented GigaGAN using the StudioGAN PyTorch library
Followed standard FID evaluation protocol with anti-aliasing bicubic resize function
Trained models on union of LAION2B-en and COYO-700M datasets
Preprocessed image-text pairs based on CLIP score, image resolution, and aesthetic score
Used CLIP ViT-L/14 for pre-trained text encoder and OpenCLIP ViT-G/14 for CLIP score calculation
Generated four outputs using prompts “a X on tabletop”
Re-computed text embeddings and style codes using new prompts
Applied them to second half layers of generator for layout-preserving fine style control
Cross-attention mechanism localized style to object of interest
Trained and evaluated models on A100 GPUs
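
The layout-preserving style control described above, where new style codes are applied only to the second half of the generator layers, can be sketched as a simple per-layer swap. The function name `mix_styles`, the layer count of 6, and the string placeholders for style codes are all hypothetical.

```python
def mix_styles(base_styles, new_styles):
    """Keep the first half of the per-layer style codes and swap in new ones for the rest."""
    if len(base_styles) != len(new_styles):
        raise ValueError("both lists must hold one style code per layer")
    split = len(base_styles) // 2
    return base_styles[:split] + new_styles[split:]

# One placeholder style code per generator layer.
base = [f"base_layer_{i}" for i in range(6)]
new = [f"new_layer_{i}" for i in range(6)]
mixed = mix_styles(base, new)
```

The early layers, which tend to determine layout, keep the original codes, while the later layers receive the codes computed from the new prompt.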

Effectiveness of proposed components

Baseline set up by adding text-conditioning to StyleGAN2 and tuning configuration based on StyleGAN-XL
Increasing model size does not improve FID and CLIP scores
Adding components one by one improves performance
The final formulation is more scalable, and its higher-capacity version achieves better performance

Text-to-image synthesis

Trained a larger model with increased capacity
Compared performance to various text-to-image generative models
Achieved lower FID than other models
Generated promising images from arbitrary text prompts

Comparison with distilled diffusion models

GigaGAN is 20 times faster than diffusion models
SD-distilled is a distilled diffusion model built to improve inference speed
GigaGAN is faster than SD-distilled and has better FID and CLIP scores
FID and CLIP scores are reported on COCO2017 dataset with images resized to 512px

Super-resolution for large-scale image synthesis

GigaGAN is evaluated in two parts
GigaGAN is compared to several commonly-used upsamplers
GigaGAN outperforms other upsamplers in realism, text alignment, and closeness to ground truth
GigaGAN achieves best IS and FID scores with single feedforward pass

Controllable image synthesis

StyleGANs have a linear latent space called W-space for image manipulation
GigaGAN also has a disentangled W-space
GigaGAN has another latent space of text embedding t
Text embedding t and style code w can be used to control style manipulation

Discussion and limitations

GANs can scale up to model sizes that enable text-to-image synthesis
Visual quality of results is not yet comparable to production-grade models
GigaGAN architecture opens up a new design space for large-scale generative models
Performance expected to improve with larger models

C. Text-to-image synthesis results

GAN model can use truncation trick to trade diversity for fidelity
Truncation trick is straightforward to apply for unconditional case, less clear for text-conditional image generation
Interpolating the latent vector toward both the mean of the entire distribution and the mean of w conditioned on the text prompt produces desirable results
Truncation has similar effect to guidance technique of diffusion models
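
One way to picture this kind of truncation is as two linear interpolations: first blend the global mean with the prompt-conditioned mean, then pull the sampled latent toward that blend. The exact formula is not given in this summary, so the sketch below is an assumption; `psi` and `mix` are hypothetical parameters.

```python
import numpy as np

def lerp(a, b, t):
    """Linear interpolation: returns a at t=0 and b at t=1."""
    return a + t * (b - a)

def truncate(w, w_mean, w_mean_prompt, psi=0.7, mix=0.5):
    """Pull w toward a blend of the global mean and the prompt-conditioned mean.

    psi=1 leaves w unchanged (full diversity); psi=0 collapses w onto the blend.
    """
    target = lerp(w_mean, w_mean_prompt, mix)
    return lerp(target, w, psi)

rng = np.random.default_rng(0)
w = rng.normal(size=8)               # sampled latent
w_mean = np.zeros(8)                 # placeholder global mean
w_mean_prompt = np.full(8, 0.5)      # placeholder prompt-conditioned mean
w_trunc = truncate(w, w_mean, w_mean_prompt)
```

Lower values of `psi` trade diversity for fidelity, which is the role the guidance scale plays in diffusion models.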
