Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Text-to-image synthesis has been successful and captured public imagination
- GANs used to be the favored architecture for generative image models
- Auto-regressive and diffusion models have become the new standard
- Can GANs be scaled up to benefit from large datasets?
- GigaGAN is a new GAN architecture that is faster and can synthesize high-resolution images
- GigaGAN supports latent space editing applications
Paper Content
Introduction
- Recently released models have achieved high levels of image quality and model flexibility.
- Iterative methods enable stable training but are computationally expensive.
- GANs generate images through a single forward pass and are inherently efficient.
- Scaling GANs requires careful tuning and training considerations due to instabilities.
- StyleGAN2 is scaled up and several key issues are identified.
- Techniques are proposed to stabilize training while increasing model capacity.
- Multi-scale training improves image-text alignment and low-frequency details.
- GigaGAN is 36x larger than StyleGAN2 and 6x larger than StyleGAN-XL.
- GigaGAN is orders of magnitude faster and can generate ultra high-res images.
- GigaGAN has a controllable latent vector space for controllable image synthesis.
- GigaGAN is the first GAN-based method to successfully train a billion-scale model.
Related works
- Text-to-image synthesis is a challenging task
- Earlier works used text-conditional GANs on specific domains and datasets
- Recent works have shown improvement on an open-world of arbitrary text descriptions
- Sampling high-quality images requires time-consuming iterative processes
- GANs have been used for various computer vision and graphics applications
- GANs have been deployed to text-to-image synthesis
- Existing GAN-based text-to-image synthesis models are trained on relatively small datasets
- Super-resolution is used to reduce memory and running time for large-scale text-to-image models
- Traditional super-resolution techniques aim to faithfully reproduce low-resolution inputs
- Our upsamplers for large-scale models need to perform larger upsampling factors while leveraging the input text prompt
Method
- We train a generator to predict an image given a latent code and text-conditioning signal.
- We use a discriminator to judge the realism of the generated image.
- Current limitation of GANs stems from reliance on convolutional layers.
- We seek to inject more expressivity into our parameterization by dynamically selecting convolution filters and capturing long-range dependence via attention mechanism.
- We introduce a new GAN-based upsampler model to improve inference quality and speed.
Modeling complex contextual interaction
- Baseline StyleGAN generator composed of two networks
- Mapping network maps inputs into a “style” vector
- Synthesis network uses style vector to map constant tensor to output image
- Sample-adaptive kernel selection creates convolution kernels on-the-fly based on text conditioning
- Interleaving attention with convolution to incorporate long-range relationships
- L2-distance used instead of dot product for attention logits to promote Lipschitz continuity
- Cross-attention mechanism to attend to individual word embeddings
Generator design
- Text and latent-code conditioning is used to extract text embedding from the prompt.
- Text embedding is tokenized and processed with additional attention layers.
- Synthesis network consists of upsampling convolutional layers with adaptive kernel selection and attention layers.
- Generator outputs a multi-scale image pyramid with 5 levels.
- Training details are included in Appendix A.1.
Discriminator design
- Discriminator consists of two branches for processing text and images
- Introduce a new way of making predictions on multiple scales
- Text branch processes text similar to generator
- Image branch receives an image pyramid and makes independent predictions for each image scale
- Predictions are made at all subsequent scales of the downsampling layers
- Text descriptor extracted from text c
- Early, low-resolution layers of the generator become inactive
- Model architecture redesigned to provide training signals across multiple scales
- Discriminator produces L(L-1) 2 predictions
- Text and image features compared using function ψ
- CLIP and Vision-Aided GAN losses used to improve stability
Gan-based upsampler
- GigaGAN framework can be extended to train a text-conditioned superresolution model
- Model is rearranged to an asymmetric U-Net architecture
- Model is trained with same losses as base model, plus LPIPS Perceptual Loss
- Gaussian noise augmentation is applied during training and inference
- GigaGAN is more effective for superresolution task than diffusion-based models
Experiments
- Systematic, controlled evaluation of large-scale text-to-image synthesis tasks is difficult
- Comparing model to recent text-to-image models
- Evaluating model on ImageNet class-conditional generation
- Using Fréchet Inception Distance (FID) and CLIP score for quantitative evaluation
- Five different experiments
Training and evaluation details
- Implemented GigaGAN using StudioGAN Py-Torch library
- Followed standard FID evaluation protocol with anti-aliasing bicubic resize function
- Trained models on union of LAION2B-en and COYO-700M datasets
- Preprocessed image-text pairs based on CLIP score, image resolution, and aesthetic score
- Used CLIP ViT-L/14 for pre-trained text encoder and OpenCLIP ViT-G/14 for CLIP score calculation
- Generated four outputs using prompts “a X on tabletop”
- Re-computed text embeddings and style codes using new prompts
- Applied them to second half layers of generator for layout-preserving fine style control
- Cross-attention mechanism localized style to object of interest
- Trained and evaluated models on A100 GPUs
Effectiveness of proposed components
- Baseline set up by adding text-conditioning to StyleGAN2 and tuning configuration based on StyleGAN-XL
- Increasing model size does not improve FID and CLIP scores
- Adding components one by one improves performance
- Final formulation is more scalable, higher-capacity version achieves better performance
Text-to-image synthesis
- Trained a larger model with increased capacity
- Compared performance to various text-to-image generative models
- Achieved lower FID than other models
- Generated promising images from arbitrary text prompts
Comparison with distilled diffusion models
- GigaGAN is 20 times faster than other diffusion models
- SDdistilled is an effort to improve inference speed
- GigaGAN is faster than SDdistilled and has better FID and CLIP scores
- FID and CLIP scores are reported on COCO2017 dataset with images resized to 512px
Super-resolution for large-scale image synthesis
- GigaGAN is evaluated in two parts
- GigaGAN is compared to several commonly-used upsamplers
- GigaGAN outperforms other upsamplers in realism, text alignment, and closeness to ground truth
- GigaGAN achieves best IS and FID scores with single feedforward pass
Controllable image synthesis
- StyleGANs have a linear latent space called W-space for image manipulation
- GigaGAN also has a disentangled W-space
- GigaGAN has another latent space of text embedding t
- Text embedding t and style code w can be used to control style manipulation
Discussion and limitations
- GANs can scale up to model sizes that enable text-to-image synthesis
- Visual quality of results is not yet comparable to production-grade models
- GigaGAN architecture opens up a new design space for large-scale generative models
- Performance expected to improve with larger models
C. text-to-image synthesis results
- GAN model can use truncation trick to trade diversity for fidelity
- Truncation trick is straightforward to apply for unconditional case, less clear for text-conditional image generation
- Interpolating latent vector towards mean of entire distribution and mean of w conditioned on text prompt produces desirable results
- Truncation has similar effect to guidance technique of diffusion models