Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Proposed a non-autoregressive generative model for high-quality image synthesis
  • Leveraged the hierarchical nature of images to encode visual tokens into stratified levels
  • Improved NAR generation and outperformed existing DMs and AR methods
  • Achieved FID scores of 3.96 at 256*256 resolution on ImageNet without guidance
  • Achieved FID of 3.36 and IS of 259.3 when equipped with classifier-free guidance
  • Showed compelling properties on applications including domain transfer

Paper Content

Introduction

  • Image generation has achieved significant progress in content creation, editing and other applications.
  • Leading methods, such as diffusion models and autoregressive transformers, have surpassed prior works based on GANs.
  • Autoregressive and diffusion models are compute-demanding and have slow sampling speeds.
  • Non-autoregressive transformers have been explored and demonstrated promising generation quality and efficiency.
  • Vector Quantization reduces computational complexity but trades reconstruction quality for efficiency.
  • Proposed stratified nonautoregressive model motivated by the actual human painting process.
  • Cross-scale Masked Token Modeling strategy used to train top and bottom-level modules.
  • Proposed method significantly out-performs existing state-of-the-art AR and DMs in on the ImageNet benchmark.

Non-autoregressive image generation

  • Computationally infeasible to directly model pixel dependencies for high-resolution images
  • Two-stage approach: visual tokenization and masked token modeling with transformers
  • Visual tokenization: encoder, quantizer, decoder
  • Masked token modeling with transformers: predict masked image tokens
  • Iterative refinement during inference

Problems of nar image generation

  • NAR methods are faster than AR and DMs but have lower sample quality
  • AR and DMs have improved by scaling up architectures and using mild downsampling rates
  • Scaling up NAR models and sequences is not enough to close the performance gap

Stratified image transformer

  • Novel framework called StraIT to improve NAR model for high quality image synthesis
  • Tokenization step and image stratification to enable fine-grained control
  • Strategy of decoupled non-autoregressive modeling

Image stratification via tokenization

  • Generative transformers process vision contents into a sequence of discrete tokens
  • Hierarchies of vision contents are neglected in this style
  • Leverage image hierarchy for sequence modeling to achieve better generation results
  • Decompose image into two stratified representations to reduce difficulty of modeling long sequences
  • Training objectives include perceptual loss, adversarial loss, and commitment loss

Cross-scale masked token modeling

  • Represent image with two dependent sequences
  • Propose to learn two decoupled transformers
  • Top-level transformer: generate tokens from scratch
  • Bottom-level transformer: predict tokens given top-level inputs
  • Train transformers with Cross-scale Masked Token Modeling

Inference with strait

  • Stratified Iterative Decoding is used to generate images
  • Decoding process is done in a top-down manner
  • Top-level transformer predicts all tokens from a blank canvas
  • Bottom-level transformer performs conditional iterative decoding on N

Experiments

  • Evaluated performance of StraIT on image generation
  • Quantitative and qualitative evaluations on standard class-conditional image generation on ImageNet
  • Ablation studies to understand stratified modeling process and advantages over different variants
  • Analyzed intriguing property of system and showed compelling applications

Experimental setup

  • Trained tokenizer VQGAN2-R with 8192 tokens using 256x256 images from ImageNet
  • Downsampled images by factors of 16 and 8
  • Used same codebook to train model on 512x512
  • Two transformer architectures: top-level with 48 layers, 5120 intermediate size, 32 attention heads; bottom-level with 16 layers, 4096 intermediate size, 24 attention heads
  • Trained models with AdamW, linear warmup, cosine decaying schedule, label smoothing, dropout
  • Trained MaskGIT with 1.3B parameters
  • Compared models without guidance and with classifier-free guidance

Main results on image synthesis

  • Our method significantly outperforms previous state-of-the-arts in both Fréchet Inception Distance (FID) and Inception Score (IS).
  • Non-autoregressive model outperforms autoregressive and diffusion models with fewer steps.
  • Trade-off between precision and recall is maintained.
  • Best FID and IS on ImageNet generation on record without classifiers or rejection sampling.
  • Generate higher resolution (512x512) in a purely non-autoregressive manner.
  • User preference study shows our method is more preferred than all other competing methods in terms of quality and diversity.
  • Ablation study shows aggressive spatial compression eliminates high frequency in images.
  • Stratified tokenizer extracts hierarchical and interlinked visual token sequences.
  • Decoding steps study shows importance of proper tokenizers for generative vision transformer.
  • Stratified over cascade comparison shows validity of adopting guidance for longer sequences from shorter ones.
  • VQGAN2-R behaves distinctively with VQGAN2-C.
  • Semantic domain transfer enabled by decoupled modeling.
  • Bottom-level re-prediction benefits domain transfer.
  • Visual tokenization converts image into sequence of discrete codes.
  • Non-autoregressive models have nature of learning to infill.
  • Flexible image editing in simple feedforward passes.