Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a non-autoregressive generative model for high-quality image synthesis
Leveraged the hierarchical nature of images to encode visual tokens into stratified levels
Improved NAR generation and outperformed existing DMs and AR methods
Achieved FID scores of 3.96 at 256*256 resolution on ImageNet without guidance
Achieved FID of 3.36 and IS of 259.3 when equipped with classifier-free guidance
Showed compelling properties on applications including domain transfer

Paper Content

Introduction

Image generation has achieved significant progress in content creation, editing and other applications.
Leading methods, such as diffusion models and autoregressive transformers, have surpassed prior works based on GANs.
Autoregressive and diffusion models are compute-demanding and have slow sampling speeds.
Non-autoregressive transformers have been explored and demonstrated promising generation quality and efficiency.
Vector Quantization reduces computational complexity but trades reconstruction quality for efficiency.
Proposed stratified nonautoregressive model motivated by the actual human painting process.
Cross-scale Masked Token Modeling strategy used to train top and bottom-level modules.
Proposed method significantly out-performs existing state-of-the-art AR and DMs in on the ImageNet benchmark.

Non-autoregressive image generation

Computationally infeasible to directly model pixel dependencies for high-resolution images
Two-stage approach: visual tokenization and masked token modeling with transformers
Visual tokenization: encoder, quantizer, decoder
Masked token modeling with transformers: predict masked image tokens
Iterative refinement during inference

Problems of nar image generation

NAR methods are faster than AR and DMs but have lower sample quality
AR and DMs have improved by scaling up architectures and using mild downsampling rates
Scaling up NAR models and sequences is not enough to close the performance gap

Stratified image transformer

Novel framework called StraIT to improve NAR model for high quality image synthesis
Tokenization step and image stratification to enable fine-grained control
Strategy of decoupled non-autoregressive modeling

Image stratification via tokenization

Generative transformers process vision contents into a sequence of discrete tokens
Hierarchies of vision contents are neglected in this style
Leverage image hierarchy for sequence modeling to achieve better generation results
Decompose image into two stratified representations to reduce difficulty of modeling long sequences
Training objectives include perceptual loss, adversarial loss, and commitment loss

Cross-scale masked token modeling

Represent image with two dependent sequences
Propose to learn two decoupled transformers
Top-level transformer: generate tokens from scratch
Bottom-level transformer: predict tokens given top-level inputs
Train transformers with Cross-scale Masked Token Modeling

Inference with strait

Stratified Iterative Decoding is used to generate images
Decoding process is done in a top-down manner
Top-level transformer predicts all tokens from a blank canvas
Bottom-level transformer performs conditional iterative decoding on N

Experiments

Evaluated performance of StraIT on image generation
Quantitative and qualitative evaluations on standard class-conditional image generation on ImageNet
Ablation studies to understand stratified modeling process and advantages over different variants
Analyzed intriguing property of system and showed compelling applications

Experimental setup

Trained tokenizer VQGAN2-R with 8192 tokens using 256x256 images from ImageNet
Downsampled images by factors of 16 and 8
Used same codebook to train model on 512x512
Two transformer architectures: top-level with 48 layers, 5120 intermediate size, 32 attention heads; bottom-level with 16 layers, 4096 intermediate size, 24 attention heads
Trained models with AdamW, linear warmup, cosine decaying schedule, label smoothing, dropout
Trained MaskGIT with 1.3B parameters
Compared models without guidance and with classifier-free guidance

Main results on image synthesis

Our method significantly outperforms previous state-of-the-arts in both Fréchet Inception Distance (FID) and Inception Score (IS).
Non-autoregressive model outperforms autoregressive and diffusion models with fewer steps.
Trade-off between precision and recall is maintained.
Best FID and IS on ImageNet generation on record without classifiers or rejection sampling.
Generate higher resolution (512x512) in a purely non-autoregressive manner.
User preference study shows our method is more preferred than all other competing methods in terms of quality and diversity.
Ablation study shows aggressive spatial compression eliminates high frequency in images.
Stratified tokenizer extracts hierarchical and interlinked visual token sequences.
Decoding steps study shows importance of proper tokenizers for generative vision transformer.
Stratified over cascade comparison shows validity of adopting guidance for longer sequences from shorter ones.
VQGAN2-R behaves distinctively with VQGAN2-C.
Semantic domain transfer enabled by decoupled modeling.
Bottom-level re-prediction benefits domain transfer.
Visual tokenization converts image into sequence of discrete codes.
Non-autoregressive models have nature of learning to infill.
Flexible image editing in simple feedforward passes.

Link to paper#

Abstract#

Paper Content#

Introduction#

Non-autoregressive image generation#

Problems of nar image generation#

Stratified image transformer#

Image stratification via tokenization#

Cross-scale masked token modeling#

Inference with strait#

Experiments#

Experimental setup#

Main results on image synthesis#

Link to paper

Abstract

Paper Content

Introduction

Non-autoregressive image generation

Problems of nar image generation

Stratified image transformer

Image stratification via tokenization

Cross-scale masked token modeling

Inference with strait

Experiments

Experimental setup

Main results on image synthesis