Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a non-autoregressive generative model for high-quality image synthesis
- Leveraged the hierarchical nature of images to encode visual tokens into stratified levels
- Improved non-autoregressive (NAR) generation and outperformed existing diffusion models (DMs) and autoregressive (AR) methods
- Achieved an FID score of 3.96 at 256x256 resolution on ImageNet without guidance
- Achieved FID of 3.36 and IS of 259.3 when equipped with classifier-free guidance
- Showed compelling properties on applications including domain transfer
Paper Content
Introduction
- Image generation has achieved significant progress in content creation, editing and other applications.
- Leading methods, such as diffusion models and autoregressive transformers, have surpassed prior works based on GANs.
- Autoregressive and diffusion models are compute-demanding and have slow sampling speeds.
- Non-autoregressive transformers have been explored and demonstrated promising generation quality and efficiency.
- Vector Quantization reduces computational complexity but trades reconstruction quality for efficiency.
- Proposed a stratified non-autoregressive model motivated by the actual human painting process.
- Cross-scale Masked Token Modeling strategy used to train top and bottom-level modules.
- Proposed method significantly outperforms existing state-of-the-art AR models and DMs on the ImageNet benchmark.
Non-autoregressive image generation
- Computationally infeasible to directly model pixel dependencies for high-resolution images
- Two-stage approach: visual tokenization and masked token modeling with transformers
- Visual tokenization: encoder, quantizer, decoder
- Masked token modeling with transformers: predict masked image tokens
- Iterative refinement during inference
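
The two stages above can be illustrated with a minimal sketch: a nearest-neighbour codebook lookup standing in for the quantizer, followed by random masking to build a masked-token training example. The names `quantize`, `mask_tokens`, and `MASK_ID`, the toy 2-D codebook, and the 50% mask ratio are all assumptions made for illustration; they are not taken from the paper.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked position


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID; return inputs and targets."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
    # The model is trained to recover only the masked positions.
    targets = {i: tokens[i] for i in masked_positions}
    return inputs, targets


# Toy data (assumed): four 2-D encoder outputs and a four-entry codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.95, 0.9)]
tokens = quantize(features, codebook)
inputs, targets = mask_tokens(tokens, mask_ratio=0.5, rng=random.Random(0))
```

In a real tokenizer the features would come from a learned encoder and the codebook would be trained jointly; here both are fixed toy values so the lookup is easy to follow.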
Problems of NAR image generation
- NAR methods are faster than AR and DMs but have lower sample quality
- AR and DMs have improved by scaling up architectures and using mild downsampling rates
- Scaling up NAR models and sequences is not enough to close the performance gap
Stratified image transformer
- Novel framework, StraIT, to improve NAR models for high-quality image synthesis
- Tokenization step and image stratification to enable fine-grained control
- Strategy of decoupled non-autoregressive modeling
Image stratification via tokenization
- Generative transformers process vision contents into a sequence of discrete tokens
- Hierarchies of vision contents are neglected in this style
- Leverage image hierarchy for sequence modeling to achieve better generation results
- Decompose image into two stratified representations to reduce difficulty of modeling long sequences
- Training objectives include perceptual loss, adversarial loss, and commitment loss
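
For reference, the commitment term is commonly written as below in VQ-VAE-style tokenizers. The notation is an assumption borrowed from that standard formulation rather than from the paper summarized here: $z_e(x)$ is the encoder output, $e$ the selected codebook entry, $\operatorname{sg}[\cdot]$ the stop-gradient operator, and $\beta$, $\lambda_{\text{adv}}$ are hypothetical weighting coefficients. The second line is only one plausible way of combining the three listed objectives.

```latex
\mathcal{L}_{\text{commit}} = \beta \,\bigl\lVert z_e(x) - \operatorname{sg}[e] \bigr\rVert_2^2

\mathcal{L}_{\text{tokenizer}} = \mathcal{L}_{\text{perceptual}}
  + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}
  + \mathcal{L}_{\text{commit}}
```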
Cross-scale masked token modeling
- Represent image with two dependent sequences
- Propose to learn two decoupled transformers
- Top-level transformer: generate tokens from scratch
- Bottom-level transformer: predict tokens given top-level inputs
- Train transformers with Cross-scale Masked Token Modeling
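
One way to picture the bottom-level training input is sketched below: each bottom-level token, possibly masked, is paired with the top-level token that covers the same image region. The nearest-neighbour alignment, the function names, and the toy grid sizes are assumptions for illustration; the summary above does not specify how the top-level sequence is injected into the bottom-level transformer.

```python
import random

MASK_ID = -1  # hypothetical sentinel for a masked bottom-level token


def upsample_nearest(grid, factor):
    """Repeat each top-level token so the grid matches the bottom-level resolution."""
    out = []
    for row in grid:
        wide = [tok for tok in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def bottom_level_example(top_grid, bottom_grid, mask_ratio, rng):
    """Pair each (possibly masked) bottom token with its aligned top token."""
    factor = len(bottom_grid) // len(top_grid)
    cond = upsample_nearest(top_grid, factor)
    inputs, targets = [], []
    for r, row in enumerate(bottom_grid):
        for c, tok in enumerate(row):
            masked = rng.random() < mask_ratio
            inputs.append((MASK_ID if masked else tok, cond[r][c]))
            # The loss would be computed on masked positions only.
            targets.append(tok if masked else None)
    return inputs, targets


top = [[7, 8], [9, 10]]                                     # 2x2 top-level tokens
bottom = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 bottom-level tokens
inputs, targets = bottom_level_example(top, bottom, 0.5, random.Random(0))
```

The top-level transformer needs no such conditioning: under the same assumptions it would be trained on masked top-level tokens alone.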
Inference with StraIT
- Stratified Iterative Decoding is used to generate images
- Decoding process is done in a top-down manner
- Top-level transformer predicts all tokens from a blank canvas
- Bottom-level transformer performs conditional iterative decoding on N
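
The top-down procedure above can be sketched as two rounds of confidence-based iterative decoding. This is a minimal illustration, not the paper's implementation: the cosine unmasking schedule is borrowed from MaskGIT-style decoding, the predictors are random stubs standing in for trained transformers, and the step counts (12 and 6) and sequence lengths (256 and 1024) are hypothetical values chosen for the example.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel for a masked position


def iterative_decode(predict, length, steps, cond=None):
    """Fill a fully masked sequence over `steps` passes with a cosine schedule."""
    tokens = [MASK_ID] * length
    for t in range(steps):
        preds = predict(tokens, cond)  # one (token, confidence) pair per position
        # Number of positions that should remain masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * (t + 1) / steps))
        masked = [i for i, tok in enumerate(tokens) if tok == MASK_ID]
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def make_stub_predictor(vocab_size, seed):
    """Hypothetical stand-in for a trained transformer: random tokens and confidences."""
    rng = random.Random(seed)

    def predict(tokens, cond):
        return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]

    return predict


def stratified_decode(top_predict, bottom_predict, top_len, bottom_len,
                      top_steps, bottom_steps):
    """Decode top-level tokens first, then bottom-level tokens conditioned on them."""
    top = iterative_decode(top_predict, top_len, top_steps)
    bottom = iterative_decode(bottom_predict, bottom_len, bottom_steps, cond=top)
    return top, bottom


top, bottom = stratified_decode(
    make_stub_predictor(8192, seed=0), make_stub_predictor(8192, seed=1),
    top_len=256, bottom_len=1024, top_steps=12, bottom_steps=6,
)
```

Because the cosine term reaches zero at the last step, every position is committed by the end of each round, so neither sequence contains a masked token when it is handed to the decoder.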
Experiments
- Evaluated performance of StraIT on image generation
- Quantitative and qualitative evaluations on standard class-conditional image generation on ImageNet
- Ablation studies to understand stratified modeling process and advantages over different variants
- Analyzed an intriguing property of the system and showed compelling applications
Experimental setup
- Trained tokenizer VQGAN2-R with 8192 tokens using 256x256 images from ImageNet
- Downsampled images by factors of 16 and 8
- Used same codebook to train model on 512x512
- Two transformer architectures: top-level with 48 layers, 5120 intermediate size, 32 attention heads; bottom-level with 16 layers, 4096 intermediate size, 24 attention heads
- Trained models with AdamW, linear warmup, cosine decaying schedule, label smoothing, dropout
- Trained MaskGIT with 1.3B parameters
- Compared models without guidance and with classifier-free guidance
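
To make the effect of the two downsampling factors concrete, the short calculation below counts the tokens per image for each setting. It assumes square images and square token grids, and it assumes the same factors of 16 and 8 are reused at 512x512; the helper name `tokens_per_image` is hypothetical.

```python
def tokens_per_image(resolution, downsample_factor):
    """Number of tokens in a square latent grid for a square image."""
    side = resolution // downsample_factor
    return side * side


# Print resolution, downsampling factor, and resulting sequence length.
for res in (256, 512):
    for factor in (16, 8):
        print(res, factor, tokens_per_image(res, factor))
```

Halving the downsampling factor quadruples the sequence length, and since self-attention cost grows quadratically with sequence length, the longer sequence is considerably more expensive to model.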
Main results on image synthesis
- Our method significantly outperforms previous state-of-the-art methods in both Fréchet Inception Distance (FID) and Inception Score (IS).
- Non-autoregressive model outperforms autoregressive and diffusion models with fewer steps.
- Trade-off between precision and recall is maintained.
- Best FID and IS on ImageNet generation on record without classifiers or rejection sampling.
- Generate higher resolution (512x512) in a purely non-autoregressive manner.
- User preference study shows our method is preferred over all competing methods in terms of quality and diversity.
- Ablation study shows aggressive spatial compression eliminates high-frequency details in images.
- Stratified tokenizer extracts hierarchical and interlinked visual token sequences.
- Decoding steps study shows importance of proper tokenizers for generative vision transformer.
- Stratified over cascade comparison shows validity of adopting guidance for longer sequences from shorter ones.
- VQGAN2-R behaves differently from VQGAN2-C.
- Semantic domain transfer enabled by decoupled modeling.
- Bottom-level re-prediction benefits domain transfer.
- Visual tokenization converts image into sequence of discrete codes.
- Non-autoregressive models naturally learn to infill.
- Flexible image editing in simple feedforward passes.
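
For readers unfamiliar with the FID metric reported in the results above, its standard definition is given below. The symbols are the conventional ones and are not taken from this summary: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the mean and covariance of Inception features for real and generated images, respectively. Lower values indicate that the two feature distributions are closer.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```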