Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents X-Decoder, a model that can predict pixel-level segmentation and language tokens
  • Takes two types of queries as input: generic non-semantic and semantic queries
  • Enables seamless interactions across tasks at different granularities
  • Pretrained on limited segmentation data and millions of image-text pairs
  • Achieves state-of-the-art results on open-vocabulary segmentation and referring segmentation on 8 datasets

Paper Content

Introduction

  • Visual understanding at different levels of granularity is a longstanding problem in the vision community
  • Tasks span from image-level (e.g. classification, retrieval, captioning, VQA) to region-level (e.g. object detection, phrase grounding) to pixel-level (e.g. segmentation)
  • Recently, transformers have been used to build general-purpose models that can learn from and be applied to a diverse set of vision and vision-language tasks
  • Pixel-level understanding is one of the most important yet challenging problems
  • X-Decoder is a generalized decoder that unifies pixel-level and image-level vision-language understanding
  • X-Decoder takes two sets of queries as input and predicts two types of outputs
  • An end-to-end learning paradigm is proposed to learn from all granularities of supervision
  • X-Decoder supports a diversity of tasks in a zero-shot and open-vocabulary manner
  • X-Decoder exhibits strong transferability to a wide range of segmentation and VL tasks

From specialist to generalist models

Pixel-level understanding

  • Pixel-level image understanding is a long-standing problem
  • Three tasks for pixel-level understanding: semantic, instance, and panoptic segmentation
  • Models have evolved from CNN-based to transformer-based
  • MSeg manually merges datasets to train a more generalized model
  • Recent works transfer or distill knowledge from foundation models
  • Referring segmentation is open-vocabulary
  • X-Decoder is the first model to tackle generic and referring segmentation tasks in one model

Vision-language understanding

  • VL pretraining has been effective for various VL tasks
  • Models have evolved from transformer fusion models built on pre-extracted object features to end-to-end transformers
  • Image-text data at scale can help with visual representation learning
  • VL pretrained models can be extended to region-level tasks
  • Unified frameworks have been proposed to combine image-text pairs with region-level data
  • Trend from specialist models to generalist models

X-Decoder

Formulation

  • Model follows encoder-decoder architecture
  • Input image is encoded into features
  • Textual query is encoded into text queries by the text encoder
  • Visual features, textual queries and non-semantic queries are fed to X-Decoder
  • X-Decoder predicts pixel-level masks and token-level semantics
  • X-Decoder queries are divided into latent and text queries
  • Outputs are divided into pixel-level masks and semantic embeddings
  • Text encoder is used to encode textual corpus from all tasks
  • Image and text encoders are fully decoupled
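The formulation above can be written compactly in symbols. The notation is an assumption chosen for this sketch and does not appear in the summary: I is the input image, T the textual query, Z the image features, Q^h the latent (non-semantic) queries, Q^t the text queries, O^p the pixel-level masks and O^s the semantic embeddings.

```latex
\begin{align}
Z   &= \mathrm{Enc}_I(I) \\
Q^t &= \mathrm{Enc}_T(T) \\
\langle O^p, O^s \rangle &= \mathrm{XDec}\big(\langle Q^h, Q^t \rangle;\, Z\big)
\end{align}
```

Because the image and text encoders are decoupled, Z and Q^t can be computed independently before they meet in the decoder.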

Unification of tasks

  • X-Decoder can be used to unify different vision and vision-language tasks.
  • For generic segmentation, no textual queries are needed.
  • For referring segmentation, both latent and text queries are used.
  • Image-text retrieval uses only latent queries.
  • Image captioning and VQA use both latent and text queries.
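The query routing described above can be sketched as a small lookup table. This is a minimal illustration, not the authors' code: the names TaskConfig, TASKS and queries_for are hypothetical, and the "outputs" column (which decoder output each task reads) is an assumption that the list above does not state.

```python
from typing import NamedTuple, Tuple


class TaskConfig(NamedTuple):
    uses_latent_queries: bool
    uses_text_queries: bool
    outputs: Tuple[str, ...]  # assumed: which decoder outputs the task reads


# Query usage follows the list above; the outputs are an assumption.
TASKS = {
    "generic_segmentation": TaskConfig(True, False, ("masks", "semantics")),
    "referring_segmentation": TaskConfig(True, True, ("masks",)),
    "image_text_retrieval": TaskConfig(True, False, ("semantics",)),
    "image_captioning": TaskConfig(True, True, ("semantics",)),
    "vqa": TaskConfig(True, True, ("semantics",)),
}


def queries_for(task: str) -> list:
    """Return the names of the query sets fed to the decoder for a task."""
    cfg = TASKS[task]
    names = []
    if cfg.uses_latent_queries:
        names.append("latent")
    if cfg.uses_text_queries:
        names.append("text")
    return names
```

For example, queries_for("referring_segmentation") returns both query sets, while queries_for("generic_segmentation") returns only the latent ones.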

Unified architecture

  • Extract hierarchical visual features from L layers
  • Cross-attend visual features and perform self-attention among latent and text queries
  • Self-attention is designed to promote synergy between tasks
  • Output of X-Decoder is categorized into pixel-wise mask and semantic outputs
  • Text encoder consists of transformer layers
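The attention pattern above can be illustrated with a toy, single-head decoder layer in plain Python. This is a simplified sketch under several assumptions: there are no learned projections, normalization or feed-forward blocks, self-attention is unrestricted across all queries, and masks are scored as dot products between latent queries and per-pixel features. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention: one weighted mean of values per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (residual)."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def decoder_layer(latent, text, visual):
    """Cross-attend all queries to visual features, then self-attend."""
    queries = latent + text  # concatenate the two query sets
    queries = add(queries, attend(queries, visual, visual))    # cross-attention
    queries = add(queries, attend(queries, queries, queries))  # self-attention
    return queries[:len(latent)], queries[len(latent):]


def predict_masks(latent, pixel_features):
    """Assumed mask head: one logit per (latent query, pixel) pair."""
    return [[dot(q, p) for p in pixel_features] for q in latent]
```

Applying decoder_layer once per feature level, with a different set of visual features each time, would mimic the hierarchical design; the self-attention step is where latent and text queries exchange information.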

End-to-end pre-training

  • Train X-Decoder in an end-to-end manner with two types of losses
  • Three losses on semantic outputs for three tasks
  • Compute language-image contrastive loss
  • Compute bidirectional cross-entropy loss
  • Compute binary cross-entropy loss and dice loss for masks
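The loss terms listed above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the authors' implementation: the contrastive loss sums the image-to-text and text-to-image directions, matching pairs lie on the diagonal of the similarity matrix, masks are flattened lists of probabilities, and the smoothing constant in the dice loss is a common convention rather than a value from the source.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(sim):
    """Bidirectional cross-entropy over an image-by-text similarity matrix."""
    transposed = [list(col) for col in zip(*sim)]
    return cross_entropy_rows(sim) + cross_entropy_rows(transposed)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """One minus the (smoothed) dice coefficient between two masks."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)
```

The binary cross-entropy term scores each pixel independently, while the dice term scores overlap at the level of the whole mask, so the two are commonly used together.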

Experiments

Experimental setup

  • Pretrained X-Decoder on three types of data: panoptic segmentation, image-text pairs, and referring segmentation
  • Used COCO2017 with segmentation annotations and excluded validation sets of Ref-COCOg UMD and COCO Karpathy
  • 104k images for segmentation pretraining, 30k images with referring segmentation annotations
  • Used 4M corpora for image-text pairs
  • Evaluated models on all tasks covered by pretraining
  • 10 settings of 7 datasets covering a wide range of domains
  • Finetuned and reported results on VQA for fine-grained visual reasoning
  • Visual encoder follows [12], transformer text encoder with causal masking [64,94]
  • Pretrained models for 50 epochs using AdamW [52] as the optimizer

Task-specific transfer

  • X-Decoder is finetuned to demonstrate task transfer capability
  • X-Decoder outperforms current SoTA on ADE Panoptic Segmentation
  • X-Decoder attains comparable performance to Mask2Former and kMaX-DeepLab on COCO
  • X-Decoder outperforms strong baseline UNITER and rivals VinVL on COCO retrieval
  • X-Decoder outperforms VinVL on CIDEr and BLEU
  • X-Decoder outperforms UViM, Pix2Seq v2, GLIPv2, UniT, GPV, UniTAB and Unified-IO

Zero-shot transfer

  • X-Decoder can be applied to various segmentation tasks and datasets after pretraining
  • Evaluated X-Decoder in a zero-shot manner on seven commonly used segmentation datasets
  • X-Decoder-Seg shows advantages over MSeg
  • Extra supervision from COCO captions improves model performance
  • X-Decoder outperforms OpenSeg and MaskCLIP on 10 settings of 7 datasets across three segmentation tasks

Model inspection

  • Image-text retrieval helps open-vocabulary segmentation
  • Image captioning and referring segmentation help each other
  • Image captioning and retrieval mutually benefit each other
  • Language conditioning is important for referring segmentation

Task composition

  • X-Decoder has the benefit of task interaction
  • X-Decoder enables joint task inference and iterative task inference with a single set of weights
  • X-Decoder can perform region-based retrieval and referring-based captioning without any change to architecture or weights
  • X-Decoder can localize a given word and modulate the predicted mask in the cross-attention layers
  • X-Decoder can be integrated with a diffusion model to perform referring image editing

Conclusion

  • We present X-Decoder, a model that supports pixel-level and image-level vision-language understanding
  • X-Decoder has a simple and generalized design
  • X-Decoder is pretrained with 50 epochs of COCO data and 45 epochs of 10 million image-text pairs
  • AdamW optimizer is used with initial learning rate 1e-4
  • Learning rate is multiplied by 0.1 at fractions 0.88889 and 0.96296 of the training steps
  • Finetuned for 10 epochs using AdamW optimizer, image resolution 384, batch size 2048, learning rates 3e-5 and 3e-6
  • Finetuned for 10 epochs using AdamW optimizer, image resolution 480, batch size 256, learning rates 2e-5 and 2e-6
  • Finetuned for 10 epochs using AdamW optimizer, image resolution 640, batch size 256, learning rates 1e-4, 1e-5 and 1e-3
  • Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-4, batch size 64 and 32
  • Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-5, batch size 64, learning rate multiplied by 0.1
  • Open vocabulary segmentation benchmark on 9 datasets with different evaluation metrics
  • X-Decoder exhibits strong generalization ability to segment images in ten settings of seven datasets
  • 25 datasets compiled into segmentation in the wild (SegInW) benchmark
  • X-Decoder shows reasonably good generalization ability to a wide range of visual and concept domains
  • X-Decoder has an advantage in small-scale tuning
  • Zero-shot gap can be bridged by tuning
  • Tuning class embedding is enough for few-shot settings
  • X-Decoder generalizes to video datasets and is flexible enough to support task compositions
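The pretraining learning-rate schedule listed above (initial rate 1e-4, decayed by a factor of 0.1 at fractions 0.88889 and 0.96296 of the training steps) can be sketched as a step function. The function name pretraining_lr is hypothetical, and applying the decay exactly at the milestone step (>= rather than >) is an assumption.

```python
def pretraining_lr(step, total_steps, base_lr=1e-4,
                   milestones=(0.88889, 0.96296), gamma=0.1):
    """Step-decay schedule: multiply by gamma at each milestone passed."""
    progress = step / total_steps
    passed = sum(1 for m in milestones if progress >= m)
    return base_lr * gamma ** passed
```

With 1000 total steps, the rate stays at 1e-4 until step 889, drops to about 1e-5 there, and drops again to about 1e-6 from step 963 onward.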