Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Presents X-Decoder, a model that can predict pixel-level segmentation and language tokens
- Takes two types of queries as input: generic non-semantic and semantic queries
- Enables seamless interactions across tasks at different granularities
- Pretrained on limited segmentation data and millions of image-text pairs
- Achieves state-of-the-art results on open-vocabulary segmentation and referring segmentation on 8 datasets
Paper Content
Introduction
- Visual understanding at different levels of granularity is a longstanding problem in the vision community
- Tasks span from image-level (e.g. classification, retrieval, captioning, VQA) to region-level (e.g. object detection, phrase grounding) to pixel-level (e.g. segmentation)
- Recently, transformers have been used to build general-purpose models that can learn from and be applied to a diverse set of vision and vision-language tasks
- Pixel-level understanding is one of the most important yet challenging problems
- X-Decoder is a generalized decoder that unifies pixel-level and image-level vision-language understanding
- X-Decoder takes two sets of queries as input and predicts two types of outputs
- An end-to-end learning paradigm is proposed to learn from all granularities of supervision
- X-Decoder supports a diversity of tasks in a zero-shot and open-vocabulary manner
- X-Decoder exhibits strong transferability to a wide range of segmentation and VL tasks
From specialist to generalist models
Pixel-level understanding
- Pixel-level image understanding is a long-standing problem
- Three tasks for pixel-level understanding: semantic, instance, and panoptic segmentation
- Models have evolved from CNN-based to transformer-based
- MSeg manually merges datasets to train a more generalized model
- Recent works transfer or distill knowledge from foundation models
- Referring segmentation is open-vocabulary
- X-Decoder is the first model to tackle generic and referring segmentation tasks in one model
Vision-language understanding
- VL pretraining has been effective for various VL tasks
- Models have evolved from transformer fusion to end-to-end transformers
- Image-text data at scale can help with visual representation learning
- VL pretrained models can be extended to region-level tasks
- Unified frameworks have been proposed to combine image-text pairs with region-level data
- Trend from specialist models to generalist models
X-decoder
Formulation
- Model follows encoder-decoder architecture
- Input image is encoded into features
- Textual query is encoded into visual features
- Visual features, textual queries and non-semantic queries are fed to X-Decoder
- X-Decoder predicts pixel-level masks and token-level semantics
- X-Decoder queries are divided into latent and text queries
- Outputs are divided into pixel-level masks and semantic embeddings
- Text encoder is used to encode textual corpus from all tasks
- Image and text encoders are fully decoupled
Unification of tasks
- X-Decoder can be used to unify different vision and vision-language tasks.
- For generic segmentation, no textual queries are needed.
- For referring segmentation, both latent and text queries are used.
- Image-text retrieval uses only latent queries.
- Image captioning and VQA use both latent and text queries.
Unified architecture
- Extract hierarchical visual features from L layers
- Cross-attend visual features and perform self-attention among latent and text queries
- Self-attention designed to prompt synergy of tasks
- Output of X-Decoder is categorized into pixel-wise mask and semantic outputs
- Text encoder consists of transformer layers
End-to-end pre-training
- Train X-Decoder in an end-to-end manner with two types of losses
- Three losses on semantic outputs for three tasks
- Compute language-image contrastive loss
- Compute bidirectional cross-entropy loss
- Compute binary cross-entropy loss and dice loss for masks
Experiments
Experimental setup
- Pretrained X-Decoder on three types of data: panoptic segmentation, image-text pairs, and referring segmentation
- Used COCO2017 with segmentation annotations and excluded validation sets of Ref-COCOg UMD and COCO Karpathy
- 104k images for segmentation pretraining, 30k images with referring segmentation annotations
- Used 4M corpora for image-text pairs
- Evaluated models on all tasks covered by pretraining
- 10 settings of 7 datasets covering a wide range of domains
- Finetuned and reported results on VQA for fine-grained visual reasoning
- Visual encoder follows [12], transformer text encoder with causal masking [64,94]
- Pretrained models for 50 epochs using AdamW [52] as the optimizer
Task-specific transfer
- X-Decoder is finetuned to demonstrate task transfer capability
- X-Decoder outperforms current SoTA on ADE Panoptic Segmentation
- X-Decoder attains comparable performance to Mask2Former and kMaX-DeepLab on COCO
- X-Decoder outperforms strong baseline UNITER and rivals VinVL on COCO retrieval
- X-Decoder outperforms VinVL on CIDEr and BLEU
- X-Decoder outperforms UViM, Pix2Seq v2, GLIPv2, UniT, GPV, UniTAB and Unified-IO
Zero-shot transfer
- X-Decoder can be applied to various segmentation tasks and datasets after pretraining
- Evaluated X-Decoder in a zero-shot manner on seven commonly used segmentation datasets
- X-Decoder-Seg shows advantages over MSeg
- Extra supervision from COCO captions improves model performance
- X-Decoder outperforms OpenSeg and MaskCLIP on 10 settings of 7 datasets across three segmentation tasks
Model inspection
- Image-text retrieval helps open-vocabulary segmentation
- Image captioning and referring segmentation help each other
- Image captioning and retrieval mutually benefit each other
- Language-condition is important for referring segmentation
Task composition
- X-Decoder has the benefit of task interaction
- X-Decoder enables joint task inference and iterative task inference with a single set of weights
- X-Decoder can perform region-based retrieval and referring based captioning without any architecture/weight change
- X-Decoder can localize a given word and modulate the predicted mask in the cross-attention layers
- X-Decoder can be integrated with diffusion model to do referring image editing
Conclusion
- We present X-Decoder, a model that supports pixel-level and image-level vision-language understanding
- X-Decoder has a simple and generalized design
- X-Decoder is pretrained with 50 epochs of COCO data and 45 epochs of 10 million image-text pairs
- AdamW optimizer is used with initial learning rate 1e-4
- Learning rate decays by 0.1 on fraction [0.88889, 0.96296] of training steps
- Finetuned for 10 epochs using AdamW optimizer, image resolution 384, batch size 2048, learning rates 3e-5 and 3e-6
- Finetuned for 10 epochs using AdamW optimizer, image resolution 480, batch size 256, learning rates 2e-5 and 2e-6
- Finetuned for 10 epochs using AdamW optimizer, image resolution 640, batch size 256, learning rates 1e-4, 1e-5 and 1e-3
- Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-4, batch size 64 and 32
- Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-5, batch size 64, learning rate multiplied by 0.1
- Open vocabulary segmentation benchmark on 9 datasets with different evaluation metrics
- X-Decoder exhibits strong generalization ability to segment images in ten settings of seven datasets
- 25 datasets compiled into segmentation in the wild (SegInW) benchmark
- X-Decoder shows reasonably good generalization ability to a wide range of visual and concept domains
- X-Decoder has privilege on small-scale tuning
- Zero-shot gap can be bridged by tuning
- Tuning class embedding is enough for few-shot settings
- Generalization ability to video datasets and flexibility to support task compositions for X-Decoder