Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Presents X-Decoder, a model that can predict pixel-level segmentation and language tokens
Takes two types of queries as input: generic non-semantic and semantic queries
Enables seamless interactions across tasks at different granularities
Pretrained on limited segmentation data and millions of image-text pairs
Achieves state-of-the-art results on open-vocabulary segmentation and referring segmentation on 8 datasets

Paper Content

Introduction

Visual understanding at different levels of granularity is a longstanding problem in the vision community
Tasks span from image-level (e.g. classification, retrieval, captioning, VQA) to region-level (e.g. object detection, phrase grounding) to pixel-level (e.g. segmentation)
Recently, transformers have been used to build general-purpose models that can learn from and be applied to a diverse set of vision and vision-language tasks
Pixel-level understanding is one of the most important yet challenging problems
X-Decoder is a generalized decoder that unifies pixel-level and image-level vision-language understanding
X-Decoder takes two sets of queries as input and predicts two types of outputs
An end-to-end learning paradigm is proposed to learn from all granularities of supervision
X-Decoder supports a diversity of tasks in a zero-shot and open-vocabulary manner
X-Decoder exhibits strong transferability to a wide range of segmentation and VL tasks

From specialist to generalist models

Pixel-level understanding

Pixel-level image understanding is a long-standing problem
Three tasks for pixel-level understanding: semantic, instance, and panoptic segmentation
Models have evolved from CNN-based to transformer-based
MSeg manually merges datasets to train a more generalized model
Recent works transfer or distill knowledge from foundation models
Referring segmentation is open-vocabulary
X-Decoder is the first model to tackle generic and referring segmentation tasks in one model

Vision-language understanding

VL pretraining has been effective for various VL tasks
Models have evolved from transformer fusion to end-to-end transformers
Image-text data at scale can help with visual representation learning
VL pretrained models can be extended to region-level tasks
Unified frameworks have been proposed to combine image-text pairs with region-level data
Trend from specialist models to generalist models

X-decoder

Formulation

Model follows encoder-decoder architecture
Input image is encoded into features
Textual query is encoded into visual features
Visual features, textual queries and non-semantic queries are fed to X-Decoder
X-Decoder predicts pixel-level masks and token-level semantics
X-Decoder queries are divided into latent and text queries
Outputs are divided into pixel-level masks and semantic embeddings
Text encoder is used to encode textual corpus from all tasks
Image and text encoders are fully decoupled

Unification of tasks

X-Decoder can be used to unify different vision and vision-language tasks.
For generic segmentation, no textual queries are needed.
For referring segmentation, both latent and text queries are used.
Image-text retrieval uses only latent queries.
Image captioning and VQA use both latent and text queries.

Unified architecture

Extract hierarchical visual features from L layers
Cross-attend visual features and perform self-attention among latent and text queries
Self-attention designed to prompt synergy of tasks
Output of X-Decoder is categorized into pixel-wise mask and semantic outputs
Text encoder consists of transformer layers

End-to-end pre-training

Train X-Decoder in an end-to-end manner with two types of losses
Three losses on semantic outputs for three tasks
Compute language-image contrastive loss
Compute bidirectional cross-entropy loss
Compute binary cross-entropy loss and dice loss for masks

Experiments

Experimental setup

Pretrained X-Decoder on three types of data: panoptic segmentation, image-text pairs, and referring segmentation
Used COCO2017 with segmentation annotations and excluded validation sets of Ref-COCOg UMD and COCO Karpathy
104k images for segmentation pretraining, 30k images with referring segmentation annotations
Used 4M corpora for image-text pairs
Evaluated models on all tasks covered by pretraining
10 settings of 7 datasets covering a wide range of domains
Finetuned and reported results on VQA for fine-grained visual reasoning
Visual encoder follows [12], transformer text encoder with causal masking [64,94]
Pretrained models for 50 epochs using AdamW [52] as the optimizer

Task-specific transfer

X-Decoder is finetuned to demonstrate task transfer capability
X-Decoder outperforms current SoTA on ADE Panoptic Segmentation
X-Decoder attains comparable performance to Mask2Former and kMaX-DeepLab on COCO
X-Decoder outperforms strong baseline UNITER and rivals VinVL on COCO retrieval
X-Decoder outperforms VinVL on CIDEr and BLEU
X-Decoder outperforms UViM, Pix2Seq v2, GLIPv2, UniT, GPV, UniTAB and Unified-IO

Zero-shot transfer

X-Decoder can be applied to various segmentation tasks and datasets after pretraining
Evaluated X-Decoder in a zero-shot manner on seven commonly used segmentation datasets
X-Decoder-Seg shows advantages over MSeg
Extra supervision from COCO captions improves model performance
X-Decoder outperforms OpenSeg and MaskCLIP on 10 settings of 7 datasets across three segmentation tasks

Model inspection

Image-text retrieval helps open-vocabulary segmentation
Image captioning and referring segmentation help each other
Image captioning and retrieval mutually benefit each other
Language-condition is important for referring segmentation

Task composition

X-Decoder has the benefit of task interaction
X-Decoder enables joint task inference and iterative task inference with a single set of weights
X-Decoder can perform region-based retrieval and referring based captioning without any architecture/weight change
X-Decoder can localize a given word and modulate the predicted mask in the cross-attention layers
X-Decoder can be integrated with diffusion model to do referring image editing

Conclusion

We present X-Decoder, a model that supports pixel-level and image-level vision-language understanding
X-Decoder has a simple and generalized design
X-Decoder is pretrained with 50 epochs of COCO data and 45 epochs of 10 million image-text pairs
AdamW optimizer is used with initial learning rate 1e-4
Learning rate decays by 0.1 on fraction [0.88889, 0.96296] of training steps
Finetuned for 10 epochs using AdamW optimizer, image resolution 384, batch size 2048, learning rates 3e-5 and 3e-6
Finetuned for 10 epochs using AdamW optimizer, image resolution 480, batch size 256, learning rates 2e-5 and 2e-6
Finetuned for 10 epochs using AdamW optimizer, image resolution 640, batch size 256, learning rates 1e-4, 1e-5 and 1e-3
Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-4, batch size 64 and 32
Finetuned for 24 epochs using AdamW optimizer, initial learning rate 1e-5, batch size 64, learning rate multiplied by 0.1
Open vocabulary segmentation benchmark on 9 datasets with different evaluation metrics
X-Decoder exhibits strong generalization ability to segment images in ten settings of seven datasets
25 datasets compiled into segmentation in the wild (SegInW) benchmark
X-Decoder shows reasonably good generalization ability to a wide range of visual and concept domains
X-Decoder has privilege on small-scale tuning
Zero-shot gap can be bridged by tuning
Tuning class embedding is enough for few-shot settings
Generalization ability to video datasets and flexibility to support task compositions for X-Decoder

Link to paper#

Abstract#

Paper Content#

Introduction#

From specialist to generalist models#

Pixel-level understanding#

Vision-language understanding#

X-decoder#

Formulation#

Unification of tasks#

Unified architecture#

End-to-end pre-training#

Experiments#

Experimental setup#

Task-specific transfer#

Zero-shot transfer#

Model inspection#

Task composition#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

From specialist to generalist models

Pixel-level understanding

Vision-language understanding

X-decoder

Formulation

Unification of tasks

Unified architecture

End-to-end pre-training

Experiments

Experimental setup

Task-specific transfer

Zero-shot transfer

Model inspection

Task composition

Conclusion