Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

UNINEXT is a universal instance perception model for object discovery and retrieval.
Benefits of UNINEXT include exploiting data from different tasks and label vocabularies for joint training of general instance-level representations, and being parameter-efficient when handling multiple tasks.
UNINEXT has shown superior performance on 20 challenging benchmarks from 10 instance-level tasks.

Paper Content

Introduction

Object-centric understanding is a challenging problem in computer vision
10 sub-tasks are discussed, distributed on the vertices of a cube
Object detection and instance segmentation require finding objects of specific categories
Multiple Object Tracking (MOT), Multi-Object Tracking and Segmentation (MOTS), and Video Instance Segmentation (VIS) require finding object trajectories of specific categories in videos
Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Referring Video Object Segmentation (R-VOS) aim to find objects matched with language expressions
Single Object Tracking (SOT) and Video Object Segmentation (VOS) take the target annotations given in the first frame as the reference
Fragmented task definitions split the field into pieces, causing redundant parameters and overlooking the possibility of mutual collaboration
UNINEXT is proposed as a universal instance perception model of the next generation
UNINEXT can flexibly perceive different instances by changing the input prompts
UNINEXT achieves superior performance on 20 challenging benchmarks

Related work

Retrieval by Category Names: Object detection and instance segmentation
Retrieval by Language Expressions: REC, RES, and R-VOS
Retrieval by Reference Annotations: SOT and VOS
Unified Vision Models: Unified learning paradigms and unified model architectures
Object detection and instance segmentation are foundations for other instance perception tasks
REC methods are divided into two-stage, one-stage, and Transformer-based methods
RES approaches focus on designing diverse attention mechanisms
R-VOS is an extension of RES from images to videos
SOT and VOS extract target features and fuse target information with representations of the current frame
Unified vision models attempt to solve multiple vision or multi-modal tasks by a single model
Unified learning paradigms cover many tasks and modalities
Unified model architectures designed for a group of closely related tasks

Approach

Categorize existing instance perception tasks into three classes
Object detection, instance segmentation, MOT, MOTS, and VIS use category names as prompts
REC, RES, and R-VOS use an expression as the prompt
SOT and VOS use annotation given in the first frame as the prompt
Reformulate all instance perception tasks into a prompt-guided object discovery and retrieval problem
UNINEXT consists of three components: prompt generation, image-prompt feature fusion, object discovery and retrieval
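
The three prompt classes above can be written down as a small lookup table. This is only an illustrative sketch: the names `PromptType`, `TASK_TO_PROMPT`, and `prompt_type_for` are hypothetical and do not come from the paper or its code.

```python
from enum import Enum


class PromptType(Enum):
    """The three prompt classes described in the Approach section."""
    CATEGORY_NAMES = "category names"
    LANGUAGE_EXPRESSION = "language expression"
    REFERENCE_ANNOTATION = "first-frame annotation"


# Hypothetical lookup table: task name -> prompt class, following the grouping above.
TASK_TO_PROMPT = {
    **dict.fromkeys(
        ["object detection", "instance segmentation", "MOT", "MOTS", "VIS"],
        PromptType.CATEGORY_NAMES,
    ),
    **dict.fromkeys(["REC", "RES", "R-VOS"], PromptType.LANGUAGE_EXPRESSION),
    **dict.fromkeys(["SOT", "VOS"], PromptType.REFERENCE_ANNOTATION),
}


def prompt_type_for(task):
    """Return the prompt class a task would use under this grouping."""
    return TASK_TO_PROMPT[task]
```

With this table, switching tasks amounts to switching the prompt class, for example `prompt_type_for("RES")` gives `PromptType.LANGUAGE_EXPRESSION`.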

Prompt generation

A prompt generation module is used to transform the original diverse prompt inputs into a unified form.
A language encoder is used to deal with language-related prompts.
An additional reference visual encoder is used to extract fine-grained visual features for annotation-guided tasks.
A merging module is applied to keep fine target information and get the prompt embedding in the same format.
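
As a rough illustration of bringing diverse prompts into a unified form, the sketch below turns a list of category names into a single text string that a language encoder could consume, so category prompts and expression prompts share one input type. The joining format and the function name `category_prompt` are assumptions made for this example, not details confirmed by the summary.

```python
def category_prompt(names):
    """Join category names into one text prompt.

    Assumed format: lowercase names separated by '. ' (hypothetical choice).
    """
    cleaned = [name.strip().lower() for name in names if name.strip()]
    return ". ".join(cleaned)


# A referring expression is already text, so both prompt kinds end up as strings.
detection_prompt = category_prompt(["Person", " bicycle ", "car"])
expression_prompt = "the man in a red jacket on the left"
```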

Image-prompt feature fusion

The whole current image is passed through a visual encoder to obtain hierarchical visual features.
An early fusion module is used to enhance the original prompt embedding by the image contexts and make the original visual features prompt-aware.
A bi-directional cross-attention module is used to retrieve information from different inputs.
The retrieved representations are added to the original features.
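
The fusion step can be sketched as follows, under strong simplifying assumptions: a single attention head, no learned projection matrices, and features stored as plain lists of rows. It only shows the data flow described above (each side retrieves information from the other, and the result is added back to the original features). All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bi_cross_attention(visual, prompt):
    """visual: N x d visual features; prompt: M x d prompt embeddings."""
    d = len(visual[0])
    raw = matmul(visual, transpose(prompt))                    # N x M similarities
    scores = [[s / math.sqrt(d) for s in row] for row in raw]
    prompt_to_visual = matmul(softmax_rows(scores), prompt)             # N x d
    visual_to_prompt = matmul(softmax_rows(transpose(scores)), visual)  # M x d
    # Residual connection: retrieved representations are added to the originals.
    return add(visual, prompt_to_visual), add(prompt, visual_to_prompt)
```

The visual output keeps its N x d shape and the prompt output keeps its M x d shape, so later modules can consume the fused features exactly as they would the originals.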

Object discovery and retrieval

UNINEXT adopts an encoder-decoder architecture that turns object queries into instances in a flexible query-to-instance fashion
Transformer encoder takes hierarchical prompt-aware visual features as inputs
Multi-scale Deformable Self-Attention used to exchange target information from different scales
Auxiliary prediction head generates initial reference points
Transformer decoder takes enhanced multi-scale features, reference points, and object queries as inputs
Two query generation strategies: static and dynamic queries
Instance head produces boxes and masks of targets
Embedding head associates current detected results with previous trajectories
Instance-prompt matching scores supervised by Focal Loss
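
The last point can be illustrated with a minimal sketch of matching scores and a binary focal loss. The score function below (a sigmoid over a dot product with a single prompt vector) is a simplification chosen for this example, and `alpha=0.25`, `gamma=2.0` are the commonly used focal loss defaults, not values confirmed for UNINEXT.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matching_scores(instance_embeds, prompt_embed):
    """Score each instance embedding against one prompt embedding (assumed form)."""
    return [
        sigmoid(sum(a * b for a, b in zip(inst, prompt_embed)))
        for inst in instance_embeds
    ]


def focal_loss(prob, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability and a 0/1 target."""
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    # The (1 - p_t) ** gamma factor down-weights examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Instances that match the prompt get target 1 and all others get target 0, so a confident correct score contributes almost nothing to the loss while a confident wrong score is penalized heavily.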

Training and inference

Training process consists of 3 stages
First stage pretrains UNINEXT on Objects365
Second stage finetunes UNINEXT on COCO and mixed dataset
Third stage finetunes UNINEXT on video-level datasets
Inference is online and post-processing free

Experiments

Implementation details

Three different backbones used as visual encoder
BERT used as text encoder, parameters trained in first and second stages, frozen in last stage
Transformer encoder-decoder architecture with 6 encoder and 6 decoder layers
Number of object queries set to 900
AdamW optimizer with weight decay of 0.05
Model trained on 32 A100 GPUs for Objects365 pretraining and 16 A100 GPUs for the other stages
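
The settings listed here, together with the three training stages described earlier, can be collected into one configuration object. The class and field names (`Stage`, `TrainConfig`, `train_text_encoder`) are hypothetical and only organize the values stated in this summary; they are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    train_text_encoder: bool  # BERT is trained in stages 1 and 2, frozen in stage 3


@dataclass(frozen=True)
class TrainConfig:
    encoder_layers: int = 6
    decoder_layers: int = 6
    num_queries: int = 900
    optimizer: str = "AdamW"
    weight_decay: float = 0.05
    text_encoder: str = "BERT"
    stages: tuple = (
        Stage("pretraining", "Objects365", True),
        Stage("image-level finetuning", "COCO and mixed dataset", True),
        Stage("video-level finetuning", "video-level datasets", False),
    )


config = TrainConfig()
frozen_stages = [s.name for s in config.stages if not s.train_text_encoder]
```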

Evaluations on 10 tasks

Compared UNINEXT with task-specific counterparts on 20 datasets
Used same model parameters for all benchmarks
Compared UNINEXT with state-of-the-art object detection and instance segmentation methods
UNINEXT surpassed state-of-the-art query-based detector DN-Deformable DETR by 2.7 box AP
UNINEXT achieved a box AP of 58.1 and 60.6, surpassing Cascade Mask-RCNN and ViTDet-H by 3.3 and 1.9 respectively
UNINEXT outperformed state-of-the-art QueryInst by 4.3 AP and 6.2 AP_L
UNINEXT obtained new state-of-the-art results, exceeding the previous best method by a large margin
Compared UNINEXT with state-of-the-art SOT methods on four large-scale benchmarks
UNINEXT achieved the best results in terms of AUC and precision (P)
Compared UNINEXT with previous semi-supervised VOS methods
UNINEXT achieved the best results among all non-memory-based methods
Compared UNINEXT with state-of-the-art MOT methods
UNINEXT surpassed Unicorn by 3.0 mMOTA and 2.7 mIDF1
Compared UNINEXT against state-of-the-art VIS methods
UNINEXT obtained the best results on both datasets

Ablations and other analysis

Evaluated on five benchmarks from five tasks
Early fusion has greatest impact on VOS
Removal of feature fusion causes performance drop on REC and R-VOS
Feature fusion has minimum influence on object detection and VIS
Dynamic queries perform slightly better than static queries on first four tasks
Static queries outperform dynamic ones on VIS task
Unified model achieves significantly better performance than task-specific counterparts on five tasks

Conclusions

We propose UNINEXT, a universal instance perception model of the next generation
UNINEXT unifies 10 instance perception tasks with a prompt-guided object discovery and retrieval paradigm
Experiments demonstrate that UNINEXT achieves superior performance on 20 challenging benchmarks
Training process consists of three stages with StepLR learning rate scheduler
Multi-scale training technique is used across all datasets
Loss functions include L_retrieve, L_box, L_mask, L_mask^boxinst, L_pairwise, and L_embed
Transformer architecture is used to transform enhanced visual features and prompt features into final instance predictions
Mask head is introduced for segmentation
SimOTA is used to enable multiple queries to be matched with one GT
IoU branch is added to reflect the quality of the predicted boxes
Contrastive DN, mixed query selection, and look forward twice are used to improve performance
UNINEXT outperforms other competitive counterparts on 10 tasks
UNINEXT can accurately locate the target referred by language expressions and precisely track and segment the targets in complex scenarios
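
For readers unfamiliar with the StepLR scheduler mentioned above, the sketch below shows the rule it applies: the learning rate is multiplied by a fixed factor every `step_size` epochs. The numbers in the example are hypothetical and are not the values used to train UNINEXT.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate after `epoch` epochs under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Hypothetical schedule: start at 1e-4 and decay by 10x every 10 epochs.
schedule = [step_lr(1e-4, epoch, step_size=10) for epoch in (0, 9, 10, 25)]
```

The rate stays constant within each block of `step_size` epochs and drops only at block boundaries, which is why epochs 0 and 9 in the example share the same value.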
