Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- UNINEXT is a universal instance perception model for object discovery and retrieval.
- Benefits of UNINEXT include exploiting data from different tasks and label vocabularies for joint training of general instance-level representations, and being parameter-efficient when handling multiple tasks.
- UNINEXT has shown superior performance on 20 challenging benchmarks from 10 instance-level tasks.
Paper Content
Introduction
- Object-centric understanding is a challenging problem in computer vision
- 10 sub-tasks are discussed, distributed on the vertices of a cube
- Object detection and instance segmentation require finding objects of specific categories
- Multiple Object Tracking (MOT), Multi-Object Tracking and Segmentation (MOTS), and Video Instance Segmentation (VIS) require finding object trajectories of specific categories in videos
- Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Referring Video Object Segmentation (R-VOS) aim to find objects matched with language expressions
- Single Object Tracking (SOT) and Video Object Segmentation (VOS) take the target annotations given in the first frame as the reference
- Fragmented task definitions split the field into pieces, causing redundant parameters and overlooking the possibility of mutual collaboration
- UNINEXT is proposed as a universal instance perception model of the next generation
- UNINEXT can flexibly perceive different instances by changing the input prompts
- UNINEXT achieves superior performance on 20 challenging benchmarks
Related work
- Retrieval by Category Names: Object detection and instance segmentation
- Retrieval by Language Expressions: REC, RES, and R-VOS
- Retrieval by Reference Annotations: SOT and VOS
- Unified Vision Models: Unified learning paradigms and unified model architectures
- Object detection and instance segmentation are foundations for other instance perception tasks
- REC methods divided into two-stage, one-stage, and Transformer-based
- RES approaches focus on designing diverse attention mechanisms
- R-VOS is an extension of RES from images to videos
- SOT and VOS extract target features and fuse target information with representations of the current frame
- Unified vision models attempt to solve multiple vision or multi-modal tasks by a single model
- Unified learning paradigms cover many tasks and modalities
- Unified model architectures designed for a group of closely related tasks
Approach
- Categorize existing instance perception tasks into three classes
- Object detection, instance segmentation, MOT, MOTS, and VIS use category names as prompts
- REC, RES, and R-VOS use an expression as the prompt
- SOT and VOS use annotation given in the first frame as the prompt
- Reformulate all instance perception tasks into a prompt-guided object discovery and retrieval problem
- UNINEXT consists of three components: prompt generation, image-prompt feature fusion, object discovery and retrieval
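The three-way grouping above can be written down as a small lookup table. This is only an illustrative sketch: the names `PromptType`, `TASK_TO_PROMPT`, and `tasks_for` are hypothetical and do not come from the paper or its code.

```python
from enum import Enum


class PromptType(Enum):
    CATEGORY = "category names"
    EXPRESSION = "language expression"
    ANNOTATION = "first-frame annotation"


# Grouping of the 10 tasks by prompt type, as listed in this section.
TASK_TO_PROMPT = {
    "object detection": PromptType.CATEGORY,
    "instance segmentation": PromptType.CATEGORY,
    "MOT": PromptType.CATEGORY,
    "MOTS": PromptType.CATEGORY,
    "VIS": PromptType.CATEGORY,
    "REC": PromptType.EXPRESSION,
    "RES": PromptType.EXPRESSION,
    "R-VOS": PromptType.EXPRESSION,
    "SOT": PromptType.ANNOTATION,
    "VOS": PromptType.ANNOTATION,
}


def tasks_for(prompt_type):
    """Return the tasks that share one prompt type, sorted by name."""
    return sorted(t for t, p in TASK_TO_PROMPT.items() if p is prompt_type)


if __name__ == "__main__":
    for prompt_type in PromptType:
        print(prompt_type.value, "->", tasks_for(prompt_type))
```

Seen this way, switching tasks amounts to switching the prompt while the model stays the same.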
Prompt generation
- A prompt generation module is used to transform the original diverse prompt inputs into a unified form.
- A language encoder is used to deal with language-related prompts.
- An additional reference visual encoder is used to extract fine-grained visual features for annotation-guided tasks.
- A merging module is applied to keep fine target information and get the prompt embedding in the same format.
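As a rough sketch of what "a unified form" could mean, the toy code below maps every prompt type to a list of vectors of the same width. Everything here is a stand-in: `toy_language_encoder`, `toy_reference_encoder`, the width `D`, and the choice to join category names into one sentence are assumptions for illustration, not the paper's actual encoders or merging module.

```python
D = 4  # toy embedding width (assumption; the real width is not given here)


def toy_language_encoder(text):
    """Stand-in for the language encoder: one D-dim vector per token."""
    return [[((len(tok) + i) % 7) / 7.0 for i in range(D)] for tok in text.split()]


def toy_reference_encoder(patch):
    """Stand-in for the reference visual encoder plus merging module:
    one D-dim vector per row of a small 2-D reference patch."""
    return [[sum(row) / len(row)] * D for row in patch]


def generate_prompt(kind, payload):
    """Map a category list, an expression, or a reference patch to a
    list of D-dim vectors, so later modules see a single format."""
    if kind == "category":
        # Assumption: category names are joined into one sentence.
        return toy_language_encoder(". ".join(payload))
    if kind == "expression":
        return toy_language_encoder(payload)
    if kind == "annotation":
        return toy_reference_encoder(payload)
    raise ValueError(f"unknown prompt kind: {kind}")


if __name__ == "__main__":
    prompts = [
        generate_prompt("category", ["person", "dog"]),
        generate_prompt("expression", "the dog on the left"),
        generate_prompt("annotation", [[0.0, 1.0], [1.0, 1.0]]),
    ]
    for p in prompts:
        print(len(p), "tokens of width", len(p[0]))
```

The point of the sketch is only the shared output shape: the sequence length varies per prompt, while the width stays fixed.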
Image-prompt feature fusion
- The whole current image is passed through a visual encoder to obtain hierarchical visual features.
- An early fusion module is used to enhance the original prompt embedding by the image contexts and make the original visual features prompt-aware.
- A bi-directional cross-attention module is used to retrieve information from different inputs.
- The retrieved representations are added to the original features.
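A minimal sketch of the bi-directional cross-attention with residual addition described above, in plain Python. It is simplified on purpose: a single head with no learned projections, which is an assumption for readability rather than the paper's module. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query retrieves a weighted
    average of the key/value rows (no learned projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values)) for c in range(d)])
    return out


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def early_fusion(visual, prompt):
    """Return (prompt-aware visual features, image-aware prompt embedding)."""
    from_prompt = cross_attend(visual, prompt)  # image tokens read the prompt
    from_visual = cross_attend(prompt, visual)  # prompt tokens read the image
    # Retrieved representations are added back to the original features.
    return add(visual, from_prompt), add(prompt, from_visual)


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]
    prompt = [[0.5, 0.5]]
    fused_visual, fused_prompt = early_fusion(visual, prompt)
    print(fused_visual, fused_prompt)
```

Both outputs keep the shape of their inputs, so the fused features can replace the originals in the later stages.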
Object discovery and retrieval
- UNINEXT adopts encoder-decoder architecture for flexible query-to-instance fashion
- Transformer encoder takes hierarchical prompt-aware visual features as inputs
- Multi-scale Deformable Self-Attention used to exchange target information from different scales
- Auxiliary prediction head generates initial reference points
- Transformer decoder takes enhanced multi-scale features, reference points, and object queries as inputs
- Two query generation strategies: static and dynamic queries
- Instance head produces boxes and masks of targets
- Embedding head associates current detected results with previous trajectories
- Instance-prompt matching scores supervised by Focal Loss
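To illustrate the last point, here is a small sketch of a binary focal loss applied to an instance-prompt matching logit. The dot-product form of `matching_logit` and the defaults `alpha=0.25`, `gamma=2.0` (the usual values from the original Focal Loss paper) are assumptions; this summary does not give the settings UNINEXT uses.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matching_logit(instance_emb, prompt_emb):
    """Assumed form: dot product between an instance and a prompt embedding."""
    return sum(a * b for a, b in zip(instance_emb, prompt_emb))


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    logit = matching_logit([1.0, 2.0], [0.5, 0.5])
    print("matched pair loss:", focal_loss(logit, 1))
    print("unmatched pair loss:", focal_loss(logit, 0))
```

The `(1 - p_t)^gamma` factor down-weights pairs that are already classified confidently, which helps when most queries do not match the prompt.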
Training and inference
- Training process consists of 3 stages
- First stage pretrains UNINEXT on Objects365
- Second stage finetunes UNINEXT on COCO and a mixed dataset
- Third stage finetunes UNINEXT on video-level datasets
- Inference is online and post-processing free
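The three-stage schedule can be summarized as a small configuration sketch. The `Stage` class and its field names are hypothetical. The text-encoder flag reflects the summary's statement that BERT is trained in the first two stages and frozen in the last. Dataset entries are copied from the bullets above without filling in details that the summary leaves out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: tuple
    train_text_encoder: bool


STAGES = (
    Stage("pretraining", ("Objects365",), True),
    Stage("image-level finetuning", ("COCO", "mixed dataset"), True),
    Stage("video-level finetuning", ("video-level datasets",), False),
)


def frozen_text_encoder_stages(stages):
    """Names of the stages in which the text encoder is kept frozen."""
    return [s.name for s in stages if not s.train_text_encoder]


if __name__ == "__main__":
    for i, stage in enumerate(STAGES, start=1):
        print(i, stage.name, stage.data, "text encoder trained:", stage.train_text_encoder)
```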
Experiments
Implementation details
- Three different backbones used as visual encoder
- BERT used as text encoder, parameters trained in first and second stages, frozen in last stage
- Transformer encoder-decoder architecture with 6 encoder and 6 decoder layers
- Number of object queries set to 900
- AdamW optimizer with weight decay of 0.05
- Model trained on 32 A100 GPUs for Objects365 pretraining and 16 A100 GPUs for the other stages
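For readers unfamiliar with AdamW, the sketch below shows one scalar update step with decoupled weight decay, using the weight decay of 0.05 listed above. The learning rate, betas, and epsilon are placeholder assumptions, since this summary does not state them. This is a textbook form of the update, not the authors' training code.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.05):
    """One AdamW update for a single scalar parameter.

    t is the 1-based step count. Returns (new_param, new_m, new_v).
    """
    b1, b2 = betas
    m = b1 * m + (1.0 - b1) * grad          # first-moment estimate
    v = b2 * v + (1.0 - b2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)             # bias correction
    v_hat = v / (1.0 - b2 ** t)
    param = param - lr * weight_decay * param            # decoupled decay
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # Adam step
    return param, m, v


if __name__ == "__main__":
    p, m, v = 1.0, 0.0, 0.0
    for step in range(1, 4):
        p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=step)
        print(step, p)
```

The decay term acts on the parameter directly instead of being folded into the gradient, which is what separates AdamW from Adam with L2 regularization.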
Evaluations on 10 tasks
- Compared UNINEXT with task-specific counterparts in 20 datasets
- Used same model parameters for all benchmarks
- Compared UNINEXT with state-of-the-art object detection and instance segmentation methods
- UNINEXT surpassed state-of-the-art query-based detector DN-Deformable DETR by 2.7 box AP
- With ConvNeXt-Large and ViT-Huge backbones, UNINEXT achieved box APs of 58.1 and 60.6, surpassing Cascade Mask-RCNN and ViTDet-H by 3.3 and 1.9 respectively
- UNINEXT outperformed state-of-the-art QueryInst by 4.3 AP and 6.2 AP_L
- UNINEXT obtained new state-of-the-art results, exceeding the previous best method by a large margin
- Compared UNINEXT with state-of-the-art SOT methods on four large-scale benchmarks
- UNINEXT achieved the best results in terms of AUC and P
- Compared UNINEXT with previous semi-supervised VOS methods
- UNINEXT achieved the best results among all non-memory-based methods
- Compared UNINEXT with state-of-the-art MOT methods
- UNINEXT surpassed Unicorn by 3.0 mMOTA and 2.7 mIDF1
- Compared UNINEXT against state-of-the-art VIS methods
- UNINEXT obtained the best results on both datasets
Ablations and other analysis
- Evaluated on five benchmarks from five tasks
- Early fusion has greatest impact on VOS
- Removal of feature fusion causes performance drop on REC and R-VOS
- Feature fusion has minimum influence on object detection and VIS
- Dynamic queries perform slightly better than static queries on first four tasks
- Static queries outperform dynamic ones on VIS task
- Unified model achieves significantly better performance than task-specific counterparts on five tasks
Conclusions
- We propose UNINEXT, a universal instance perception model of the next generation
- UNINEXT unifies 10 instance perception tasks with a prompt-guided object discovery and retrieval paradigm
- Experiments demonstrate that UNINEXT achieves superior performance on 20 challenging benchmarks
- Training process consists of three stages with StepLR learning rate scheduler
- Multi-scale training technique is used across all datasets
- Loss functions include $\mathcal{L}_{retrieve}$, $\mathcal{L}_{box}$, $\mathcal{L}_{mask}$, $\mathcal{L}_{mask}^{boxinst}$, $\mathcal{L}_{pairwise}$, and $\mathcal{L}_{embed}$
- Transformer architecture is used to transform enhanced visual features and prompt features into final instance predictions
- Mask head is introduced for segmentation
- SimOTA is used to enable multiple queries to be matched with one GT
- IoU branch is added to reflect the quality of the predicted boxes
- Contrastive DN, mixed query selection, and look forward twice are used to improve performance
- UNINEXT outperforms other competitive counterparts on 10 tasks
- UNINEXT can accurately locate the target referred by language expressions and precisely track and segment the targets in complex scenarios
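The StepLR scheduler mentioned in the list above multiplies the learning rate by a fixed factor every fixed number of steps. A minimal sketch follows. The base learning rate, step size, and decay factor are illustrative assumptions, since this summary does not state the values used for UNINEXT.

```python
def step_lr(base_lr, step, step_size, gamma=0.1):
    """Learning rate after `step` steps under a StepLR-style schedule:
    decay by `gamma` once every `step_size` steps."""
    return base_lr * gamma ** (step // step_size)


if __name__ == "__main__":
    for step in (0, 9, 10, 25):
        print(step, step_lr(base_lr=1e-4, step=step, step_size=10))
```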