Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • UNINEXT is a universal instance perception model for object discovery and retrieval.
  • Benefits of UNINEXT include exploiting data from different tasks and label vocabularies for joint training of general instance-level representations, and being parameter-efficient when handling multiple tasks.
  • UNINEXT has shown superior performance on 20 challenging benchmarks from 10 instance-level tasks.

Paper Content

Introduction

  • Object-centric understanding is a challenging problem in computer vision
  • 10 sub-tasks are discussed, distributed on the vertices of a cube
  • Object detection and instance segmentation require finding objects of specific categories
  • Multiple Object Tracking, Multi-Object Tracking and Segmentation, and Video Instance Segmentation require finding object trajectories of specific categories in videos
  • Referring Expression Comprehension, Referring Expression Segmentation, and Referring Video Object Segmentation aim to find objects matched with language expressions
  • Single Object Tracking and Video Object Segmentation take the target annotations given in the first frame as the reference
  • Fragmented task definitions split the field into pieces, causing redundant parameters and overlooking the possibility of mutual collaboration
  • UNINEXT is proposed as a universal instance perception model of the next generation
  • UNINEXT can flexibly perceive different instances by changing the input prompts
  • UNINEXT achieves superior performance on 20 challenging benchmarks
  • Retrieval by Category Names: Object detection and instance segmentation
  • Retrieval by Language Expressions: REC, RES, and R-VOS
  • Retrieval by Reference Annotations: SOT and VOS
  • Unified Vision Models: Unified learning paradigms and unified model architectures
  • Object detection and instance segmentation are foundations for other instance perception tasks
  • REC methods divided into two-stage, one-stage, and Transformer-based
  • RES approaches focus on designing diverse attention mechanisms
  • R-VOS is an extension of RES from images to videos
  • SOT and VOS extract target features and fuse target information with representations of the current frame
  • Unified vision models attempt to solve multiple vision or multi-modal tasks by a single model
  • Unified learning paradigms cover many tasks and modalities
  • Unified model architectures designed for a group of closely related tasks

Approach

  • Categorize existing instance perception tasks into three classes
  • Object detection, instance segmentation, MOT, MOTS, and VIS use category names as prompts
  • REC, RES, and R-VOS use an expression as the prompt
  • SOT and VOS use annotation given in the first frame as the prompt
  • Reformulate all instance perception tasks into a prompt-guided object discovery and retrieval problem
  • UNINEXT consists of three components: prompt generation, image-prompt feature fusion, object discovery and retrieval

Prompt generation

  • A prompt generation module is used to transform the original diverse prompt inputs into a unified form.
  • A language encoder is used to deal with language-related prompts.
  • An additional reference visual encoder is used to extract fine-grained visual features for annotation-guided tasks.
  • A merging module is applied to keep fine target information and get the prompt embedding in the same format.

Image-prompt feature fusion

  • The whole current image is passed through a visual encoder to obtain hierarchical visual features.
  • An early fusion module is used to enhance the original prompt embedding by the image contexts and make the original visual features prompt-aware.
  • A bi-directional cross-attention module is used to retrieve information from different inputs.
  • The retrieved representations are added to the original features.

Object discovery and retrieval

  • UNINEXT adopts encoder-decoder architecture for flexible query-to-instance fashion
  • Transformer encoder takes hierarchical prompt-aware visual features as inputs
  • Multi-scale Deformable Self-Attention used to exchange target information from different scales
  • Auxiliary prediction head generates initial reference points
  • Transformer decoder takes enhanced multi-scale features, reference points, and object queries as inputs
  • Two query generation strategies: static and dynamic queries
  • Instance head produces boxes and masks of targets
  • Embedding head associates current detected results with previous trajectories
  • Instance-prompt matching scores supervised by Focal Loss

Training and inference

  • Training process consists of 3 stages
  • First stage pretrains UNINEXT on Objects365
  • Second stage finetunes UNINEXT on COCO and mixed dataset
  • Third stage finetunes UNINEXT on video-level datasets
  • Inference is online and post-processing free

Experiments

Implementation details

  • Three different backbones used as visual encoder
  • BERT used as text encoder, parameters trained in first and second stages, frozen in last stage
  • Transformer encoder-decoder architecture with 6 encoder and 6 decoder layers
  • Number of object queries set to 900
  • AdamW optimizer with weight decay of 0.05
  • Model trained on 32 and 16 A100 GPUs

Evaluations on 10 tasks

  • Compared UNINEXT with task-specific counterparts in 20 datasets
  • Used same model parameters for all benchmarks
  • Compared UNINEXT with state-of-the-art object detection and instance segmentation methods
  • UNINEXT surpassed state-of-the-art query-based detector DN-Deformable DETR by 2.7 box AP
  • UNINEXT achieved a box AP of 58.1 and 60.6, surpassing Cascade Mask-RCNN and ViTDet-H by 3.3 and 1.9 respectively
  • UNINEXT outperformed state-of-the-art QueryInst by 4.3 AP and 6.2 AP L
  • UNINEXT obtained new state-of-the-art results, exceeding the previous best method by a large margin
  • Compared UNINEXT with state-of-the-art SOT methods on four large-scale benchmarks
  • UNINEXT achieved the best results in terms of AUC and P
  • Compared UNINEXT with previous semi-supervised VOS methods
  • UNINEXT achieved the best results among all non-memory-based methods
  • Compared UNINEXT with state-of-the-art MOT methods
  • UNINEXT surpassed Unicorn by 3.0 mMOTA and 2.7 mIDF1
  • Compared UNINEXT against state-of-the-art VIS methods
  • UNINEXT obtained the best results on both datasets

Ablations and other analysis

  • Evaluated on five benchmarks from five tasks
  • Early fusion has greatest impact on VOS
  • Removal of feature fusion causes performance drop on REC and RVOS
  • Feature fusion has minimum influence on object detection and VIS
  • Dynamic queries perform slightly better than static queries on first four tasks
  • Static queries outperform dynamic ones on VIS task
  • Unified model achieves significantly better performance than task-specific counterparts on five tasks

Conclusions

  • We propose UNINEXT, a universal instance perception model of the next generation
  • UNINEXT unifies 10 instance perception tasks with a prompt-guided object discovery and retrieval paradigm
  • Experiments demonstrate that UNINEXT achieves superior performance on 20 challenging benchmarks
  • Training process consists of three stages with StepLR learning rate scheduler
  • Multi-scale training technique is used across all datasets
  • Loss functions include L retrieve, L box, L mask, L boxinst mask, L pairwise, and L embed
  • Transformer architecture is used to transform enhanced visual features and prompt features into final instance predictions
  • Mask head is introduced for segmentation
  • SimOTA is used to enable multiple queries to be matched with one GT
  • IoU branch is added to reflect the quality of the predicted boxes
  • Contrastive DN, mixed query selection, and look forward twice are used to improve performance
  • UNINEXT outperforms other competitive counterparts on 10 tasks
  • UNINEXT can accurately locate the target referred by language expressions and precisely track and segment the targets in complex scenarios