Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Box-supervised instance segmentation uses simple box annotations instead of pixel-wise mask labels.
- Box2Mask is a novel single-shot instance segmentation approach that integrates level-set evolution into deep neural network learning.
- Box2Mask consists of two types of single-stage frameworks, CNN-based and transformer-based.
- Box2Mask achieves outstanding performance on five challenging testbeds.
Paper Content
Introduction
- Instance segmentation aims to obtain pixel-wise mask labels of objects
- Used in applications such as autonomous driving, robotic manipulation, image editing, cell segmentation
- Recent advances in CNN and transformer architectures have improved instance segmentation
- Existing methods require pixel-wise instance mask annotations, which are expensive and tedious to collect
- Box-supervised instance segmentation uses simple and label-efficient box annotations
- Methods have been developed to enable pixel-wise supervision with box annotation
- Recent approaches use pairwise affinity modeling to enable end-to-end training
- Box2Mask uses the classical level-set model to capture pixel affinities
- Box2Mask uses low-level and high-level features to robustly evolve level-set curves
- Box2Mask consists of level-set evolution, instance-aware decoder and box-level matching assignment
- Box2Mask raises the previous state of the art from 38.3% to 43.2% AP on Pascal VOC and from 33.4% to 38.3% AP on COCO
Related work
- Fully-supervised instance segmentation methods
- Box-supervised instance segmentation methods
- Level-set based segmentation methods
Fully-supervised instance segmentation
- Instance segmentation is a computer vision task that generates a pixel-level mask with a category label for each instance in an image.
- Existing works are divided into three categories: Mask R-CNN family, ROI-free approaches, and transformer-based methods.
- Weakly supervised methods with label-efficient annotations are gaining attention.
Box-supervised instance segmentation
- Box-supervised instance segmentation uses bounding box annotations to predict pixel-level masks
- One line of work formulates the task as a multiple instance learning (MIL) problem on top of Mask R-CNN
- BoxInst uses color-pairwise affinity with box constraint
- DiscoBox uses intra-image and cross-image pairwise potential
- BBAM and BoxCaseg use multiple training stages or extra saliency data for pseudo mask generation
- Cheng et al. and Tang et al. use extra points as supervision within the bounding box
- Proposed level-set based approach learns to evolve object boundary curves in an end-to-end manner
Level-set based image segmentation
- Level-set methods are used in image segmentation
- Level-set methods represent implicit curves with an energy function
- Some works have been developed to embed level-set evolution into deep learning
- Kim et al. [30] learned to perform level-set evolution in an unsupervised manner
- The proposed approach learns level-set evolution with only box annotations
Proposed method
Overview
- Proposed Box2Mask method for box-supervised instance segmentation
- Consists of a backbone, an instance-aware decoder (IAD), a box-level matching assignment step and a level-set evolution module
- Backbone used to extract basic features
- IAD generates instance-aware mask maps
- Box-level matching assignment assigns high-quality mask map samples as positives
- Level-set evolution module generates accurate supervisions with only bounding box annotations
Instance-aware decoder
- IAD learns to embed unique characteristics of each instance to generate instance-aware mask map
- IAD consists of pixel-wise decoder and kernel learning network
- CNN-based IAD uses dynamic convolution method
- Transformer-based IAD uses transformer decoder and multi-scale deformable attention transformer layers
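The dynamic convolution idea behind the CNN-based IAD can be illustrated with a minimal sketch: a kernel predicted for one instance is applied as a 1x1 convolution over the shared per-pixel features, which yields that instance's mask map. The function name `dynamic_mask`, the toy feature map, and the two kernels below are illustrative assumptions, not the paper's implementation.

```python
import math

def dynamic_mask(features, kernel, bias=0.0):
    """Apply one instance-specific 1x1 kernel to per-pixel features.

    features: H x W x C nested lists (stand-in for the pixel-wise decoder output).
    kernel:   length-C list (stand-in for one output of the kernel learning network).
    Returns an H x W map of mask probabilities in (0, 1).
    """
    mask = []
    for row in features:
        out_row = []
        for pixel in row:
            # A 1x1 convolution is a dot product between the kernel and the pixel's channels.
            logit = sum(f * k for f, k in zip(pixel, kernel)) + bias
            out_row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(out_row)
    return mask

# Toy 2x2 feature map with 2 channels: channel 0 fires on the left column,
# channel 1 on the right column.
features = [
    [[4.0, 0.0], [0.0, 4.0]],
    [[4.0, 0.0], [0.0, 4.0]],
]

# Two hypothetical instance kernels; each one responds to a different channel.
kernel_left = [1.0, -1.0]
kernel_right = [-1.0, 1.0]

mask_left = dynamic_mask(features, kernel_left)    # high on the left column
mask_right = dynamic_mask(features, kernel_right)  # high on the right column
```

The same feature map produces a different mask for each kernel, which is what makes the resulting mask maps instance-aware.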
Box-level matching assignment
- Label assignment is important for training instance segmentation networks.
- Different matching assignment techniques are used for CNN-based and transformer-based frameworks.
- For CNN-based framework, a center sampling scheme is used to ensure each potential instance-aware map contains only one target instance.
- For transformer-based framework, a Hungarian algorithm based bipartite matching assignment scheme is used.
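The bipartite matching step for the transformer-based framework can be sketched as follows. This is a minimal illustration under stated assumptions: the cost is taken to be a weighted sum of a classification term and a box-overlap term, the weights `beta1` and `beta2` default to 2.0 and 6.0 (the values the ablation reports as best), and a brute-force search over permutations stands in for the Hungarian algorithm. Both find the same minimum-cost assignment; the Hungarian algorithm simply does so in polynomial time (for example via `scipy.optimize.linear_sum_assignment`). The exact cost terms used in the paper may differ.

```python
from itertools import permutations

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match(preds, targets, beta1=2.0, beta2=6.0):
    """Return the min-cost one-to-one assignment {target index: prediction index}.

    Assumed cost: beta1 * (1 - class probability) + beta2 * (1 - box IoU).
    Brute force is only viable for a handful of predictions.
    """
    cost = [
        [
            beta1 * (1.0 - p["probs"][t["label"]])
            + beta2 * (1.0 - box_iou(p["box"], t["box"]))
            for t in targets
        ]
        for p in preds
    ]
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        total = sum(cost[p][t] for t, p in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return {t: p for t, p in enumerate(best)}

# Three hypothetical predictions and two ground-truth boxes.
preds = [
    {"box": (0, 0, 10, 10), "probs": [0.9, 0.1]},
    {"box": (20, 20, 40, 40), "probs": [0.2, 0.8]},
    {"box": (1, 1, 9, 9), "probs": [0.6, 0.4]},
]
targets = [
    {"box": (0, 0, 10, 10), "label": 0},
    {"box": (21, 21, 40, 40), "label": 1},
]

assignment = match(preds, targets)
print(assignment)  # {0: 0, 1: 1}
```

Prediction 2 stays unmatched and would be treated as background during training.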
Level-set evolution
- Level-set model used for image segmentation
- The Mumford-Shah level-set model provides the starting formulation
- The Chan-Vese level-set model is a simplification of the Mumford-Shah model
- Level-set evolution used in Box2Mask method
- Input image and deep features used as input data terms
- Box projection function used to generate initial level-set
- Affinity kernel function used to ensure local consistency of level-set
- L1 distance used to make level-set more stable
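To make the level-set terms above concrete, here is a minimal sketch of two ingredients. The first is the region term of the classical Chan-Vese energy, which sums p * (I - c1)^2 + (1 - p) * (I - c2)^2 over pixels, where p is the soft foreground membership and c1, c2 are the mean values inside and outside the curve. The second is an axis-wise max projection of a mask, assumed here to be the BoxInst-style form of the box projection. The function names and the toy image are illustrative, and the paper's exact energy (including its weights and deep-feature input) is not reproduced.

```python
def chan_vese_energy(image, phi, box):
    """Region term of the Chan-Vese energy, evaluated inside a box.

    image: H x W nested list of intensities (indexed image[y][x]).
    phi:   H x W nested list of soft foreground memberships in [0, 1].
    box:   (x1, y1, x2, y2) with half-open pixel ranges.
    """
    x1, y1, x2, y2 = box
    pix = [(image[y][x], phi[y][x]) for y in range(y1, y2) for x in range(x1, x2)]
    eps = 1e-6
    fg = sum(p for _, p in pix)
    bg = sum(1.0 - p for _, p in pix)
    c1 = sum(i * p for i, p in pix) / (fg + eps)          # mean value inside the curve
    c2 = sum(i * (1.0 - p) for i, p in pix) / (bg + eps)  # mean value outside the curve
    return sum(p * (i - c1) ** 2 + (1.0 - p) * (i - c2) ** 2 for i, p in pix)

def project(mask):
    """Max-project a mask onto the x axis (per column) and y axis (per row)."""
    proj_x = [max(col) for col in zip(*mask)]
    proj_y = [max(row) for row in mask]
    return proj_x, proj_y

# Toy image: an L-shaped bright object inside the tight box (1, 1, 3, 3).
image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = (1, 1, 3, 3)

# Initial guess: the whole box is foreground.
phi_box = [[1.0 if 1 <= x < 3 and 1 <= y < 3 else 0.0 for x in range(4)] for y in range(4)]
# A mask that follows the object exactly.
phi_obj = [row[:] for row in image]

# Both masks have identical projections, so a box constraint alone cannot tell them apart.
same_projection = project(phi_box) == project(phi_obj)

# The region energy does separate them: it is lower for the mask that fits the object.
energy_box = chan_vese_energy(image, phi_box, box)
energy_obj = chan_vese_energy(image, phi_obj, box)
```

In this toy case the box-filling mask has energy of about 0.75 while the object-shaped mask has energy close to zero, which is the direction the curve evolution is meant to move in.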
Training loss and inference
- Employ level-set energy as training objective for network optimization
- Loss function consists of two terms: category classification loss and instance segmentation loss
- Inference process is direct and efficient without need of level-set evolution
- Post-processing with matrix non-maximum suppression needed for model trained under CNN-based framework
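The matrix non-maximum suppression step mentioned for the CNN-based framework can be sketched as below. This follows the Matrix NMS formulation introduced with SOLOv2, using the linear decay kernel, and is simplified on purpose: masks are sets of pixel coordinates, all masks are assumed to share one class, and the 0.5 keep threshold is a hypothetical choice. It is not the paper's code.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (y, x) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def matrix_nms(masks, scores):
    """Return decayed scores in the original order (linear decay kernel)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    # iou[i][j], i < j in score order: overlap of mask j with a higher-scoring mask i.
    iou = [
        [mask_iou(masks[order[i]], masks[order[j]]) if i < j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # comp[i]: largest overlap of mask i with any higher-scoring mask, i.e. how
    # strongly mask i is itself suppressed.
    comp = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]
    decayed = list(scores)
    for j in range(n):
        # Each higher-scoring mask i proposes a decay; the strongest one wins.
        decay = min(
            ((1.0 - iou[i][j]) / max(1.0 - comp[i], 1e-6) for i in range(j)),
            default=1.0,
        )
        decayed[order[j]] = scores[order[j]] * decay
    return decayed

# Mask B overlaps heavily with the higher-scoring mask A; mask C is disjoint.
masks = [
    {(0, 0), (0, 1), (1, 0), (1, 1)},  # A
    {(0, 0), (0, 1), (1, 0), (2, 2)},  # B, IoU with A = 3/5
    {(5, 5), (5, 6)},                  # C
]
scores = [0.9, 0.8, 0.7]

decayed = matrix_nms(masks, scores)
kept = [i for i, s in enumerate(decayed) if s >= 0.5]  # hypothetical threshold
```

Unlike greedy NMS, every score is decayed in one pass with no sequential suppression loop. Here B's score drops from 0.8 to roughly 0.32 while A and C are unchanged.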
Experiments
Datasets
- Evaluated proposed Box2Mask approach on 5 challenging datasets
- Used Pascal VOC 2012 dataset with 10,582 images for training and 1,449 images for evaluation
- Used COCO dataset with 115K images for training, and 5K (val2017) and 20K (test-dev) images for evaluation
- Used iSAID dataset with 1,411 images for training and 458 images for performance evaluation
- Used LiTS dataset with 130 volume CT scans for training and 70 volume CT scans for testing
- Used ICDAR 2019 ReCTS dataset with 20K images for training and 5K images for testing
Implementation details
- Models trained with AdamW optimizer on 8 NVIDIA V100 GPUs
- mmdetection toolbox used with commonly used training settings
- ResNet and Swin-Transformer used as backbones, pre-trained on ImageNet-1K
- Box2Mask-C: initial learning rate 10^-4, weight decay 0.1, 16 images per mini-batch
- Training schedules of “1×” and “3×” same as mmdetection
- Loss weight α set to 3.0
- Box2Mask-T: initial learning rate 5 × 10^-5, weight decay 0.05, 8 images per mini-batch
- Train models for 50 epochs, large-scale jittering augmentation scheme used
- Non-negative weight γ set to 10^-4
- λ_1 = 0.05, λ_2 = 5.0 in Eq. 9
- Scale jitter used on COCO and Pascal VOC
- Input size 800x800 on iSAID, 640x640 on LiTS
- Training settings same as COCO on ICDAR 2019 ReCTS
- Performance evaluated with COCO-style mask AP (%)
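As a reminder of what COCO-style mask AP measures, the sketch below computes the IoU between a predicted and a ground-truth mask and lists the IoU thresholds that the prediction clears. COCO-style AP averages precision over the ten thresholds 0.50, 0.55, ..., 0.95, while AP_50 and AP_75 use a single threshold each. The helper names are illustrative, and the precision-recall integration of the full metric is omitted.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = sum(p and g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p or g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 0.0

# The ten IoU thresholds averaged by COCO-style AP.
COCO_IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]

def thresholds_passed(pred, gt):
    """IoU thresholds at which this prediction would count as a true positive."""
    iou = mask_iou(pred, gt)
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]

gt = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# The prediction misses one of the four object pixels: IoU = 3/4.
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

passed = thresholds_passed(pred, gt)
print(passed)  # [0.5, 0.55, 0.6, 0.65, 0.7, 0.75]
```

A mask like this one contributes to AP_50 and AP_75 but not to the stricter thresholds, which is why boundary quality matters so much for the averaged AP.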
Main results
- Proposed Box2Mask method compared to state-of-the-art box-supervised instance segmentation approaches
- Results on Pascal VOC and COCO reported
- Box2Mask-C outperforms DiscoBox by 0.8% AP on COCO val2017 split
- Box2Mask-T achieves 36.1% mask AP with 3.9% improvement over Box2Mask-C
- Box2Mask-C outperforms BoxInst and DiscoBox by 0.5% and 0.6% AP on COCO test2017 split
- Box2Mask-C surpasses A2GNN, BBAM, BoxCaseg and BoxInst by 13.3%, 8.5%, 3.3% and 1.0% mask AP
- Box2Mask-C achieves 24.3% AP on iSAID dataset, outperforming BoxInst and DiscoBox by 6.8% and 2.9%
- Box2Mask-C outperforms SOLO by 3.1% AP on iSAID dataset
- Box2Mask-C achieves 56.44% and 60.16% mAP on DOTA-v1.0 with 1× and 3× training schedules
- Box2Mask-T gains 55.3% mask AP on LiTS dataset
- Box2Mask-C obtains 44.6% box AP on ICDAR 2019 ReCTS, outperforming DiscoBox and BoxInst by 3.0% and 2.0% AP
Deep variational-based instance segmentation
- Box2Mask-C achieves comparable results to fully supervised variational-based methods
- Box2Mask-T outperforms Levelset R-CNN and DVIS-700 by 1.8% and 3.5% AP respectively with ResNet-50 backbone
- Box2Mask-T achieves best result of 37.5% AP with ResNet-101, outperforming all fully supervised deep variational-based methods
Inference speed
- Mask-supervised and box-supervised methods have been tested for inference speed and accuracy.
- Box2Mask-C with ResNet-50 backbone has 11.5 FPS and 32.6% mask AP.
- Box2Mask-T is slower than DiscoBox and BoxInst but has better accuracy.
- Box2Mask-T with ResNet-101 has 7.9 FPS and 38.3% mask AP.
Ablation experiments
- Level-set energy functional has an impact on the performance of Box2Mask-C
- Using the box projection function as F(φ_0) to initialize the boundary during training results in 27.1% AP
- Using the original image I_u as the input data term in Eq. 9 results in 30.6% AP
- Using both original image and high-level features results in 33.8% AP
- Using the global region with the full-image size results in a performance drop
- Using tree filters with low-level and high-level features results in 3.0% AP improvement
- Using the local consistency module with dilation rate 3 results in 36.3% AP
- Longer training schedules benefit the proposed models
- Best performance (39.4% AP) is obtained when β_1 = 2.0 and β_2 = 6.0
- Best performance (39.4% AP) is achieved on Pascal VOC when the number of MSDeformAtten layers is 5
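The dilation rate in the local consistency ablation controls how far apart the sampled neighbours of each pixel are. The sketch below is one plausible reading of such a module, not the paper's definition: each pixel's level-set value is compared, via an L1 distance, with an affinity-weighted average of its eight dilated neighbours, where the affinity is a Gaussian of the intensity difference. The function name, the Gaussian form, and `sigma` are all assumptions made for illustration.

```python
import math

def local_consistency_l1(image, phi, dilation=3, sigma=0.5):
    """Mean L1 distance between phi and its affinity-weighted neighbour average.

    image: H x W nested list of intensities, phi: H x W level-set values.
    Neighbours are the 8 positions at offsets of `dilation` pixels.
    """
    h, w = len(image), len(image[0])
    offsets = [
        (dy, dx)
        for dy in (-dilation, 0, dilation)
        for dx in (-dilation, 0, dilation)
        if (dy, dx) != (0, 0)
    ]
    total = 0.0
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    # Similar-looking neighbours get an affinity close to 1.
                    diff = image[y][x] - image[ny][nx]
                    a = math.exp(-(diff ** 2) / (2 * sigma ** 2))
                    num += a * phi[ny][nx]
                    den += a
            if den > 0:
                total += abs(phi[y][x] - num / den)
    return total / (h * w)

# 6x6 image: dark left half, bright right half.
image = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]
phi_consistent = [row[:] for row in image]  # level set follows the image
phi_noisy = [row[:] for row in image]
phi_noisy[0][0] = 1.0                       # one isolated, inconsistent pixel

loss_consistent = local_consistency_l1(image, phi_consistent)
loss_noisy = local_consistency_l1(image, phi_noisy)
```

Under these assumptions the isolated pixel raises the penalty, so minimizing the term pushes neighbouring pixels that look alike toward the same level-set value.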
Conclusion
- Box2Mask is a box-supervised instance segmentation approach
- It comes in two variants, one CNN-based and one transformer-based
- Input image and deep high-level features are used as inputs
- Box projection function is used to initialize the object boundary
- Experiments conducted on five benchmarks
- New state-of-the-art results on all datasets
- Performance gap between fully mask-supervised and box-supervised approach narrowed
- Instance segmentation with simple bounding box annotations more practical
- Kernel learning network used in framework
- Visualization of instance segmentation results on COCO, iSAID, LiTS, and ICDAR 2019 ReCTS
- Performance comparison on Pascal VOC val and COCO test-dev
- Box2Mask-C outperforms BoxInst by 3.7% and 3.1% mask AP with ResNet-50 and ResNet-101 backbones
- Box2Mask-C achieves 77.3% AP_25 and 66.6% AP_50
- Box2Mask-C achieves 40.9% AP_75 with ResNet-101
- Box2Mask-T achieves much higher performance than Box2Mask-C
- Box2Mask-T outperforms popular fully mask-supervised methods
- Instance segmentation results on remote sensing image dataset iSAID val
- Oriented object detection performance on DOTA-v1.0 test
- Mask AP results on Pascal VOC val
- Instance segmentation results on medical image dataset LiTS val
- Object detection performance on ICDAR 2019 ReCTS test dataset
- Deep variational instance segmentation results on COCO val
- Mask AP and inference speed of Box2Mask
- Effectiveness of deep structural features with different input guidance of tree filter
- Effectiveness of local consistency module (LCM) using different dilation rates
- Training schedules for Box2Mask-C and Box2Mask-T models
- Impact of balance weights β_1 and β_2 in matching cost for the box-level matching assignment in Box2Mask-T
- Effectiveness of the number of MSDeformAtten layers in the pixel decoder of Box2Mask-T