Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Box-supervised instance segmentation uses simple box annotations instead of pixel-wise mask labels.
Box2Mask is a novel single-shot instance segmentation approach that integrates level-set evolution into deep neural network learning.
Box2Mask consists of two types of single-stage frameworks, CNN-based and transformer-based.
Box2Mask achieves outstanding performance on five challenging testbeds.

Paper Content

Introduction

Instance segmentation aims to obtain pixel-wise mask labels of objects
Used in applications such as autonomous driving, robotic manipulation, image editing, cell segmentation
Recent advances in CNN and transformer architectures have improved instance segmentation
Existing methods require pixel-wise instance mask annotations, which are expensive and tedious to collect
Box-supervised instance segmentation uses simple and label-efficient box annotations
Methods have been developed to enable pixel-wise supervision with box annotation
Recent approaches use pairwise affinity modeling to enable end-to-end training
Box2Mask proposed to use classical level-set model to model pixel affinities
Box2Mask uses low-level and high-level features to robustly evolve level-set curves
Box2Mask consists of level-set evolution, instance-aware decoder and box-level matching assignment
Box2Mask improves on the previous state of the art from 38.3% AP to 43.2% AP on Pascal VOC and from 33.4% AP to 38.3% AP on COCO

Related work

Fully-supervised instance segmentation methods
Box-supervised instance segmentation methods
Level-set based segmentation methods

Fully-supervised instance segmentation

Instance segmentation is a computer vision task that generates a pixel-level mask with a category label for each instance in an image.
Existing works are divided into three categories: Mask R-CNN family, ROI-free approaches, and transformer-based methods.
Weakly supervised methods with label-efficient annotations are gaining attention.

Box-supervised instance segmentation

Box-supervised instance segmentation uses bounding box annotations to predict pixel-level masks
One line of work formulates the task as a multiple instance learning (MIL) problem built on Mask R-CNN
BoxInst uses color-pairwise affinity with box constraint
DiscoBox uses intra-image and cross-image pairwise potential
BBAM and BoxCaseg use multiple training stages or extra saliency data for pseudo mask generation
Cheng et al. and Tang et al. use extra points as supervision within the bounding box
Proposed level-set based approach learns to evolve object boundary curves in an end-to-end manner

Level-set based image segmentation

Level-set methods are used in image segmentation
Level-set methods represent implicit curves with an energy function
Some works have been developed to embed level-set evolution into deep learning
Kim et al. [30] learned to perform level-set evolution in an unsupervised manner
Our proposed approach learns level-set evolution with only box annotations

Proposed method

Overview

Proposed Box2Mask method for box-supervised instance segmentation
Consists of a backbone, an instance-aware decoder (IAD), a box-level matching assignment step and a level-set evolution module
Backbone used to extract basic features
IAD generates instance-aware mask maps
Box-level matching assignment assigns high-quality mask map samples as positives
Level-set evolution module generates accurate supervisions with only bounding box annotations

Instance-aware decoder

IAD learns to embed unique characteristics of each instance to generate instance-aware mask map
IAD consists of pixel-wise decoder and kernel learning network
CNN-based IAD uses dynamic convolution method
Transformer-based IAD uses transformer decoder and multi-scale deformable attention transformer layers
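
The dynamic convolution idea used by the CNN-based IAD can be illustrated with a minimal sketch: a kernel predicted for one instance is applied as a 1x1 convolution to the shared per-pixel features, producing that instance's mask map. The function name `dynamic_mask_map`, the toy feature values, and the two hand-picked kernels are assumptions for illustration only; in the paper the kernels come from the kernel learning network.

```python
import math

def dynamic_mask_map(pixel_features, kernel, bias=0.0):
    """Apply one instance-specific 1x1 kernel to a per-pixel feature map.

    pixel_features: H x W x C nested lists (stand-in for pixel decoder output).
    kernel: C weights for a single instance (hypothetical values here).
    Returns an H x W map of sigmoid activations in (0, 1).
    """
    mask = []
    for row in pixel_features:
        out_row = []
        for feat in row:
            # A 1x1 convolution is a dot product over channels at each pixel.
            logit = sum(w * f for w, f in zip(kernel, feat)) + bias
            out_row.append(1.0 / (1.0 + math.exp(-logit)))
        mask.append(out_row)
    return mask

# Toy 2 x 2 feature map with 2 channels.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.0, 0.0]]]

# Two different kernels give two different instance-aware maps
# from the same shared features.
mask_a = dynamic_mask_map(features, [4.0, -4.0])
mask_b = dynamic_mask_map(features, [-4.0, 4.0])
```

Because the features are shared and only the small kernel changes per instance, each extra instance costs one more dot product per pixel rather than a new forward pass.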

Box-level matching assignment

Label assignment is important for training instance segmentation networks.
Different matching assignment techniques are used for CNN-based and transformer-based frameworks.
For CNN-based framework, a center sampling scheme is used to ensure each potential instance-aware map contains only one target instance.
For transformer-based framework, a Hungarian algorithm based bipartite matching assignment scheme is used.
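
As a rough illustration of box-level bipartite matching, the sketch below assigns each ground-truth box to exactly one prediction so that the total cost is minimal. The cost used here (an L1 box distance weighted by `beta1` plus an IoU term weighted by `beta2`) is an assumption, not the paper's exact matching cost, and the search is a brute force over permutations that stands in for the Hungarian algorithm on tiny inputs.

```python
from itertools import permutations

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_cost(pred, gt, beta1=1.0, beta2=1.0):
    """Assumed cost: weighted L1 distance plus weighted (1 - IoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return beta1 * l1 + beta2 * (1.0 - box_iou(pred, gt))

def bipartite_match(pred_boxes, gt_boxes, beta1=1.0, beta2=1.0):
    """Return (pred_index, gt_index) pairs with minimal total cost.

    Brute force, so only suitable for a handful of boxes; requires
    len(pred_boxes) >= len(gt_boxes).
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(match_cost(pred_boxes[p], gt_boxes[g], beta1, beta2)
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [(p, g) for g, p in enumerate(best_perm)]

preds = [(0, 0, 10, 10), (50, 50, 60, 60), (20, 20, 30, 30)]
gts = [(21, 19, 31, 29), (1, 0, 11, 10)]
matches = bipartite_match(preds, gts)
```

Predictions left unmatched (index 1 in this toy case) would be treated as negatives. At realistic scale the same assignment is usually solved in polynomial time, for example with `scipy.optimize.linear_sum_assignment`.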

Level-set evolution

Level-set model used for image segmentation
Formulation of Mumford-Shah level-set model
Chan-Vese level-set model simplified Mumford-Shah model
Level-set evolution used in Box2Mask method
Input image and deep features used as input data terms
Box projection function used to generate initial level-set
Affinity kernel function used to ensure local consistency of level-set
L1 distance used to make level-set more stable
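
To make the Chan-Vese idea concrete, here is a minimal sketch of its region energy on a toy image: the energy is low when the pixels inside and outside the curve are each close to their own mean value. It omits the curve-length regularizer and the deep-feature data term, and treats `phi` as a soft inside-probability map, which are simplifying assumptions. The helper `axis_projections` illustrates the max-projection commonly used to compare a mask with its box in box-supervised work; how Box2Mask builds its initial level-set from the box is not reproduced here.

```python
def chan_vese_region_energy(image, phi, eps=1e-8):
    """Region terms of the Chan-Vese energy.

    image: H x W nested lists of intensities.
    phi:   H x W nested lists in [0, 1], soft indicator of "inside".
    Returns (energy, c1, c2) where c1 / c2 are the inside / outside means.
    """
    pix = [(v, p) for row_v, row_p in zip(image, phi)
           for v, p in zip(row_v, row_p)]
    inside = sum(p for _, p in pix)
    outside = sum(1.0 - p for _, p in pix)
    c1 = sum(v * p for v, p in pix) / (inside + eps)
    c2 = sum(v * (1.0 - p) for v, p in pix) / (outside + eps)
    energy = sum(p * (v - c1) ** 2 + (1.0 - p) * (v - c2) ** 2
                 for v, p in pix)
    return energy, c1, c2

def axis_projections(mask):
    """Max-project an H x W mask onto rows and columns."""
    rows = [max(r) for r in mask]
    cols = [max(c) for c in zip(*mask)]
    return rows, cols

# Toy image: a bright 2 x 2 square on a dark 4 x 4 background.
image = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 1.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]

phi_good = [row[:] for row in image]                 # curve hugs the square
phi_bad = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]   # curve splits the image

e_good, _, _ = chan_vese_region_energy(image, phi_good)
e_bad, _, _ = chan_vese_region_energy(image, phi_bad)
rows, cols = axis_projections(phi_good)
```

The well-placed curve yields a much lower energy than the misplaced one, which is the signal that lets the energy act as a training objective without mask labels.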

Training loss and inference

Employ level-set energy as training objective for network optimization
Loss function consists of two items: category classification loss and instance segmentation loss
Inference process is direct and efficient without need of level-set evolution
Post-processing with matrix non-maximum suppression needed for model trained under CNN-based framework
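
The matrix non-maximum suppression step can be sketched as follows. Instead of discarding overlapping masks one by one, it decays every score in a single pass based on the overlap with higher-scored masks. This is a small pure-Python version with a Gaussian decay; the `sigma` value, the flat-list mask format, and the function names are assumptions, and real implementations vectorize the same computation with matrix operations.

```python
import math

def mask_iou(a, b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(x * y for x, y in zip(a, b))
    union = sum(a) + sum(b) - inter
    return inter / union if union else 0.0

def matrix_nms(scores, masks, sigma=0.5):
    """Decay scores with Gaussian Matrix NMS.

    scores: confidences sorted in descending order.
    masks:  binary masks in the same order.
    Returns the decayed scores (nothing is removed here; a score
    threshold would normally follow).
    """
    n = len(scores)
    # iou[i][j] is only filled for i < j (mask i scores higher than j).
    iou = [[mask_iou(masks[i], masks[j]) if i < j else 0.0
            for j in range(n)] for i in range(n)]
    # cmax[i]: largest overlap of mask i with any higher-scored mask,
    # used to compensate for mask i itself being suppressed.
    cmax = [max((iou[k][i] for k in range(i)), default=0.0)
            for i in range(n)]
    decayed = []
    for j in range(n):
        decay = min((math.exp(-(iou[i][j] ** 2 - cmax[i] ** 2) / sigma)
                     for i in range(j)), default=1.0)
        decayed.append(scores[j] * decay)
    return decayed

scores = [0.9, 0.8, 0.7]
masks = [[1, 1, 0, 0],   # top-scoring mask
         [1, 1, 0, 0],   # duplicate of the first
         [0, 0, 1, 1]]   # separate object
new_scores = matrix_nms(scores, masks)
```

In this toy case the duplicate mask is strongly down-weighted while the non-overlapping mask keeps its score.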

Experiments

Datasets

Evaluated proposed Box2Mask approach on 5 challenging datasets
Used Pascal VOC 2012 dataset with 10,582 images for training and 1,449 images for evaluation
Used COCO dataset with 115K images for training, and 5K (val2017) and 20K (test-dev) images for evaluation
Used iSAID dataset with 1,411 images for training and 458 images for performance evaluation
Used LiTS dataset with 130 volume CT scans for training and 70 volume CT scans for testing
Used ICDAR2019 ReCTS dataset with 20K images for training and 5K images for testing

Implementation details

Models trained with AdamW optimizer on 8 NVIDIA V100 GPUs
mmdetection toolbox used with commonly used training settings
ResNet and Swin-Transformer used as backbones, pre-trained on ImageNet-1K
Initial learning rate 10^-4, weight decay 0.1, 16 images per mini-batch
Training schedules of “1×” and “3×” same as mmdetection
Loss weight α set to 3.0
Initial learning rate 5 × 10^-5, weight decay 0.05, 8 images per mini-batch
Train models for 50 epochs, large-scale jittering augmentation scheme used
Non-negative weight γ set to 10^-4
λ1 = 0.05, λ2 = 5.0 in Eq. 9
Scale jitter used on COCO and Pascal VOC
Input size 800x800 on iSAID, 640x640 on LiTS
Training settings same as COCO on ICDAR 2019 ReCTS
Performance evaluated with COCO-style mask AP (%)
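
For readers unfamiliar with the metric, COCO-style AP averages precision over ten mask IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below only shows the IoU test at the heart of that metric, not the full precision-recall computation; the helper names and the toy masks are illustrative assumptions.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(pred) + sum(gt) - inter
    return inter / union if union else 0.0

# The ten IoU thresholds used by COCO-style AP: 0.50, 0.55, ..., 0.95.
COCO_IOU_THRESHOLDS = [(50 + 5 * i) / 100 for i in range(10)]

def thresholds_passed(pred, gt):
    """Thresholds at which this prediction would count as a true positive."""
    iou = mask_iou(pred, gt)
    return [t for t in COCO_IOU_THRESHOLDS if iou >= t]

pred = [1, 1, 1, 0]
gt = [1, 1, 1, 1]
passed = thresholds_passed(pred, gt)   # IoU is 3 / 4 = 0.75
```

A mask that covers three quarters of its target counts as correct at the looser thresholds but not at the stricter ones, which is why AP75 is typically much lower than AP50.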

Main results

Proposed Box2Mask method compared to state-of-the-art box-supervised instance segmentation approaches
Results on Pascal VOC and COCO reported
Box2Mask-C outperforms DiscoBox by 0.8% AP on COCO val2017 split
Box2Mask-T achieves 36.1% mask AP with 3.9% improvement over Box2Mask-C
Box2Mask-C outperforms BoxInst and DiscoBox by 0.5% and 0.6% AP on COCO test2017 split
Box2Mask-C surpasses A2GNN, BBAM, BoxCaseg and BoxInst by 13.3%, 8.5%, 3.3% and 1.0% mask AP
Box2Mask-C achieves 24.3% AP on iSAID dataset, outperforming BoxInst and DiscoBox by 6.8% and 2.9%
Box2Mask-C outperforms SOLO by 3.1% AP on iSAID dataset
Box2Mask-C achieves 56.44% and 60.16% mAP on DOTA-v1.0 with 1× and 3× training schedules
Box2Mask-T gains 55.3% mask AP on LiTS dataset
Box2Mask-C obtains 44.6% box AP on ICDAR 2019 ReCTS, outperforming DiscoBox and BoxInst by 3.0% and 2.0% AP

Deep variational-based instance segmentation

Box2Mask-C achieves comparable results to fully supervised variational-based methods
Box2Mask-T outperforms Levelset R-CNN and DVIS-700 by 1.8% and 3.5% AP respectively with ResNet-50 backbone
Box2Mask-T achieves best result of 37.5% AP with ResNet-101, outperforming all fully supervised deep variational-based methods

Inference speed

Mask-supervised and box-supervised methods have been tested for inference speed and accuracy.
Box2Mask-C with ResNet-50 backbone has 11.5 FPS and 32.6% mask AP.
Box2Mask-T is slower than DiscoBox and BoxInst but has better accuracy.
Box2Mask-T with ResNet-101 has 7.9 FPS and 38.3% mask AP.

Ablation experiments

Level-set energy functional has an impact on the performance of Box2Mask-C
Using the box projection function as F φ 0 to initialize the boundary during training results in 27.1% AP
Using the original image I u as the input data term in Eq. 9 results in 30.6% AP
Using both original image and high-level features results in 33.8% AP
Using the global region with the full-image size results in a performance drop
Using tree filters with low-level and high-level features results in 3.0% AP improvement
Using the local consistency module with dilation rate 3 results in 36.3% AP
Longer training schedules benefit the proposed models
Best performance (39.4% AP) is obtained when β1 = 2.0 and β2 = 6.0
Best performance (39.4% AP) is achieved on Pascal VOC when the number of MSDeformAtten layers is 5

Conclusion

Box2Mask is a box-supervised instance segmentation approach
It comes in two single-stage variants: CNN-based (Box2Mask-C) and transformer-based (Box2Mask-T)
Input image and deep high-level features are used as inputs
Box projection function is used to initialize the object boundary
Experiments conducted on five benchmarks
New state-of-the-art results on all datasets
Performance gap between fully mask-supervised and box-supervised approach narrowed
Instance segmentation with simple bounding box annotations more practical
Kernel learning network used in framework
Visualization of instance segmentation results on COCO, iSAID, LiTS, and ICDAR 2019 ReCTS
Performance comparison on Pascal VOC val and COCO test-dev
Box2Mask-C outperforms BoxInst by 3.7% and 3.1% mask AP with ResNet-50 and ResNet-101 backbones
Box2Mask-C achieves 77.3% AP25 and 66.6% AP50
Box2Mask-C achieves 40.9% AP75 with ResNet-101
Box2Mask-T achieves much higher performance than Box2Mask-C
Box2Mask-T outperforms popular fully mask-supervised methods
Instance segmentation results on remote sensing image dataset iSAID val
Oriented object detection performance on DOTA-v1.0 test
Mask AP results on Pascal VOC val
Instance segmentation results on medical image dataset LiTS val
Object detection performance on ICDAR2019 ReCTS test dataset
Deep variational instance segmentation results on COCO val
Mask AP and inference speed of Box2Mask
Effectiveness of deep structural features with different input guidance of tree filter
Effectiveness of local consistency module (LCM) using different dilation rates
Training schedules for Box2Mask-C and Box2Mask-T models
Impact of balance weights β1 and β2 in matching cost for the box-level matching assignment in Box2Mask-T
Effectiveness of the number of MSDeformAtten layers in the pixel decoder of Box2Mask-T
