Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Box-supervised instance segmentation uses simple box annotations instead of pixel-wise mask labels.
  • Box2Mask is a novel single-shot instance segmentation approach that integrates level-set evolution into deep neural network learning.
  • Box2Mask consists of two types of single-stage frameworks, CNN-based and transformer-based.
  • Box2Mask achieves outstanding performance on five challenging testbeds.

Paper Content

Introduction

  • Instance segmentation aims to obtain pixel-wise mask labels of objects
  • Used in applications such as autonomous driving, robotic manipulation, image editing, cell segmentation
  • Recent advances in CNN and transformer architectures have improved instance segmentation
  • Existing methods require pixel-wise instance mask annotations, which is expensive and tedious
  • Box-supervised instance segmentation uses simple and label-efficient box annotations
  • Methods have been developed to enable pixel-wise supervision with box annotation
  • Recent approaches use pairwise affinity modeling to enable end-to-end training
  • Box2Mask proposed to use classical level-set model to model pixel affinities
  • Box2Mask uses low-level and high-level features to robustly evolve level-set curves
  • Box2Mask consists of level-set evolution, instance-aware decoder and box-level matching assignment
  • Box2Mask improves previous state-of-the-art 38.3% AP to 43.2% AP on Pascal VOC, 33.4% AP to 38.3% AP on COCO
  • Fully-supervised instance segmentation methods
  • Box-supervised instance segmentation methods
  • Level-set based segmentation methods

Fully-supervised instance segmentation

  • Instance segmentation is a computer vision task that generates a pixel-level mask with a category label for each instance in an image.
  • Existing works are divided into three categories: Mask R-CNN family, ROI-free approaches, and transformer-based methods.
  • Weakly supervised methods with label-efficient annotations are gaining attention.

Box-supervised instance segmentation

  • Box-supervised instance segmentation uses bounding box annotations to predict pixel-level masks
  • Multiple instance learning (MIL) problem is formulated using Mask R-CNN
  • BoxInst uses color-pairwise affinity with box constraint
  • DiscoBox uses intra-image and cross-image pairwise potential
  • BBAM and BoxCaseg use multiple training stages or extra saliency data for pseudo mask generation
  • Cheng et al. and Tang et al. use extra points as supervision within the bounding box
  • Proposed level-set based approach learns to evolve object boundary curves in an end-to-end manner

Level-set based image segmentation

  • Level-set methods are used in image segmentation
  • Level-set methods represent implicit curves with an energy function
  • Some works have been developed to embed level-set evolution into deep learning
  • Kim et al. [30] learned to perform level-set evolution in an unsupervised manner
  • Our proposed approach learns level-set evolution with only box annotations

Proposed method

Overview

  • Proposed Box2Mask method for box-supervised instance segmentation
  • Consists of a backbone, an instance-aware decoder (IAD), a box-level matching assignment step and a level-set evolution module
  • Backbone used to extract basic features
  • IAD generates instance-aware mask maps
  • Box-level matching assignment assigns high-quality mask map samples as positives
  • Level-set evolution module generates accurate supervisions with only bounding box annotations

Instance-aware decoder

  • IAD learns to embed unique characteristics of each instance to generate instance-aware mask map
  • IAD consists of pixel-wise decoder and kernel learning network
  • CNN-based IAD uses dynamic convolution method
  • Transformer-based IAD uses transformer decoder and multi-scale deformable attention transformer layers

Box-level matching assignment

  • Label assignment is important for training instance segmentation networks.
  • Different matching assignment techniques are used for CNN-based and transformer-based frameworks.
  • For CNN-based framework, a center sampling scheme is used to ensure each potential instance-aware map contains only one target instance.
  • For transformer-based framework, a Hungarian algorithm based bipartite matching assignment scheme is used.

Level-set evolution

  • Level-set model used for image segmentation
  • Formulation of Mumford-Shah level-set model
  • Chan-Vese level-set model simplified Mumford-Shah model
  • Level-set evolution used in Box2Mask method
  • Input image and deep features used as input data terms
  • Box projection function used to generate initial level-set
  • Affinity kernel function used to ensure local consistency of level-set
  • L1 distance used to make level-set more stable

Training loss and inference

  • Employ level-set energy as training objective for network optimization
  • Loss function consists of two items: category classification loss and instance segmentation loss
  • Inference process is direct and efficient without need of level-set evolution
  • Post-processing with matrix non-maximum suppression needed for model trained under CNN-based framework

Experiments

Datasets

  • Evaluated proposed Box2Mask approach on 5 challenging datasets
  • Used Pascal VOC 2012 dataset with 10,582 images for training and 1,449 images for evaluation
  • Used COCO dataset with 115K images for training and 5K and 20K images for evaluation
  • Used iSAID dataset with 1,411 images for training and 458 images for performance evaluation
  • Used LiTS dataset with 130 volume CT scans for training and 70 volume CT scans for testing
  • Used ICDAR2019 ReCTS dataset with 20K images for training and 5K images for testing

Implementation details

  • Models trained with AdamW optimizer on 8 NVIDIA V100 GPUs
  • mmdetection toolbox used with commonly used training settings
  • ResNet and Swin-Transformer used as backbones, pre-trained on ImageNet-1K
  • Initial learning rate 10-4, weight decay 0.1, 16 images per mini-batch
  • Training schedules of “1×” and “3×” same as mmdetection
  • Loss function set to α = 3.0
  • Initial learning rate 5 x 10-5, weight decay 0.05, 8 images per mini-batch
  • Train models for 50 epochs, large-scale jittering augmentation scheme used
  • Non-negative weight γ set to 10-4
  • λ 1 = 0.05, λ 2 = 5.0 in Eq. 9
  • Scale jitter used on COCO and Pascal VOC
  • Input size 800x800 on iSAID, 640x640 on LiTS
  • Training settings same as COCO on ICDAR 2019 ReCTS
  • Performance evaluated with COCO-style mask AP (%)

Main results

  • Proposed Box2Mask method compared to state-of-the-art box-supervised instance segmentation approaches
  • Results on Pascal VOC and COCO reported
  • Box2Mask-C outperforms DiscoBox by 0.8% AP on COCO val2017 split
  • Box2Mask-T achieves 36.1% mask AP with 3.9% improvement over Box2Mask-C
  • Box2Mask-C outperforms BoxInst and DiscoBox by 0.5% and 0.6% AP on COCO test2017 split
  • Box2Mask-C surpasses A2GNN, BBAM, BoxCaseg and BoxInst by 13.3%, 8.5%, 3.3% and 1.0% mask AP
  • Box2Mask-C achieves 24.3% AP on iSAID dataset, outperforming BoxInst and DiscoBox by 6.8% and 2.9%
  • Box2Mask-C outperforms SOLO by 3.1% AP on iSAID dataset
  • Box2Mask-C achieves 56.44% and 60.16% mAP on DOTA-v1.0 with 1× and 3× training schedules
  • Box2Mask-T gains 55.3% mask AP on LiTS dataset
  • Box2Mask-C obtains 44.6% box AP on ICDAR 2019 ReCTS, outperforming DiscoBox and BoxInst by 3.0% and 2.0% AP

Deep variational-based instance segmentation

  • Box2Mask-C achieves comparable results to fully supervised variational-based methods
  • Box2Mask-T outperforms Levelset R-CNN and DVIS-700 by 1.8% and 3.5% AP respectively with ResNet-50 backbone
  • Box2Mask-T achieves best result of 37.5% AP with ResNet-101, outperforming all fully supervised deep variational-based methods

Inference speed

  • Mask-supervised and box-supervised methods have been tested for inference speed and accuracy.
  • Box2Mask-C with ResNet-50 backbone has 11.5 FPS and 32.6% mask AP.
  • Box2Mask-T is slower than DiscoBox and BoxInst but has better accuracy.
  • Box2Mask-T with ResNet-101 has 7.9 FPS and 38.3% mask AP.

Ablation experiments

  • Level-set energy functional has an impact on the performance of Box2Mask-C
  • Using the box projection function as F φ 0 to initialize the boundary during training results in 27.1% AP
  • Using the original image I u as the input data term in Eq. 9 results in 30.6% AP
  • Using both original image and high-level features results in 33.8% AP
  • Using the global region with the full-image size results in a performance drop
  • Using tree filters with low-level and high-level features results in 3.0% AP improvement
  • Using the local consistency module with dilation rate 3 results in 36.3% AP
  • Longer training schedules benefit the proposed models
  • Best performance (39.4% AP) is obtained when β 1 = 2.0 and β 1 = 6.0
  • Best performance (39.4% AP) is achieved on Pascal VOC when the number of MSDeformAtten layers is 5

Conclusion

  • Box2Mask is a box-supervised instance segmentation approach
  • It uses a CNN-based and transformer-based architecture
  • Input image and deep high-level features are used as inputs
  • Box projection function is used to initialize the object boundary
  • Experiments conducted on five benchmarks
  • New state-of-the-arts on all datasets
  • Performance gap between fully mask-supervised and box-supervised approach narrowed
  • Instance segmentation with simple bounding box annotations more practical
  • Kernel learning network used in framework
  • Visualization of instance segmentation results on COCO, iSAID, LiTS, and ICDAR 2019 ReCTS
  • Performance comparison on Pascal VOC val and COCO test-dev
  • Box2Mask-C outperforms BoxInst by 3.7% and 3.1% mask AP with ResNet-50 and ResNet-101 backbones
  • Box2Mask-C achieves 77.3% and 66.6% AP 25 and AP 50
  • Box2Mask-C achieves 40.9% AP 75 with ResNet-101
  • Box2Mask-T achieves much higher performance than Box2Mask-C
  • Box2Mask-T outperforms popular fully mask-supervised methods
  • Instance segmentation results on remote sensing image dataset iSAID val
  • Oriented object detection performance on DOTA-v1.0 test
  • Mask AP results on Pascal VOC val
  • Instance segmentation results on medical image dataset LiTS val
  • Object detection performance on ICDAR2019 ReCTS test dataset
  • Deep variational instance segmentation results on COCO val
  • Mask AP and inference speed of Box2Mask
  • Effectiveness of deep structural features with different input guidance of tree filter
  • Effectiveness of local consistency module (LCM) using different dilation rates
  • Training schedules for Box2Mask-C and Box2Mask-T models
  • Impact of balance weights β 1 and β 2 in matching cost for the box-level matching assignment in Box2Mask-T
  • Effectiveness of the number of MSDeformAtten layers in the pixel decoder of Box2Mask-T