Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • IMAS is a method for segmenting primary objects in videos without manual annotation.
  • IMAS uses motion-appearance synergy to deal with motion-appearance conflicts.
  • IMAS has two training stages: motion-supervised object discovery and refinement.
  • IMAS proposes motion-semantic alignment as a model-agnostic, annotation-free hyperparameter tuning method.
  • IMAS improves segmentation quality on several UVOS benchmarks.

Paper Content

Introduction

  • Video object segmentation is a widely researched topic in computer vision.
  • Many algorithms require manual annotation which is costly to obtain.
  • Unsupervised video object segmentation (UVOS) has gained popularity in recent years.

Method

  • IMAS consists of two stages: motion-supervised object discovery and appearance-supervised refinement
  • Neither stage requires human annotation, making IMAS fully unsupervised
  • Sec. 3.1 presents the problem setting, Sec. 3.2 and 3.3 describe contributions to the first and second stages, and Sec. 3.4 presents motion-semantic alignment as a model-agnostic unsupervised hyperparameter tuner
  • UVOS requires segmenting objects from video sequences without human annotation
  • Previous UVOS methods require supervised training with manual annotation
  • Focus on fully unsupervised VOS, no human annotation needed
  • Motion segmentation uses motion information to separate foreground and background
  • Appearance-based UVOS methods use raw images as input to detect objects without relative motion

Problem setting

  • Let I_t ∈ R^{3×h×w} be the t-th frame from a sequence of T RGB frames
  • Objective of UVOS is a binary segmentation mask M_t ∈ {0, 1}^{h×w} for each timestep t
  • Quality is measured by the mean Jaccard index J (i.e., mean IoU) between predicted segmentation mask M_t and ground truth G_t
  • No human-annotated data used throughout training and inference
  • Use off-the-shelf optical flow model RAFT to provide motion cues between consecutive frames
  • Piecewise constant pathway and learnable residual pathway to form the final flow prediction \hat{F}_t
  • Training objective of stage 1 is to minimize the L1 loss between the predicted reconstruction flow \hat{F}_t and target flow F_t
  • Appearance supervision with low-level cues and semantic constraint
  • Motion-semantic alignment metric to quantify segmentation quality
  • Select hyperparameter values with the highest mean IoU
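
The stage 1 flow reconstruction can be illustrated with a minimal NumPy sketch. It assumes that the piecewise constant pathway gives each mask channel its mask-weighted mean flow and that the residual is simply added on top; the summary does not state the exact formulation, so both are assumptions. In the actual method the residual is learned, whereas here it is a plain array, and all function names are hypothetical.

```python
import numpy as np

def piecewise_constant_flow(masks, flow, eps=1e-8):
    """masks: (C, h, w) soft masks; flow: (2, h, w) target flow.

    Assumption: each channel is assigned its mask-weighted mean flow.
    """
    weights = masks.reshape(masks.shape[0], -1)   # (C, h*w)
    flat_flow = flow.reshape(2, -1)               # (2, h*w)
    # Mask-weighted mean flow vector per channel, shape (C, 2)
    mean_flow = (weights @ flat_flow.T) / (weights.sum(axis=1, keepdims=True) + eps)
    # Paint each channel's mean flow back onto its pixels, shape (2, h, w)
    return np.einsum("chw,cd->dhw", masks, mean_flow)

def predict_flow(masks, flow, residual):
    """Final prediction: piecewise constant flow plus a residual (given here, learned in the paper)."""
    return piecewise_constant_flow(masks, flow) + residual

def l1_flow_loss(pred_flow, target_flow):
    """Mean absolute error between predicted and target flow."""
    return float(np.abs(pred_flow - target_flow).mean())

# Toy example: left half moves right, right half moves down.
h, w = 4, 6
target = np.zeros((2, h, w))
target[0, :, :3] = 1.0
target[1, :, 3:] = 2.0
masks = np.zeros((2, h, w))
masks[0, :, :3] = 1.0
masks[1, :, 3:] = 1.0
pred = predict_flow(masks, target, residual=np.zeros_like(target))
loss = l1_flow_loss(pred, target)  # close to zero: masks match the motion regions
```

When the masks line up with regions of uniform motion, the piecewise constant term alone reconstructs the flow; the residual is what lets the prediction deviate from constant flow within a region.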

Experiments

Datasets

  • Evaluated methods on 3 datasets commonly used to benchmark UVOS
  • DAVIS2016 contains 50 video sequences with 3,455 frames in total
  • Performance evaluated on validation set of 20 videos with annotation at 480p resolution
  • SegTrackv2 contains 14 videos of different resolutions with 976 annotated frames
  • FBMS59 contains 59 videos with 13,860 frames, of which 720 are annotated at a fixed interval
  • Merged multiple foreground objects in STv2 and FBMS59 into one mask
  • Trained on all unlabeled videos
  • Evaluation metric is mean Jaccard index J (mIoU)
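
The evaluation protocol above can be sketched in a few lines of NumPy: foreground instances are merged into one binary mask and the Jaccard index is averaged over frames. The function names are hypothetical, and scoring two empty masks as a perfect match is an assumption rather than something the summary specifies.

```python
import numpy as np

def merge_foreground(instance_mask):
    """Collapse instance labels (0 = background, 1..K = objects) into one binary mask."""
    return (instance_mask > 0).astype(np.uint8)

def jaccard(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def mean_jaccard(preds, gts):
    """Mean Jaccard index J over a list of (prediction, ground truth) pairs."""
    return float(np.mean([jaccard(p, g) for p, g in zip(preds, gts)]))

# Toy example: the prediction covers two pixels, the ground truth one of them.
pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
score = jaccard(pred, gt)
```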

Unsupervised video object segmentation

  • Uses ResNet50 backbone and two heads with 3 Conv-BN-ReLU layers
  • Object channels determined without human annotation
  • RAFT model trained on synthetic datasets without human annotation
  • IMAS outperforms previous methods by a large margin
  • IMAS surpasses previous state-of-the-art method by 5.6% without post-processing

Motion-semantic alignment for hyperparameter tuning

  • Motion-semantic alignment is used to tune two hyperparameters
  • Increasing the number of channels improves segmentation quality, but saturates at C = 4
  • Object channel index c_o needs to be obtained at the end of each training run
  • Leverage redundancy in video sequences and use only the first frame of each video sequence to find c_o
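
Finding c_o from first frames only can be sketched as follows. The `reference_masks` argument is hypothetical: it stands in for whatever annotation-free signal the alignment is measured against, which this summary does not spell out. The selection rule, the channel with the highest mean IoU against that reference, is likewise an assumption for illustration.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks; 0.0 if both are empty."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0

def select_object_channel(first_frame_masks, reference_masks):
    """first_frame_masks: one (C, h, w) array per video (first frame only).
    reference_masks: one (h, w) binary array per video (hypothetical input).
    Returns the channel with the highest mean IoU, plus all per-channel scores.
    """
    num_channels = first_frame_masks[0].shape[0]
    scores = np.zeros(num_channels)
    for masks, ref in zip(first_frame_masks, reference_masks):
        hard = masks.argmax(axis=0)  # per-pixel winning channel
        for c in range(num_channels):
            scores[c] += iou(hard == c, ref)
    scores /= len(first_frame_masks)
    return int(scores.argmax()), scores

# Toy example with C = 3 channels on a 4x4 frame.
labels = np.zeros((4, 4), dtype=int)
labels[1:3, 1:3] = 1   # channel 1 covers the object
labels[3, :] = 2       # channel 2 covers a background strip
masks = np.eye(3)[labels].transpose(2, 0, 1)  # one-hot masks, shape (3, 4, 4)
reference = labels == 1
c_o, scores = select_object_channel([masks, masks], [reference, reference])
```

The same per-run score could serve as the quantity compared across hyperparameter settings, for example across different numbers of channels C.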

Ablation study

  • Residual pathway contributes 5.4%
  • Feature merging contributes 3.4%
  • Explicit appearance refinement boosts performance by 10.1%
  • CRF post-processing boosts performance by 12.2%

Visualizations and discussions

  • IMAS is compared to [26,50]
  • IMAS can handle complex foreground motion, distracting background motion, and camera motion
  • IMAS has limitations, such as not working when neither motion nor appearance provides informative signals
  • IMAS can recognize multiple foreground objects when they move in sync, but sometimes only captures one when they move differently
  • IMAS is not designed to separate multiple foreground objects

Summary

  • Presents IMAS, an unsupervised video object segmentation method
  • Leverages motion-appearance synergy
  • Object discovery stage with conflict-resolving learnable residual pathway
  • Refinement stage with appearance supervision
  • Motion-semantic alignment as annotation-free hyperparameter tuning method

Applying motion-semantic alignment on previous work

  • Our hyperparameter tuning method is model-agnostic.
  • Our method finds the same optimal number of channels and object channel index as using the validation set performance with human annotation.

Additional implementation details

  • Treat video frame pair {t, t + 1} as both a forward and backward action
  • Use symmetric loss to apply loss function both forward and backward
  • Use random crop augmentation at training time and the original image at test time
  • Resize images for STv2 and FBMS59
  • Use pixel-wise photometric transformation for augmentation
  • Change head of ResNet by fusing features from the first and third residual blocks
  • Load self-supervised ImageNet pretrained weights
  • Model non-uniform 2D flow from object rotation in 3D
  • Capture multiple objects in a foreground group
  • Robust to misleading common motion and camera motion
  • Focuses on one of the foreground objects if it has significantly larger motion than the others
  • Hyperparameters are tuned with unsupervised motion-semantic alignment
  • Discard potential constraint if match is greater than 80% of width or 90% of height
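
The last rule in the list can be written as a small filter. This sketch assumes that "match" refers to the pixel extent of the matched region and that the thresholds are fractions of the image size; the function and parameter names are hypothetical.

```python
def keep_constraint(match_width, match_height, image_width, image_height,
                    max_width_frac=0.8, max_height_frac=0.9):
    """Return False when a potential constraint should be discarded.

    Assumption: a match is discarded if it spans more than 80% of the
    image width or more than 90% of the image height.
    """
    too_wide = match_width > max_width_frac * image_width
    too_tall = match_height > max_height_frac * image_height
    return not (too_wide or too_tall)

# Example on an 854x480 frame: a 700-pixel-wide match exceeds 80% of the width.
small_kept = keep_constraint(100, 100, 854, 480)
wide_kept = keep_constraint(700, 100, 854, 480)
```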

Per-sequence results

  • Results on DAVIS16 listed in Table 6
  • Results on STv2 listed in Table 7
  • Results on FBMS59 listed in Table 8

Future directions

  • Method does not leverage temporal consistency
  • Could be more robust by using neighboring frames
  • Temporal consistency measures could be taken care of with additional loss term or post-processing
  • Does not support segmenting multiple parts of the foreground
  • Generates high-quality segmentation and is robust to uninformative or misleading motion signals
  • Framework has a motion supervision module and an appearance-based refinement stage
  • Semantic constraint mitigates false positives from misleading motion signals
  • Selects objectness channel with motion-semantic alignment
  • Hyperparameters tuned with motion-semantic alignment agree with the ones tuned using human annotation
  • Robust to uninformative or misleading motion cues
  • Model-agnostic hyperparameter tuning technique