Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

IMAS is a method for segmenting primary objects in videos without manual annotation.
IMAS uses motion-appearance synergy to deal with motion-appearance conflicts.
IMAS has two training stages: motion-supervised object discovery and refinement.
IMAS proposes motion-semantic alignment as a model-agnostic, annotation-free hyperparameter tuning method.
IMAS improves segmentation quality on several UVOS benchmarks.

Paper Content

Introduction

Video object segmentation is a widely researched topic in computer vision.
Many algorithms require manual annotation, which is costly to obtain.
Unsupervised video object segmentation (UVOS) has gained popularity in recent years.

Method

IMAS consists of two stages: motion-supervised object discovery and appearance-supervised refinement
Neither stage requires human annotation, making IMAS fully unsupervised
Sec. 3.1 presents the problem setting, Sec. 3.2 and 3.3 describe contributions to the first and second stages, and Sec. 3.4 presents motion-semantic alignment as a model-agnostic unsupervised hyperparam tuner

Related work

UVOS requires segmenting objects from video sequences without human annotation
Previous UVOS methods require supervised training with manual annotation
Focus on fully unsupervised VOS, no human annotation needed
Motion segmentation uses motion information to separate foreground and background
Appearance-based UVOS methods use raw images as input to detect objects without relative motion

Problem setting

Let $I_t \in \mathbb{R}^{3 \times h \times w}$ be the $t$-th frame from a sequence of $T$ RGB frames
Objective of UVOS is a binary segmentation mask $M_t \in \{0, 1\}^{h \times w}$ for each timestep $t$
Evaluation uses the mean Jaccard index $\mathcal{J}$ (i.e., mean IoU) between predicted segmentation mask $M_t$ and ground truth $G_t$
No human-annotated data used throughout training and inference
Use off-the-shelf optical flow model RAFT to provide motion cues between consecutive frames
Piecewise constant pathway and learnable residual pathway combine to form the final flow prediction $\hat{F}_t$
Training objective of stage 1 is to minimize the L1 loss between the predicted reconstruction flow $\hat{F}_t$ and target flow $F_t$
Appearance supervision with low-level cues and semantic constraint
Motion-semantic alignment metric to quantify segmentation quality
Select hyperparam values with highest mean IoU
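
The stage-1 objective described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the piecewise constant pathway assigns each mask channel the mask-weighted mean of the target flow, and that the residual pathway supplies a per-pixel offset (here a hand-written list standing in for a network output). All function names are hypothetical, and pixels are flattened into a single list to keep the example dependency-free.

```python
def piecewise_constant_flow(masks, flow):
    """masks: C lists of N per-pixel weights; flow: N (u, v) pairs.

    Assumption: each channel is assigned the mask-weighted mean of the flow.
    """
    recon = [[0.0, 0.0] for _ in flow]
    for mask in masks:
        weight = sum(mask)
        if weight == 0:
            continue  # an empty channel contributes nothing
        mean_u = sum(m * f[0] for m, f in zip(mask, flow)) / weight
        mean_v = sum(m * f[1] for m, f in zip(mask, flow)) / weight
        for p, m in enumerate(mask):
            recon[p][0] += m * mean_u
            recon[p][1] += m * mean_v
    return recon


def reconstruct_flow(masks, flow, residual):
    """Add a per-pixel residual to the piecewise constant flow."""
    base = piecewise_constant_flow(masks, flow)
    return [[b[0] + r[0], b[1] + r[1]] for b, r in zip(base, residual)]


def l1_loss(pred, target):
    """Mean absolute error over all pixels and both flow components."""
    total = sum(abs(p[0] - t[0]) + abs(p[1] - t[1]) for p, t in zip(pred, target))
    return total / (2 * len(target))


# Toy example: 4 pixels, 2 channels (object, background).
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
flow = [(1, 0), (3, 0), (0, 2), (0, 2)]           # object pixels move non-uniformly
residual = [(-1, 0), (1, 0), (0, 0), (0, 0)]      # stand-in for the learned residual

base = piecewise_constant_flow(masks, flow)
full = reconstruct_flow(masks, flow, residual)
print(l1_loss(base, flow), l1_loss(full, flow))   # 0.25 0.0
```

In this toy case the constant pathway alone cannot explain the non-uniform motion inside the object, while the residual closes the gap, which is the role the residual pathway is meant to play.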

Methods

Experiments

Datasets

Evaluated methods on 3 datasets commonly used to benchmark UVOS
Dataset DAVIS2016 contains 50 video sequences of 3,455 frames
Performance evaluated on validation set of 20 videos with annotation at 480p resolution
SegTrackv2 contains 14 videos of different resolutions with 976 annotated frames
FBMS59 contains 59 videos with 13,860 frames and 720 frames annotated with a fixed interval
Merged multiple foreground objects in STv2 and FBMS59 into one mask
Trained on all unlabeled videos
Evaluation metric is mean Jaccard index $\mathcal{J}$ (mIoU)
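
The evaluation protocol above (merge foreground objects into one mask, then score with the mean Jaccard index) can be sketched as follows. The function names are hypothetical. Two choices here are assumptions rather than details from the paper: treating two empty masks as a perfect match, and averaging scores over frames (benchmarks may instead average per sequence first).

```python
def merge_foreground(label_map):
    """Collapse an instance-label map (0 = background) into one binary mask."""
    return [[1 if value > 0 else 0 for value in row] for row in label_map]


def jaccard(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_jaccard(preds, gts):
    """Average the per-frame Jaccard index."""
    scores = [jaccard(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


# Two annotated frames; the first has two labelled objects (ids 1 and 2).
gt_frames = [merge_foreground([[0, 1, 1], [0, 2, 2]]),
             merge_foreground([[1, 1, 0], [0, 0, 0]])]
pred_frames = [[[0, 1, 1], [0, 0, 1]],
               [[1, 1, 0], [0, 0, 0]]]

print(mean_jaccard(pred_frames, gt_frames))  # 0.875
```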

Unsupervised video object segmentation

Uses ResNet50 backbone and two heads with 3 Conv-BN-ReLU layers
Object channels determined without human annotation
RAFT model trained on synthetic datasets without human annotation
IMAS outperforms previous methods by a large margin
IMAS surpasses previous state-of-the-art method by 5.6% without post-processing

Motion-semantic alignment for hyperparam tuning

Motion-semantic alignment is used to tune two hyperparameters
Increasing the number of channels improves segmentation quality, but saturates at $C = 4$
Object channel index $c_o$ needs to be obtained at the end of each training run
Leverage redundancy in video sequences and use only first frame of each video sequence to find $c_o$
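
A minimal sketch of choosing the object channel from first frames only is given below. This summary does not describe how the alignment score is computed. The sketch therefore assumes it is an IoU between each channel's mask and an annotation-free, semantics-derived proxy mask. The proxy masks, the function names, and the data layout are all hypothetical.

```python
def iou(a, b):
    """IoU of two flat binary masks; returns 0.0 when both are empty."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def select_object_channel(first_frame_masks, proxy_masks):
    """Pick the channel whose masks best align with the proxy masks.

    first_frame_masks[v][c]: binary mask of channel c on the first frame of video v.
    proxy_masks[v]: assumed annotation-free reference mask for that frame.
    """
    num_channels = len(first_frame_masks[0])
    scores = []
    for c in range(num_channels):
        per_video = [iou(masks[c], proxy)
                     for masks, proxy in zip(first_frame_masks, proxy_masks)]
        scores.append(sum(per_video) / len(per_video))
    best = max(range(num_channels), key=scores.__getitem__)
    return best, scores


# Two videos, two channels, four pixels per (flattened) first frame.
first_frame_masks = [
    [[1, 1, 0, 0], [0, 0, 1, 1]],
    [[1, 0, 0, 0], [0, 1, 1, 1]],
]
proxy_masks = [[0, 0, 1, 1], [0, 0, 1, 1]]

c_o, scores = select_object_channel(first_frame_masks, proxy_masks)
print(c_o)  # 1
```

The same argmax-over-mean-score pattern would apply to other hyperparameters, such as the number of channels, by looping over candidate values instead of channel indices.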

Ablation study

Residual pathway contributes 5.4%
Feature merging contributes 3.4%
Explicit appearance refinement boosts performance by 10.1%
CRF post-processing boosts performance by 12.2%

Visualizations and discussions

IMAS is compared to [26,50]
IMAS can handle complex foreground motion, distracting background motion, and camera motion
IMAS has limitations, such as not working when neither motion nor appearance provides informative signals
IMAS can recognize multiple foreground objects when they move in sync, but sometimes only captures one when they move differently
IMAS is not designed to separate multiple foreground objects

Summary

Presents IMAS, an unsupervised video object segmentation method
Leverages motion-appearance synergy
Object discovery stage with conflict-resolving learnable residual pathway
Refinement stage with appearance supervision
Motion-semantic alignment as annotation-free hyperparam tuning method

Applying motion-semantic alignment on previous work

Our hyperparam tuning method is model-agnostic.
Our method finds the same optimal number of channels and object channel index as using the validation set performance with human annotation.

Additional implementation details

Treat video frame pair {t, t + 1} as both a forward and backward action
Use symmetric loss to apply loss function both forward and backward
Use random crop augmentation at training and original image at test
Resize images for STv2 and FBMS59
Use pixel-wise photo-metric transformation for augmentation
Change head of ResNet by fusing feature from first and third residual block
Load self-supervised ImageNet pretrained weights
Model non-uniform 2D flow from object rotation in 3D
Capture multiple objects in a foreground group
Robust to misleading common motion and camera motion
Select to focus on one of the foreground objects if one has significantly larger motion
Tuned hyperparams from unsupervised motion-semantic alignment
Discard potential constraint if match is greater than 80% of width or 90% of height
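
Two of the details above lend themselves to a short sketch: the symmetric loss and the size-based discard rule. Both functions are hypothetical illustrations. The equal weighting of the forward and backward terms is an assumption. So is the reading of "match" as the width and height of a candidate region measured in pixels.

```python
def l1(pred, target):
    """Mean absolute error between two flat lists of flow values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def symmetric_loss(pred_fwd, flow_fwd, pred_bwd, flow_bwd):
    """Apply the same loss to the t -> t+1 and t+1 -> t directions.

    Assumption: the two directions are averaged with equal weight.
    """
    return 0.5 * (l1(pred_fwd, flow_fwd) + l1(pred_bwd, flow_bwd))


def keep_constraint(match_w, match_h, img_w, img_h,
                    max_w_frac=0.8, max_h_frac=0.9):
    """Return False when the match exceeds 80% of width or 90% of height."""
    too_wide = match_w > max_w_frac * img_w
    too_tall = match_h > max_h_frac * img_h
    return not (too_wide or too_tall)


print(symmetric_loss([1.0, 2.0], [1.0, 1.0], [0.0, 0.0], [0.0, 2.0]))  # 0.75
print(keep_constraint(50, 50, 100, 100))  # True
print(keep_constraint(85, 50, 100, 100))  # False
```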

Per-sequence results

Results on DAVIS16 listed in Table 6
Results on STv2 listed in Table 7
Results on FBMS59 listed in Table 8

Future directions

Method does not leverage temporal consistency
Could be more robust by using neighboring frames
Temporal consistency measures could be taken care of with additional loss term or post-processing
Does not support segmenting multiple parts of the foreground
Generates high-quality segmentation and is robust to uninformative or misleading motion signals
Framework has a motion supervision module and an appearance-based refinement stage
Semantic constraint mitigates false positives from misleading motion signals
Selects objectness channel with motion-semantic alignment
Tuned hyperparams from motion-semantic alignment align with ones from human annotation
Robust to uninformative or misleading motion cues
Model agnostic hyperparam tuning technique
