Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- IMAS is a method for segmenting primary objects in videos without manual annotation.
- IMAS uses motion-appearance synergy to deal with motion-appearance conflicts.
- IMAS has two training stages: motion-supervised object discovery and refinement.
- IMAS proposes motion-semantic alignment as a model-agnostic annotation-free hyperparam tuning method.
- IMAS improves segmentation quality on several UVOS benchmarks.
Paper Content
Introduction
- Video object segmentation is a widely researched topic in computer vision.
- Many algorithms require manual annotation which is costly to obtain.
- Unsupervised video object segmentation (UVOS) has gained popularity in recent years.
Method
- IMAS consists of two stages: motion-supervised object discovery and appearancesupervised refinement
- Neither stage requires human annotation, making IMAS fully unsupervised
- Sec. 3.1 presents the problem setting, Sec. 3.2 and 3.3 describe contributions to the first and second stages, and Sec. 3.4 presents motion-semantic alignment as a model-agnostic unsupervised hyperparam tuner
Related work
- UVOS requires segmenting objects from video sequences without human annotation
- Previous UVOS methods require supervised training with manual annotation
- Focus on fully unsupervised VOS, no human annotation needed
- Motion segmentation uses motion information to separate foreground and background
- Appearance-based UVOS methods use raw images as input to detect objects without relative motion
Problem setting
- Let I t ∈ R 3×h×w be the t th frame from a sequence of T RGB frames
- Objective of UVOS is a binary segmentation mask M t ∈ {0, 1} h×w for each timestep t
- Mean Jaccard index J (i.e., mean IoU) between predicted segmentation mask M t and ground truth G t
- No human-annotated data used throughout training and inference
- Use off-the-shelf optical flow model RAFT to provide motion cues between consecutive frames
- Piecewise constant pathway and learnable residual pathway to form the final flow prediction Ft
- Training objective of stage 1 is to minimize the L1 loss between the predicted reconstruction flow Ft and target flow F t
- Appearance supervision with low-level cues and semantic constraint
- Motion-semantic alignment metric to quantify segmentation quality
- Select hyperparam values with highest mean IoU
Methods
Experiments
Datasets
- Evaluated methods on 3 datasets commonly used to benchmark UVOS
- Dataset DAVIS2016 contains 50 video sequences of 3,455 frames
- Performance evaluated on validation set of 20 videos with annotation at 480p resolution
- SegTrackv2 contains 14 videos of different resolutions with 976 annotated frames
- FBMS59 contains 59 videos with 13,860 frames and 720 frames annotated with a fixed interval
- Merged multiple foreground objects in STv2 and FBMS59 into one mask
- Trained on all unlabeled videos
- Evaluation metric is mean Jac-card index J (mIoU)
Unsupervised video object segmentation
- Uses ResNet50 backbone and two heads with 3 Conv-BN-ReLU layers
- Object channels determined without human annotation
- RAFT model trained on synthetic datasets without human annotation
- IMAS outperforms previous methods by a large margin
- IMAS surpasses previous state-of-the-art method by 5.6% without post-processing
Motion-semantic alignment for hyperparam tuning
- Motion-semantic alignment is used to tune two hyperparameters
- Increasing the number of channels improves segmentation quality, but saturates at C = 4
- Object channel index c o needs to be obtained at the end of each training run
- Leverage redundancy in video sequences and use only first frame of each video sequence to find c o
Ablation study
- Residual pathway contributes 5.4%
- Feature merging contributes 3.4%
- Explicit appearance refinement boosts performance by 10.1%
- CRF post-processing boosts performance by 12.2%
Visualizations and discussions
- IMAS is compared to [26,50]
- IMAS can handle complex foreground motion, distracting background motion, and camera motion
- IMAS has limitations, such as not working when neither motion nor appearance provides informative signals
- IMAS can recognize multiple foreground objects when they move in sync, but sometimes only captures one when they move differently
- IMAS is not designed to separate multiple foreground objects
Summary
- Presents IMAS, an unsupervised video object segmentation method
- Leverages motion-appearance synergy
- Object discovery stage with conflict-resolving learnable residual pathway
- Refinement stage with appearance supervision
- Motion-semantic alignment as annotation-free hyperparam tuning method
Applying motion-semantic alignment on previous work
- Our hyperparam tuning method is model-agnostic.
- Our method finds the same optimal number of channels and object channel index as using the validation set performance with human annotation.
Additional implementation details
- Treat video frame pair {t, t + 1} as both a forward and backward action
- Use symmetric loss to apply loss function both forward and backward
- Use random crop augmentation at training and original image at test
- Resize images for STv2 and FBMS59
- Use pixel-wise photo-metric transformation for augmentation
- Change head of ResNet by fusing feature from first and third residual block
- Load self-supervised ImageNet pretrained weights
- Model non-uniform 2D flow from object rotation in 3D
- Capture multiple objects in a foreground group
- Robust to misleading common motion and camera motion
- Select to focus on one of the foreground objects if one has significantly larger motion
- Tuned hyperparams from unsupervised motion-semantic alignment
- Discard potential constraint if match is greater than 80% of width or 90% of height
Per-sequence results
- Results on DAVIS16 listed in Table 6
- Results on STv2 listed in Table 7
- Results on FBMS59 listed in Table 8
Future directions
- Method does not leverage temporal consistency
- Could be more robust by using neighboring frames
- Temporal consistency measures could be taken care of with additional loss term or post-processing
- Does not support segmenting multiple parts of the foreground
- Generates high-quality segmentation and is robust to uninformative or misleading motion signals
- Framework has a motion supervision module and an appearance-based refinement stage
- Semantic constraint mitigates false positives from misleading motion signals
- Selects objectness channel with motion-semantic alignment
- Tuned hyperparams from motion-semantic alignment align with ones from human annotation
- Robust to uninformative or misleading motion cues
- Model agnostic hyperparam tuning technique