Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Performance of video prediction has been improved by deep neural networks
Current methods require extra inputs for better performance
DMVFN proposed to achieve better video prediction performance with only RGB images
DMVFN has a differentiable routing module to perceive motion scales of video frames
DMVFN is faster than Deep Voxel Flow and surpasses OPT on generated image quality

Paper Content

Introduction

Aim to predict future video frames from current ones
Benefits representation learning and downstream forecasting tasks
Video prediction studied in academia and industry
Challenging due to diverse and complex motion patterns
Early methods use recurrent neural networks
Semantic/instance maps used for semantically coherent motion estimation
OPT uses only RGB images to estimate optical flow
Need to develop single model for multiscale motion estimation
DMVFN proposed to model complex motion cues of diverse scales
Routing Module to adaptively generate routing vector
Experiments on four benchmarks show state-of-the-art results

Video prediction

Early video prediction methods used only RGB frames as inputs
Later methods used extra information for better performance
This paper develops a light-weight and efficient video prediction network that requires only sRGB images as inputs

Optical flow

Optical flow estimation is used to measure motion between frames
Deep learning-based optical flow models have been improved since Flownet
Flownet2.0 uses subnetworks for iterative refinement
SPynet uses a coarse-to-fine spatial pyramid network
PWC-Net uses feature warping and a cost volume layer
RAFT uses a lightweight recurrent network
Flow-Former uses an encoder and a recurrent decoder
Optical flow is used for video prediction tasks

Dynamic network

Dynamic networks are divided into three categories: spatial-wise, temporal-wise, and sample-wise
Spatial-wise dynamic networks reduce computational redundancy and keep performance comparable
Temporal-wise dynamic networks improve efficiency by performing less or no computation on unimportant sequence elements
Sample-wise dynamic networks adaptively change network parameters or structures to reduce extra computation

Methodology

Background

Video prediction aims to predict future frames given a sequence of past frames.
The input of the video prediction model is two consecutive frames.
The learning objective is to minimize the difference between the predicted frame and the “ground truth” frame.
Pixel-wise backward warping is used to estimate optical flow from one frame to another.
A fusion map is used to fuse pixels from two frames.

Dynamic multi-scale voxel flow network

MVFB estimates voxel flow end-to-end without introducing new components or constraints.
MVFB has a two-branch network structure to capture large motion while preserving spatial information.
DMVFN contains 9 MVFBs with scaling factors.
DMVFN has dynamic routing to select sub-network based on input sample.
DMVFN can choose paths freely.
Routing Module predicts routing vector v for each input sample.
MVFB refines current voxel flow estimation to a new one.
Routing vector v identifies proper sub-network.
Gumbel Softmax technique used to make routing probability learnable.
STE used to make binary dynamic routing vector differentiable.
β used to control complexity of DMVFN.

Implementation details

Loss function is the sum of reconstruction losses of outputs of each block
Training strategy uses AdamW optimizer with a weight decay of 10-4
Batch size is 64 and image patches are 224x224
Learning rate is reduced from 10-4 to 10-5
Model is trained on four 2080Ti GPUs for 300 epochs, taking about 35 hours

Experiments

Dataset and metric

We use several datasets for experiments
Previous methods cannot accurately predict car’s location in long-term prediction
DMVFN motion is most similar to ground truth
We use Cityscapes and KITTI datasets for training and testing
We use MS-SSIM and LPIPS for quantitative evaluation and GFLOPs for model complexity

Comparison to state-of-the-arts

DMVFN compared to state-of-the-art video prediction methods
DMVFN achieves better results than other methods in short-term and long-term video prediction
DMVFN reduces GFLOPs while maintaining comparable performance
DMVFN is 4.06x faster than STRPM [7]
DMVFN predicts frames with better temporal continuity and more consistent with ground truth

Ablation study

Performed ablation studies to study effectiveness of components in DMVFN
Tested DMVFN on Cityscapes and KITTI datasets
Divided Vimeo-90K testing set into three subsets to verify DMVFN can perceive motion scales
Tested DMVFN on same video with different time intervals
Selected 103 video sequences from KITTI dataset to study how blocks are selected
Compared DMVFN with different differentiable routing methods
Evaluated DMVFN with different scaling factors
Verified effectiveness of spatial path with or without routing

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Link to paper

Abstract

Paper Content

Introduction

Video prediction

Optical flow

Dynamic network

Methodology

Background

Dynamic multi-scale voxel flow network

Implementation details

Experiments

Dataset and metric

Comparison to state-of-the-arts

Ablation study

Conclusion

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Video prediction#

Optical flow#

Dynamic network#

Methodology#

Background#

Dynamic multi-scale voxel flow network#

Implementation details#

Experiments#

Dataset and metric#

Comparison to state-of-the-arts#

Ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Video prediction

Optical flow

Dynamic network

Methodology

Background

Dynamic multi-scale voxel flow network

Implementation details

Experiments

Dataset and metric

Comparison to state-of-the-arts

Ablation study

Conclusion