Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Performance of video prediction has been improved by deep neural networks
- Current methods require extra inputs for better performance
- DMVFN proposed to achieve better video prediction performance with only RGB images
- DMVFN has a differentiable routing module to perceive motion scales of video frames
- DMVFN is faster than Deep Voxel Flow and surpasses OPT on generated image quality
Paper Content
Introduction
- Aim to predict future video frames from current ones
- Benefits representation learning and downstream forecasting tasks
- Video prediction studied in academia and industry
- Challenging due to diverse and complex motion patterns
- Early methods use recurrent neural networks
- Semantic/instance maps used for semantically coherent motion estimation
- OPT uses only RGB images to estimate optical flow
- Need to develop single model for multiscale motion estimation
- DMVFN proposed to model complex motion cues of diverse scales
- Routing Module to adaptively generate routing vector
- Experiments on four benchmarks show state-of-the-art results
Related work
Video prediction
- Early video prediction methods used only RGB frames as inputs
- Later methods used extra information for better performance
- This paper develops a light-weight and efficient video prediction network that requires only sRGB images as inputs
Optical flow
- Optical flow estimation is used to measure motion between frames
- Deep learning-based optical flow models have been improved since Flownet
- Flownet2.0 uses subnetworks for iterative refinement
- SPynet uses a coarse-to-fine spatial pyramid network
- PWC-Net uses feature warping and a cost volume layer
- RAFT uses a lightweight recurrent network
- Flow-Former uses an encoder and a recurrent decoder
- Optical flow is used for video prediction tasks
Dynamic network
- Dynamic networks are divided into three categories: spatial-wise, temporal-wise, and sample-wise
- Spatial-wise dynamic networks reduce computational redundancy and keep performance comparable
- Temporal-wise dynamic networks improve efficiency by performing less or no computation on unimportant sequence elements
- Sample-wise dynamic networks adaptively change network parameters or structures to reduce extra computation
Methodology
Background
- Video prediction aims to predict future frames given a sequence of past frames.
- The input of the video prediction model is two consecutive frames.
- The learning objective is to minimize the difference between the predicted frame and the “ground truth” frame.
- Pixel-wise backward warping is used to estimate optical flow from one frame to another.
- A fusion map is used to fuse pixels from two frames.
Dynamic multi-scale voxel flow network
- MVFB estimates voxel flow end-to-end without introducing new components or constraints.
- MVFB has a two-branch network structure to capture large motion while preserving spatial information.
- DMVFN contains 9 MVFBs with scaling factors.
- DMVFN has dynamic routing to select sub-network based on input sample.
- DMVFN can choose paths freely.
- Routing Module predicts routing vector v for each input sample.
- MVFB refines current voxel flow estimation to a new one.
- Routing vector v identifies proper sub-network.
- Gumbel Softmax technique used to make routing probability learnable.
- STE used to make binary dynamic routing vector differentiable.
- β used to control complexity of DMVFN.
Implementation details
- Loss function is the sum of reconstruction losses of outputs of each block
- Training strategy uses AdamW optimizer with a weight decay of 10-4
- Batch size is 64 and image patches are 224x224
- Learning rate is reduced from 10-4 to 10-5
- Model is trained on four 2080Ti GPUs for 300 epochs, taking about 35 hours
Experiments
Dataset and metric
- We use several datasets for experiments
- Previous methods cannot accurately predict car’s location in long-term prediction
- DMVFN motion is most similar to ground truth
- We use Cityscapes and KITTI datasets for training and testing
- We use MS-SSIM and LPIPS for quantitative evaluation and GFLOPs for model complexity
Comparison to state-of-the-arts
- DMVFN compared to state-of-the-art video prediction methods
- DMVFN achieves better results than other methods in short-term and long-term video prediction
- DMVFN reduces GFLOPs while maintaining comparable performance
- DMVFN is 4.06x faster than STRPM [7]
- DMVFN predicts frames with better temporal continuity and more consistent with ground truth
Ablation study
- Performed ablation studies to study effectiveness of components in DMVFN
- Tested DMVFN on Cityscapes and KITTI datasets
- Divided Vimeo-90K testing set into three subsets to verify DMVFN can perceive motion scales
- Tested DMVFN on same video with different time intervals
- Selected 103 video sequences from KITTI dataset to study how blocks are selected
- Compared DMVFN with different differentiable routing methods
- Evaluated DMVFN with different scaling factors
- Verified effectiveness of spatial path with or without routing