Abstract

  • Performance of video prediction has been improved by deep neural networks
  • Current methods require extra inputs for better performance
  • The Dynamic Multi-scale Voxel Flow Network (DMVFN) is proposed to achieve better video prediction performance with only RGB images
  • DMVFN has a differentiable routing module to perceive motion scales of video frames
  • DMVFN is faster than Deep Voxel Flow and surpasses OPT on generated image quality

Paper Content

Introduction

  • Aim to predict future video frames from current ones
  • Benefits representation learning and downstream forecasting tasks
  • Video prediction studied in academia and industry
  • Challenging due to diverse and complex motion patterns
  • Early methods use recurrent neural networks
  • Semantic/instance maps used for semantically coherent motion estimation
  • OPT uses only RGB images to estimate optical flow
  • Need to develop single model for multiscale motion estimation
  • DMVFN proposed to model complex motion cues of diverse scales
  • Routing Module to adaptively generate routing vector
  • Experiments on four benchmarks show state-of-the-art results

Video prediction

  • Early video prediction methods used only RGB frames as inputs
  • Later methods used extra information for better performance
  • This paper develops a lightweight and efficient video prediction network that requires only RGB images as inputs

Optical flow

  • Optical flow estimation is used to measure motion between frames
  • Deep learning-based optical flow models have steadily improved since FlowNet
  • FlowNet 2.0 uses subnetworks for iterative refinement
  • SPyNet uses a coarse-to-fine spatial pyramid network
  • PWC-Net uses feature warping and a cost volume layer
  • RAFT uses a lightweight recurrent network
  • FlowFormer uses an encoder and a recurrent decoder
  • Optical flow is used for video prediction tasks

Dynamic network

  • Dynamic networks are divided into three categories: spatial-wise, temporal-wise, and sample-wise
  • Spatial-wise dynamic networks reduce computational redundancy and keep performance comparable
  • Temporal-wise dynamic networks improve efficiency by performing less or no computation on unimportant sequence elements
  • Sample-wise dynamic networks adaptively change network parameters or structures to reduce extra computation

Methodology

Background

  • Video prediction aims to predict future frames given a sequence of past frames.
  • The input of the video prediction model is two consecutive frames.
  • The learning objective is to minimize the difference between the predicted frame and the “ground truth” frame.
  • Pixel-wise backward warping uses the estimated optical flow to warp one frame toward another.
  • A fusion map is used to fuse pixels from two frames.
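
The warping and fusion steps above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes single-channel frames stored as nested lists, nearest-neighbour sampling with border clamping (practical implementations typically use bilinear sampling), and a fusion map with values in [0, 1]. The names backward_warp and fuse are hypothetical.

```python
def backward_warp(image, flow):
    """Warp `image` with a per-pixel flow field (nearest-neighbour sampling).

    image: H x W list of floats (one channel, for brevity).
    flow:  H x W list of (dx, dy); output pixel (x, y) reads image at (x + dx, y + dy).
    """
    height, width = len(image), len(image[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling location to the image border (an assumption).
            src_x = min(max(int(round(x + dx)), 0), width - 1)
            src_y = min(max(int(round(y + dy)), 0), height - 1)
            warped[y][x] = image[src_y][src_x]
    return warped


def fuse(warped_prev, warped_curr, fusion_map):
    """Blend two warped frames per pixel; fusion_map weights the first frame."""
    return [
        [m * a + (1.0 - m) * b for a, b, m in zip(row_a, row_b, row_m)]
        for row_a, row_b, row_m in zip(warped_prev, warped_curr, fusion_map)
    ]


# Toy example: a bright pixel moves one position to the right per frame.
frame_prev = [[0.0, 1.0, 0.0, 0.0]]   # frame t-1
frame_curr = [[0.0, 0.0, 1.0, 0.0]]   # frame t
flow_to_prev = [[(-2.0, 0.0)] * 4]    # flow from the predicted frame to frame t-1
flow_to_curr = [[(-1.0, 0.0)] * 4]    # flow from the predicted frame to frame t
fusion_map = [[0.5] * 4]

predicted = fuse(
    backward_warp(frame_prev, flow_to_prev),
    backward_warp(frame_curr, flow_to_curr),
    fusion_map,
)
```

In this toy case both warped frames place the bright pixel at the last position, so the fused prediction continues the rightward motion.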

Dynamic multi-scale voxel flow network

  • The Multi-scale Voxel Flow Block (MVFB) estimates voxel flow end-to-end without introducing new components or constraints.
  • Each MVFB has a two-branch structure to capture large motion while preserving spatial information.
  • DMVFN stacks 9 MVFBs, each assigned a scaling factor.
  • Dynamic routing selects a sub-network of MVFBs based on the input sample, so each sample can take its own path through the blocks.
  • The Routing Module predicts a routing vector v for each input sample.
  • Each selected MVFB refines the current voxel flow estimate into a new one.
  • The routing vector v identifies which MVFBs form the sub-network.
  • The Gumbel-Softmax technique makes the routing probability learnable.
  • The Straight-Through Estimator (STE) makes the binary routing vector differentiable.
  • A hyperparameter β controls the complexity of DMVFN.
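
The routing idea above can be sketched in plain Python. This is a simplified illustration, not the paper's code: the logits, the toy blocks, and the function names are hypothetical, and the gradient is written out by hand because no autograd library is used. The sampling step uses the two-class form of Gumbel-Softmax, where the difference of two Gumbel noises is a logistic noise.

```python
import math
import random


def sample_routing_vector(logits, temperature=1.0, rng=random):
    """Draw a binary routing vector with one entry per block.

    logits[i] is a hypothetical score for executing block i. The sigmoid of
    (logit + logistic noise) / temperature is the two-class Gumbel-Softmax
    relaxation; thresholding it gives the hard forward-pass decision.
    """
    soft, hard = [], []
    for logit in logits:
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        logistic_noise = math.log(u) - math.log(1.0 - u)
        p = 1.0 / (1.0 + math.exp(-(logit + logistic_noise) / temperature))
        soft.append(p)
        hard.append(1 if p > 0.5 else 0)
    return hard, soft


def ste_gradient(upstream_grad, p, temperature=1.0):
    """Straight-through estimate of the gradient w.r.t. the logit.

    The hard threshold is treated as the identity in the backward pass, so
    only the derivative of the relaxed probability p remains.
    """
    return upstream_grad * p * (1.0 - p) / temperature


def run_blocks(blocks, routing, flow):
    """Refine `flow` with the selected blocks only; skipped blocks are not run."""
    for block, keep in zip(blocks, routing):
        if keep:
            flow = block(flow)
    return flow


# Nine toy "blocks" that each add 1, so the result counts the executed blocks.
toy_blocks = [lambda flow: flow + 1 for _ in range(9)]
executed = run_blocks(toy_blocks, [1, 0, 1, 0, 0, 1, 0, 0, 1], 0)

hard, soft = sample_routing_vector([0.0] * 9, rng=random.Random(0))
```

Skipping a block removes its cost entirely, which is how a sample with simple motion can use fewer blocks than one with complex motion.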

Implementation details

  • Loss function is the sum of reconstruction losses of outputs of each block
  • Training strategy uses AdamW optimizer with a weight decay of 10^-4
  • Batch size is 64 and image patches are 224×224
  • Learning rate is reduced from 10^-4 to 10^-5
  • Model is trained on four 2080Ti GPUs for 300 epochs, taking about 35 hours
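
The loss and learning-rate bullets above can be sketched as follows. Two choices here are assumptions and are not stated in this summary: the reconstruction loss is taken to be a mean absolute (L1) error, and the learning rate is decayed with a cosine curve. The function names are hypothetical.

```python
import math


def l1_distance(prediction, target):
    """Mean absolute error between two flat lists of pixel values (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(target)


def total_loss(block_outputs, target):
    """Sum the reconstruction loss over the frame predicted after each block."""
    return sum(l1_distance(output, target) for output in block_outputs)


def learning_rate(step, total_steps, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max to lr_min (the decay shape is an assumption)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Toy example: three blocks whose predictions approach the target frame.
target = [1.0, 0.0]
block_outputs = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
loss = total_loss(block_outputs, target)   # 0.5 + 0.25 + 0.0
```

Supervising every block's output, rather than only the last one, means that any sub-network picked by the routing vector has been trained to produce a usable frame.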

Experiments

Dataset and metric

  • We use several datasets for experiments
  • Previous methods cannot accurately predict a car's location in long-term prediction
  • The motion predicted by DMVFN is the most similar to the ground truth
  • We use Cityscapes and KITTI datasets for training and testing
  • We use MS-SSIM and LPIPS for quantitative evaluation and GFLOPs for model complexity

Comparison to state-of-the-arts

  • DMVFN compared to state-of-the-art video prediction methods
  • DMVFN achieves better results than other methods in short-term and long-term video prediction
  • DMVFN reduces GFLOPs while maintaining comparable performance
  • DMVFN is 4.06x faster than STRPM [7]
  • DMVFN predicts frames that have better temporal continuity and are more consistent with the ground truth

Ablation study

  • Performed ablation studies to study effectiveness of components in DMVFN
  • Tested DMVFN on Cityscapes and KITTI datasets
  • Divided Vimeo-90K testing set into three subsets to verify DMVFN can perceive motion scales
  • Tested DMVFN on same video with different time intervals
  • Selected 103 video sequences from KITTI dataset to study how blocks are selected
  • Compared DMVFN with different differentiable routing methods
  • Evaluated DMVFN with different scaling factors
  • Verified effectiveness of spatial path with or without routing

Conclusion