Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Current self-supervised monocular depth estimation methods rely on estimating a rigid-body motion representing camera motion.
  • These methods suffer from the scale ambiguity problem.
  • DepthP+P is a method that learns to estimate outputs in metric scale.
  • DepthP+P aligns two frames using a common ground plane to remove the effect of the rotation component.
  • Two neural networks are used to predict the depth and camera translation.
  • By assuming a known camera height, the induced 2D image motion of a 3D point can be used to reconstruct the target image.
  • Experiments on the KITTI driving dataset show that DepthP+P can be a metrically accurate alternative to current methods.

Paper Content

Introduction

  • Human beings can easily reason about their surroundings and decompose them into different objects.
  • Autonomous vehicles need this ability to drive in different environments.
  • Training deep networks for estimating depth has been successful in computer vision research, but requires ground truth depth or stereo setup.
  • Self-supervised monocular depth estimation methods do not rely on stereo supervision and can use unlabeled videos for training.
  • These methods use a pose network to estimate the ego-motion between a source frame and the target frame and a depth network to estimate the depth of the target image.
  • Our approach uses the traditional planar parallax formulation to decompose the motion into a planar homography and a residual parallax.
  • We estimate the depth of each pixel with a monocular depth network and back-project them into 3D.
  • We calculate the perpendicular distance of each point to the road and estimate the translation between the camera origins.
  • Our approach is able to predict metrically accurate depth without needing ground truth depth data.
  • Previous monocular depth methods can estimate depth and motion up to a scale.
  • Self-supervised monocular depth estimation models suffer from the scale ambiguity problem.
  • There are a few methods that do not require additional supervision and only use the camera height to achieve depth estimations in metric units.
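
The back-projection and distance-to-road steps mentioned above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the intrinsics `K`, the road normal `[0, 1, 0]` (y axis pointing down), the 1.65 m camera height, and the helper names `backproject` and `height_above_plane` are all hypothetical choices made for the example.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to (H, W, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                  # K^-1 [u, v, 1]^T for every pixel
    return rays * depth[..., None]                   # scale each ray by its depth

def height_above_plane(points, normal, cam_height):
    """Perpendicular distance of each point from the plane normal . X = cam_height."""
    return cam_height - points @ normal

# Assumed values for illustration only.
K = np.array([[720.0, 0.0, 620.0],
              [0.0, 720.0, 187.0],
              [0.0, 0.0, 1.0]])
normal = np.array([0.0, 1.0, 0.0])   # assumed: y axis points down toward the road
cam_height = 1.65                    # assumed camera height in metres

depth = np.full((4, 6), 10.0)        # toy depth map, 10 m everywhere
points = backproject(depth, K)
heights = height_above_plane(points, normal, cam_height)
```

Because the camera height is given in metres, the heights computed this way are metric as long as the depth map is metric, which is the link the method relies on.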

Planar parallax

  • The planar parallax paradigm is used to understand the 3D structure of a scene from multiple images.
  • Sawhney proposes a formulation for the residual parallax in terms of depth and distance to the plane.
  • Irani et al. use this formulation to derive a rigidity constraint between pairs of points over multiple images.
  • Irani et al. also derive trifocal constraints and use them to propose a method for new view synthesis.

Methodology

  • Self-supervised monocular depth estimation approaches have been successful but suffer from scale ambiguity.
  • The median scaling approach, which rescales each prediction using the ground truth median depth, is used to evaluate and compare these methods.
  • The proposed approach predicts depth maps in metric scale without ground truth depth supervision.
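
For readers unfamiliar with median scaling, here is a minimal sketch of the usual evaluation trick for scale-ambiguous predictions. The function name `median_scale` and the convention that zero marks a missing ground truth pixel are assumptions made for this example.

```python
import numpy as np

def median_scale(pred, gt):
    """Rescale a prediction so its median matches the ground truth median.

    Pixels with gt <= 0 are treated as missing (assumed convention).
    """
    valid = gt > 0
    scale = np.median(gt[valid]) / np.median(pred[valid])
    return pred * scale, scale

pred = np.array([[1.0, 2.0], [3.0, 4.0]])    # depth up to an unknown scale
gt = np.array([[3.0, 6.0], [0.0, 12.0]])     # metric depth, 0 = no measurement
scaled, scale = median_scale(pred, gt)
```

A method that already predicts metric depth can be evaluated without this step, since the scale factor would ideally be close to 1.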

DepthP+P

  • Our approach is based on the planar parallax decomposition.
  • We introduce the notation and build our method to predict depth.
  • We warp the source image I_s and obtain the aligned image I_w.
  • We calculate the displacement between the aligned-image point p_w and the target-image point p.
  • We have two networks, one for estimating depth and one for estimating the translation between frames.
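
The alignment step can be checked numerically with a small sketch. It assumes the convention that a target-frame point X maps to the source frame as R X + t, and that the road satisfies n . X = h in the target frame; under that convention the plane-induced homography is K (R + t nᵀ / h) K⁻¹. The function names, the intrinsics, and the numeric values are hypothetical, and the sign conventions may differ from those in the paper.

```python
import numpy as np

def plane_homography(K, R, t, normal, cam_height):
    """Homography taking target pixels to source pixels for points on the plane."""
    return K @ (R + np.outer(t, normal) / cam_height) @ np.linalg.inv(K)

def project(K, point):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    p = K @ point
    return p[:2] / p[2]

def warp_pixel(H, pixel):
    """Apply a 3x3 homography to a 2D pixel."""
    q = H @ np.append(pixel, 1.0)
    return q[:2] / q[2]

def residual_parallax(K, R, t, normal, cam_height, point):
    """Displacement p_w - p that remains after aligning the source frame to the plane."""
    p = project(K, point)                    # pixel in the target image
    p_s = project(K, R @ point + t)          # pixel in the source image
    H = plane_homography(K, R, t, normal, cam_height)
    p_w = warp_pixel(np.linalg.inv(H), p_s)  # pixel in the aligned image
    return p_w - p

def rot_y(angle):
    """Rotation about the camera y axis (yaw)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Assumed values for illustration only.
K = np.array([[720.0, 0.0, 620.0], [0.0, 720.0, 187.0], [0.0, 0.0, 1.0]])
normal, cam_height = np.array([0.0, 1.0, 0.0]), 1.65
R, t = rot_y(0.02), np.array([0.0, 0.0, -1.0])

on_road = np.array([1.0, 1.65, 10.0])    # lies on the plane
off_road = np.array([1.0, 0.5, 10.0])    # 1.15 m above the plane

res_on = residual_parallax(K, R, t, normal, cam_height, on_road)
res_off = residual_parallax(K, R, t, normal, cam_height, off_road)
res_rot_only = residual_parallax(K, R, np.zeros(3), normal, cam_height, off_road)
```

Points on the road show no residual motion after alignment, and a pure rotation leaves no residual for any point. What remains for off-plane points depends only on the translation, the point's depth, and its height above the plane, which is why a translation network is sufficient once the frames are aligned.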

Self-supervised training loss

  • The photometric loss function is a linear combination of the L1 distance and SSIM.
  • The photometric loss is calculated between the target image and the reconstructed image.
  • The per-pixel minimum reprojection error is used.
  • The total loss is a combination of the photometric loss and a smoothness loss.
  • The minimum of the photometric loss is taken across the previous and next aligned images.
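
The photometric term and the per-pixel minimum can be sketched as follows. This is a simplified NumPy illustration for grayscale images in [0, 1], not the training code: the weight `alpha = 0.85`, the 3x3 box window, the edge padding, and the SSIM constants are assumptions borrowed from common practice, and the smoothness term is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_mean(x, k=3):
    """Mean over k x k windows of an (H, W) image, with edge padding."""
    xp = np.pad(x, k // 2, mode="edge")
    return sliding_window_view(xp, (k, k)).mean(axis=(-2, -1))

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Local SSIM map between two (H, W) images."""
    mu_x, mu_y = box_mean(x), box_mean(y)
    var_x = box_mean(x * x) - mu_x ** 2
    var_y = box_mean(y * y) - mu_y ** 2
    cov = box_mean(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric(target, recon, alpha=0.85):
    """Per-pixel weighted sum of an SSIM dissimilarity and the L1 distance."""
    l1 = np.abs(target - recon)
    dssim = np.clip((1.0 - ssim(target, recon)) / 2.0, 0.0, 1.0)
    return alpha * dssim + (1.0 - alpha) * l1

def min_reprojection_loss(target, recons):
    """Per-pixel minimum over reconstructions, averaged over the image."""
    per_source = np.stack([photometric(target, r) for r in recons])
    return per_source.min(axis=0).mean()

rng = np.random.default_rng(0)
target = rng.random((8, 8))
bad = 1.0 - target                        # a deliberately poor reconstruction
loss_bad = min_reprojection_loss(target, [bad])
loss_min = min_reprojection_loss(target, [bad, target])
```

Taking the minimum per pixel lets each pixel be explained by whichever neighbouring frame reconstructs it best, which reduces the penalty at pixels that are occluded in one of the frames.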

Network architecture

  • A U-Net architecture is used for the depth network.
  • A ResNet pre-trained on ImageNet is used as the encoder.
  • The decoder is similar to the one used by [10].
  • The output of the last sigmoid layer is multiplied by 250 to estimate depth.
  • The input is a single target image.
  • The output is a per-pixel depth estimate in metric scale.
  • The second network takes two images and outputs a 3-element vector representing the translation in metric scale.
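
The output scaling of the depth network amounts to the mapping below. It is a sketch of that final step only; the names `MAX_DEPTH` and `logits_to_depth` are hypothetical, and the interpretation of 250 as an upper bound on depth follows from the sigmoid range rather than from an explicit statement in the text.

```python
import numpy as np

MAX_DEPTH = 250.0   # scale factor applied to the sigmoid output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logits_to_depth(logits, max_depth=MAX_DEPTH):
    """Map raw decoder outputs to depth values in the open interval (0, max_depth)."""
    return max_depth * sigmoid(logits)

logits = np.array([-5.0, 0.0, 5.0])   # toy decoder outputs
depth = logits_to_depth(logits)
```

Predicting depth directly in a fixed range keeps the output bounded, so the metric scale has to come from the training signal (the known camera height) rather than from a post-hoc rescaling.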

Experiments

  • We use the Eigen split of the KITTI dataset to train and evaluate our model.
  • We use 45000 training and 1769 validation samples.
  • We evaluate our model on 697 test images using original and improved ground truth.
  • We pre-process the dataset by calculating homography between consecutive frames.

Depth estimation results

  • DepthP+P is a deep learning model trained with view synthesis through the planar parallax paradigm.
  • Previous methods are trained to estimate pose, whereas DepthP+P takes a novel approach.
  • DepthP+P achieves significantly better results than the initial models.
  • Several improvements have since been proposed to raise the performance of SfMLearner.
  • DepthP+P can estimate depth in metric scale.
  • Adding stereo supervision improves the performance of DepthP+P.
  • DepthP+P is comparable to Monodepth2 with a ResNet50 backbone.

Conclusion and future work

  • The paper presents a new approach to self-supervised monocular depth estimation.
  • The approach uses a known camera height to produce metrically accurate depth estimates.
  • It only needs to estimate the camera translation, not the full rigid-body motion.
  • This is an advantage over other scale-aware depth prediction methods.
  • The approach unlocks the potential of plane and parallax for efficient and metrically accurate depth estimation.
  • A future direction is detecting moving foreground objects.
  • A derivation of the residual parallax is provided.
  • The effect of using a more accurate estimate of the road's normal vector is investigated.
  • Additional qualitative results of the model are shown.
  • Quantitative results on KITTI show that additional stereo supervision improves performance.