Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Current self-supervised monocular depth estimation methods rely on estimating a rigid-body motion representing camera motion.
These methods suffer from the scale ambiguity problem.
DepthP+P is a method that learns to estimate outputs in metric scale.
DepthP+P aligns two frames using a common ground plane to remove the effect of the rotation component.
Two neural networks are used to predict the depth and camera translation.
By assuming a known camera height, the induced 2D image motion of a 3D point can be used to reconstruct the target image.
Experiments on the KITTI driving dataset show that DepthP+P can be a metrically accurate alternative to current methods.

Human beings can easily reason about their surroundings and decompose them into different objects.
Autonomous vehicles need this ability to drive in different environments.
Training deep networks for estimating depth has been successful in computer vision research, but requires ground truth depth or stereo setup.
Self-supervised monocular depth estimation methods do not rely on stereo supervision and can use unlabeled videos for training.
These methods use a pose network to estimate the ego-motion between a source frame and the target frame and a depth network to estimate the depth of the target image.
Our approach uses the traditional planar parallax formulation to decompose the motion into a planar homography and a residual parallax.
We estimate the depth of each pixel with a monocular depth network and back-project them into 3D.
We calculate the perpendicular distance of each point to the road and estimate the translation between the camera origins.
Our approach is able to predict metric accurate depth without needing ground truth depth data.
Previous monocular depth methods can estimate depth and motion up to a scale.
Self-supervised monocular depth estimation models suffer from the scale ambiguity problem.
There are a few methods that do not require additional supervision and only use the camera height to achieve depth estimations in metric units.

Planar Parallax paradigm is used to understand 3D structure of a scene from multiple images
Sawhney proposes a formulation for the residual parallax using depth and distance to the plane
Irani et al. use formulation to derive rigidity constraint between pairs of points over multiple images
Irani et al. derive trifocal constraints and use them to propose a method for new view synthesis

Self-supervised monocular depth estimation approaches have been successful but suffer from scale ambiguity.
Median scaling approach is used to evaluate and compare these methods.
Proposed approach predicts depth maps in metric scale without ground truth depth supervision.

Our approach is based on the Planar Parallax decomposition
We introduce the notation and build our method to predict depth
We warp the source image I s and obtain the aligned image I w
We calculate the displacement between p w and p
We have two networks, one for estimating depth and one for estimating the translation between frames

Photometric loss function is a linear combination of L1 distance and SSIM
Photometric loss is calculated between target image and reconstructed image
Per-pixel minimum reprojection error is used
Total loss is a combination of photometric and smoothness loss
Minimum of the photometric loss is calculated across previous and next aligned images

U-Net architecture is used for the depth network
ResNet pre-trained on ImageNet is used as the encoder
Decoder is similar to one used by [10]
Output of the last sigmoid layer is multiplied by 250 to estimate depth
Input is a single target image
Output is per-pixel depth estimates in metric scale
Second network takes two images and outputs a 3-element vector representing the translation in metric scale

We use the Eigen split of the KITTI dataset to train and evaluate our model.
We use 45000 training and 1769 validation samples.
We evaluate our model on 697 test images using original and improved ground truth.
We pre-process the dataset by calculating homography between consecutive frames.

Deep learning model trained with view synthesis through planar parallax paradigm
Previous methods trained to estimate pose, DepthP+P novel approach
DepthP+P achieves significantly better results than initial models
Improvements proposed to improve SfMLearner performance
DepthP+P can estimate depth in metric scale
Stereo supervision improves performance of DepthP+P
DepthP+P comparable to Monodepth2 with ResNet50 backbone

Presented a new approach to self-supervised monocular depth estimation
Uses a known camera height to produce metrically accurate depth estimates
Only needs to estimate camera translation, not full rigid-body motion
Advantage over other scale-aware depth prediction methods
Unlocks potential of plane and parallax for efficient and metric-accurate depth estimation
Future direction of detecting moving foreground objects
Provided derivation of residual parallax
Investigated effect of using more accurate estimation of normal vector of road
Additional qualitative results of model
Quantitative results on KITTI with additional stereo supervision show improved results