Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Current self-supervised monocular depth estimation methods rely on estimating a rigid-body motion representing camera motion.
- These methods suffer from the scale ambiguity problem.
- DepthP+P is a method that learns to estimate outputs in metric scale.
- DepthP+P aligns two frames using a common ground plane to remove the effect of the rotation component.
- Two neural networks are used to predict the depth and camera translation.
- By assuming a known camera height, the induced 2D image motion of a 3D point can be used to reconstruct the target image.
- Experiments on the KITTI driving dataset show that DepthP+P can be a metrically accurate alternative to current methods.
Paper Content
Introduction
- Human beings can easily reason about their surroundings and decompose them into different objects.
- Autonomous vehicles need this ability to drive in different environments.
- Training deep networks for estimating depth has been successful in computer vision research, but requires ground truth depth or stereo setup.
- Self-supervised monocular depth estimation methods do not rely on stereo supervision and can use unlabeled videos for training.
- These methods use a pose network to estimate the ego-motion between a source frame and the target frame and a depth network to estimate the depth of the target image.
- Our approach uses the traditional planar parallax formulation to decompose the motion into a planar homography and a residual parallax.
- We estimate the depth of each pixel with a monocular depth network and back-project them into 3D.
- We calculate the perpendicular distance of each point to the road and estimate the translation between the camera origins.
- Our approach is able to predict metric accurate depth without needing ground truth depth data.
- Previous monocular depth methods can estimate depth and motion up to a scale.
- Self-supervised monocular depth estimation models suffer from the scale ambiguity problem.
- There are a few methods that do not require additional supervision and only use the camera height to achieve depth estimations in metric units.
Planar parallax
- Planar Parallax paradigm is used to understand 3D structure of a scene from multiple images
- Sawhney proposes a formulation for the residual parallax using depth and distance to the plane
- Irani et al. use formulation to derive rigidity constraint between pairs of points over multiple images
- Irani et al. derive trifocal constraints and use them to propose a method for new view synthesis
Methodology
- Self-supervised monocular depth estimation approaches have been successful but suffer from scale ambiguity.
- Median scaling approach is used to evaluate and compare these methods.
- Proposed approach predicts depth maps in metric scale without ground truth depth supervision.
Depthp+p
- Our approach is based on the Planar Parallax decomposition
- We introduce the notation and build our method to predict depth
- We warp the source image I s and obtain the aligned image I w
- We calculate the displacement between p w and p
- We have two networks, one for estimating depth and one for estimating the translation between frames
Self-supervised training loss
- Photometric loss function is a linear combination of L1 distance and SSIM
- Photometric loss is calculated between target image and reconstructed image
- Per-pixel minimum reprojection error is used
- Total loss is a combination of photometric and smoothness loss
- Minimum of the photometric loss is calculated across previous and next aligned images
Network architecture
- U-Net architecture is used for the depth network
- ResNet pre-trained on ImageNet is used as the encoder
- Decoder is similar to one used by [10]
- Output of the last sigmoid layer is multiplied by 250 to estimate depth
- Input is a single target image
- Output is per-pixel depth estimates in metric scale
- Second network takes two images and outputs a 3-element vector representing the translation in metric scale
Experiments
- We use the Eigen split of the KITTI dataset to train and evaluate our model.
- We use 45000 training and 1769 validation samples.
- We evaluate our model on 697 test images using original and improved ground truth.
- We pre-process the dataset by calculating homography between consecutive frames.
Depth estimation results
- Deep learning model trained with view synthesis through planar parallax paradigm
- Previous methods trained to estimate pose, DepthP+P novel approach
- DepthP+P achieves significantly better results than initial models
- Improvements proposed to improve SfMLearner performance
- DepthP+P can estimate depth in metric scale
- Stereo supervision improves performance of DepthP+P
- DepthP+P comparable to Monodepth2 with ResNet50 backbone
Conclusion and future work
- Presented a new approach to self-supervised monocular depth estimation
- Uses a known camera height to produce metrically accurate depth estimates
- Only needs to estimate camera translation, not full rigid-body motion
- Advantage over other scale-aware depth prediction methods
- Unlocks potential of plane and parallax for efficient and metric-accurate depth estimation
- Future direction of detecting moving foreground objects
- Provided derivation of residual parallax
- Investigated effect of using more accurate estimation of normal vector of road
- Additional qualitative results of model
- Quantitative results on KITTI with additional stereo supervision show improved results