Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Monocular depth estimation is a challenging task that predicts the pixel-wise depth from a single 2D image.
DiffusionDepth is a new approach that reformulates monocular depth estimation as a denoising diffusion process.
The model learns to reverse the process of diffusing the refined depth of itself into random depth distribution.
DiffusionDepth is superior for generating accurate and highly detailed depth maps.

Paper Content

Introduction

Monocular depth estimation has numerous applications
Mainstream methods employ CNNs for dense per-pixel regression
Follow-up approaches modify the backbone structure for better visual feature
Transformer structures are introduced for higher performance
Pure regression methods suffer from overfitting and unsatisfactory object details
Additional constraints such as uncertainty and planarity prior are used to increase robustness
NewCRFs introduces window-separated CRF to enhance local space relation
DORN and Soft Ordinary discretize depth into intervals and reformulate as classification problem
Follow-up methods merge regression and classification
This paper proposes DiffusionDepth, a novel framework for monocular depth estimation
DiffusionDepth performs iterative denoising process to capture coarse and fine details
Self-diffusion process is used to address sparse ground truth depth problem
DiffusionDepth achieves SOTA performance on KITTI and NYU-Depth-V2 datasets

Monocular Depth Estimation is a task in computer vision to estimate depth map from a single RGB image
Early approaches used Markov random field, more recent approaches use deep convolutional neural networks
Formulated as a dense per-pixel regression problem
Followup approaches focus on modifying the backbone structure to enhance visual features
Transformer structures have been introduced to improve performance
Additional constraints such as uncertainty and piecewise planarity prior have been introduced
Diffusion model has been introduced to refine depth prediction

Methodology

Task reformulation

Diffusion models are a class of latent variable models used for generative tasks.
Neural networks are trained to denoise images blurred with Gaussian noise by learning to reverse the diffusion process.
Depth estimation is reformulated as a visual-condition guided denoising process which refines the depth distribution iteratively.

Network architecture

Swin Transformer is used as an example to illustrate feature extraction
Input image is patched and projected into visual tokens with position embedding
Backbone extracts visual features at different scales
Hierarchical aggregation and heterogeneous interaction is used to enhance features between scales
Feature pyramid neck is used to aggregate features into monocular visual condition
DiffusionDepth model is suitable for most visual backbones

Monocular conditioned denoising block

Depth estimation is formulated as a denoising process.
Neural network model takes visual condition and current depth latent to predict a distribution.
Monocular Conditioned Denoising Block is used to achieve the denoising process.
Visual condition is fused with depth latent through hierarchically.

Diffusion-denosing process

Diffusion process and denoising process are defined in equations
Trainable parameters are conditioned denoising model and visual feature extractors
Model is trained by minimizing loss between diffusion results and denoising prediction
Depth is calculated through equation
Encoder and decoder are trained by minimizing pixel-wise depth loss
Supervision is applied to both latent spaces through L2 loss
Self-diffusion process is used to tackle sparse ground truth depth value problem

Experiment

Experiments conducted on outdoor and indoor scenarios to evaluate proposed DiffusionDepth and its properties
KITTI dataset used for outdoor evaluation, image resolution 1216 × 352 pixels, sparse GT depth (density 3.75% to 5%)
Evaluation metrics from corresponding original papers, evaluation range 0-80m
NYU-Depth-v2 dataset used for indoor evaluation, resolution 640 × 480 pixels, dense depth GT (density > 95%)
Model implemented with Pytorch framework, trained with batch size 16 for 30 epochs iterations on 8 NVIDIA A100 40G GPUs
Data augmentation used for training, random crop, color jitter, random scale, random flip
Compatible with convolution-based ResNet and transformer-based Swin backbones
Improved sampling process used with 1000 diffusion steps for training and 20 inference steps for inference
Dimension d of encoded depth latent is 16

Benchmark comparison with sota methods

DiffusionDepth outperforms current SOTA models on KITTI offline Eigen split and official offline split
DiffusionDepth slightly underperformed compared to SOTA models on KITTI Online evaluation
DiffusionDepth brings improved visual quality and clear shapes for practical utilization
DiffusionDepth has higher improvement than outdoor scenarios on NYU-Depth-V2 dataset

Ablation study

Denoising process refines depth prediction step by step
Denoising process initializes shapes and edges from random depth distribution
Denoising process corrects distance relations with visual clues
Ablation study conducted on NYU-Depth-V2 dataset to reveal properties of different inference settings

Conclusion

Reformulated monocular depth estimation as a diffusion-denoising approach
Proposed model reaches state-of-the-art performance
Verified feasibility of introducing diffusion-denoising model into 3D perception tasks
Iterative refinement of depth latent generates accurate and detailed depth maps
Ablation study provided to give insights for follow-up works
Diffusion process refines shapes and basic structure of depth map

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Methodology#

Task reformulation#

Network architecture#

Monocular conditioned denoising block#

Diffusion-denosing process#

Experiment#

Benchmark comparison with sota methods#

Ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Methodology

Task reformulation

Network architecture

Monocular conditioned denoising block

Diffusion-denosing process

Experiment

Benchmark comparison with sota methods

Ablation study

Conclusion