Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Monocular depth estimation is a challenging task that predicts the pixel-wise depth from a single 2D image.
  • DiffusionDepth is a new approach that reformulates monocular depth estimation as a denoising diffusion process.
  • The model learns to reverse a diffusion process that turns its own refined depth prediction into a random depth distribution.
  • DiffusionDepth generates accurate, highly detailed depth maps and reaches state-of-the-art results on KITTI and NYU-Depth-V2.

Paper Content

Introduction

  • Monocular depth estimation has numerous applications
  • Mainstream methods employ CNNs for dense per-pixel regression
  • Follow-up approaches modify the backbone structure for better visual features
  • Transformer structures are introduced for higher performance
  • Pure regression methods suffer from overfitting and unsatisfactory object details
  • Additional constraints such as uncertainty and planarity prior are used to increase robustness
  • NewCRFs introduces window-separated CRFs to strengthen local spatial relations
  • DORN and Soft Ordinary discretize depth into intervals and reformulate the task as a classification problem
  • Follow-up methods merge regression and classification
  • This paper proposes DiffusionDepth, a novel framework for monocular depth estimation
  • DiffusionDepth performs an iterative denoising process to capture both coarse and fine details
  • A self-diffusion process is used to address the sparse ground-truth depth problem
  • DiffusionDepth achieves SOTA performance on KITTI and NYU-Depth-V2 datasets
  • Monocular depth estimation is a computer vision task that estimates a depth map from a single RGB image
  • Early approaches used Markov random fields; more recent approaches use deep convolutional neural networks
  • The task is formulated as a dense per-pixel regression problem
  • Follow-up approaches focus on modifying the backbone structure to enhance visual features
  • Transformer structures have been introduced to improve performance
  • Additional constraints such as uncertainty and piecewise planarity priors have been introduced
  • Diffusion models have been introduced to refine depth prediction

Methodology

Task reformulation

  • Diffusion models are a class of latent variable models used for generative tasks.
  • Neural networks are trained to denoise images corrupted with Gaussian noise by learning to reverse the diffusion process.
  • Depth estimation is reformulated as a visual-condition-guided denoising process that refines the depth distribution iteratively.
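
The forward (noising) half of this reformulation can be sketched as follows. This is a minimal illustration in NumPy, not the paper's code: the linear beta schedule, its end points, and the helper names `linear_beta_schedule` and `q_sample` are assumptions in the style of standard DDPM formulations. The denoiser is then trained to undo this corruption, guided by the visual condition.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (an assumed, commonly used choice)."""
    return np.linspace(beta_start, beta_end, num_steps)

def q_sample(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I) and return the noise used."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    return x_t, noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
alphas_cumprod = np.cumprod(1.0 - betas)  # a_bar_t, decreasing from ~1 towards 0

depth = rng.uniform(0.0, 1.0, size=(8, 8))  # toy normalized depth map
x_early, noise_early = q_sample(depth, 10, alphas_cumprod, rng)   # still close to the depth map
x_late, noise_late = q_sample(depth, 999, alphas_cumprod, rng)    # almost pure Gaussian noise
```

At small `t` the sample stays close to the clean depth, while at the last step almost nothing of the original signal remains, which is why inference can start from random noise.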

Network architecture

  • Swin Transformer is used as an example to illustrate feature extraction
  • Input image is patched and projected into visual tokens with position embedding
  • Backbone extracts visual features at different scales
  • Hierarchical aggregation and heterogeneous interaction are used to enhance features between scales
  • Feature pyramid neck is used to aggregate features into monocular visual condition
  • DiffusionDepth model is suitable for most visual backbones
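
The patch-and-project step can be illustrated with a small NumPy sketch. Everything here is an assumption chosen for illustration: the patch size of 4, the token width of 96, the random projection weights, and the additive position embedding are toy values, and `patchify` and `embed_patches` are hypothetical helpers rather than part of any backbone's API.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping, flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, proj, pos_embed, patch_size=4):
    """Linearly project patches into tokens and add a position embedding."""
    tokens = patchify(image, patch_size) @ proj  # (num_patches, dim)
    return tokens + pos_embed

rng = np.random.default_rng(0)
image = rng.random((32, 64, 3))                  # toy RGB image
patch_size, dim = 4, 96                          # assumed toy values
num_patches = (32 // patch_size) * (64 // patch_size)
proj = rng.standard_normal((patch_size * patch_size * 3, dim)) * 0.02
pos_embed = rng.standard_normal((num_patches, dim)) * 0.02
tokens = embed_patches(image, proj, pos_embed, patch_size)
```

A real backbone would learn `proj` and `pos_embed` and then process the tokens at several scales; the sketch stops at the token sequence.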

Monocular conditioned denoising block

  • Depth estimation is formulated as a denoising process.
  • Neural network model takes visual condition and current depth latent to predict a distribution.
  • Monocular Conditioned Denoising Block is used to achieve the denoising process.
  • The visual condition is fused with the depth latent hierarchically.
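
One way to picture such a block is the sketch below. It is a hypothetical stand-in, not the paper's Monocular Conditioned Denoising Block: the per-pixel (1x1) projections, the element-wise sum used for fusion, the ReLU, and all layer widths except the latent dimension of 16 are assumptions, and the time-step embedding a real denoiser would need is left out for brevity.

```python
import numpy as np

def conv1x1(x, weight):
    """Per-pixel linear projection: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def denoise_block(depth_latent, visual_cond, params):
    """Fuse the visual condition into the depth latent and predict the noise."""
    cond = conv1x1(visual_cond, params["w_cond"])   # project condition to latent channels
    fused = depth_latent + cond                      # element-wise fusion (assumed)
    hidden = np.maximum(conv1x1(fused, params["w_hidden"]), 0.0)  # ReLU
    return conv1x1(hidden, params["w_out"])          # same shape as the depth latent

rng = np.random.default_rng(0)
latent_dim, cond_dim, hidden_dim = 16, 64, 32        # only latent_dim = 16 comes from the text
params = {
    "w_cond": rng.standard_normal((latent_dim, cond_dim)) * 0.02,
    "w_hidden": rng.standard_normal((hidden_dim, latent_dim)) * 0.02,
    "w_out": rng.standard_normal((latent_dim, hidden_dim)) * 0.02,
}
depth_latent = rng.standard_normal((latent_dim, 12, 40))
visual_cond = rng.standard_normal((cond_dim, 12, 40))
pred_noise = denoise_block(depth_latent, visual_cond, params)
```

The key property is that the output has the same shape as the depth latent, so the block can be applied repeatedly during iterative refinement.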

Diffusion-denoising process

  • Diffusion process and denoising process are defined in equations
  • Trainable parameters are conditioned denoising model and visual feature extractors
  • Model is trained by minimizing loss between diffusion results and denoising prediction
  • Depth is calculated through equation
  • Encoder and decoder are trained by minimizing pixel-wise depth loss
  • Supervision is applied to both latent spaces through L2 loss
  • Self-diffusion process is used to tackle sparse ground truth depth value problem
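
The two supervision signals listed above can be combined roughly as in the following sketch. It is an assumption-laden illustration: the plain L1 depth loss, the convention that invalid ground-truth pixels are stored as 0, the `depth_weight` parameter, and all function names are hypothetical, and the paper's exact loss terms are not reproduced here.

```python
import numpy as np

def diffusion_loss(pred_noise, true_noise):
    """L2 loss between the denoiser's prediction and the injected noise."""
    return float(np.mean((pred_noise - true_noise) ** 2))

def masked_depth_loss(pred_depth, gt_depth):
    """Pixel-wise L1 loss over valid ground-truth pixels (gt > 0) only."""
    valid = gt_depth > 0
    if not np.any(valid):
        return 0.0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid])))

def total_loss(pred_noise, true_noise, pred_depth, gt_depth, depth_weight=1.0):
    """Sum of the latent-space loss and the weighted pixel-space depth loss."""
    return (diffusion_loss(pred_noise, true_noise)
            + depth_weight * masked_depth_loss(pred_depth, gt_depth))

# Toy example with sparse ground truth: only two of 64 pixels carry a depth value.
gt_depth = np.zeros((8, 8))
gt_depth[2, 3] = 10.0
gt_depth[5, 1] = 20.0
pred_depth = np.full((8, 8), 12.0)
pred_noise = np.zeros((16, 8, 8))
true_noise = np.ones((16, 8, 8))
loss = total_loss(pred_noise, true_noise, pred_depth, gt_depth)
```

Masking matters because sparse ground truth leaves most pixels without a target. The sketch does not implement self-diffusion itself; it only shows how a latent-space loss and a masked pixel-space loss might be added together.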

Experiment

  • Experiments conducted on outdoor and indoor scenarios to evaluate proposed DiffusionDepth and its properties
  • KITTI dataset used for outdoor evaluation, image resolution 1216 × 352 pixels, sparse GT depth (density 3.75% to 5%)
  • Evaluation metrics from corresponding original papers, evaluation range 0-80m
  • NYU-Depth-v2 dataset used for indoor evaluation, resolution 640 × 480 pixels, dense depth GT (density > 95%)
  • Model implemented in PyTorch and trained with batch size 16 for 30 epochs on 8 NVIDIA A100 40G GPUs
  • Data augmentation used for training: random crop, color jitter, random scale, random flip
  • Compatible with convolution-based ResNet and transformer-based Swin backbones
  • Improved sampling process used, with 1000 diffusion steps for training and 20 steps for inference
  • Dimension d of encoded depth latent is 16
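
Running only 20 steps at inference after training with 1000 can be sketched with a strided, deterministic sampler. The summary does not say which sampler is used, so the DDIM-style update below, the linear beta schedule, and the names `inference_timesteps`, `ddim_step`, and `sample` are all assumptions made for illustration.

```python
import numpy as np

def inference_timesteps(num_train_steps=1000, num_inference_steps=20):
    """Evenly strided subset of training timesteps, from noisiest to cleanest."""
    stride = num_train_steps // num_inference_steps
    return np.arange(num_train_steps - 1, -1, -stride)

def ddim_step(x_t, pred_noise, a_bar_t, a_bar_prev):
    """Deterministic DDIM-style update from timestep t to the previous kept timestep."""
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * pred_noise) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * pred_noise

def sample(denoiser, shape, alphas_cumprod, timesteps, rng):
    """Start from Gaussian noise and refine it over the selected timesteps."""
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        is_last = i + 1 == len(timesteps)
        a_bar_prev = 1.0 if is_last else alphas_cumprod[timesteps[i + 1]]
        x = ddim_step(x, denoiser(x, t), alphas_cumprod[t], a_bar_prev)
    return x

betas = np.linspace(1e-4, 2e-2, 1000)            # assumed linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)
timesteps = inference_timesteps(1000, 20)

# Sanity check with an oracle denoiser that knows the clean target.
rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(16, 4, 4))  # toy depth latent with d = 16

def oracle(x, t):
    a_bar = alphas_cumprod[t]
    return (x - np.sqrt(a_bar) * target) / np.sqrt(1.0 - a_bar)

result = sample(oracle, target.shape, alphas_cumprod, timesteps, rng)
```

With a perfect noise prediction the sampler recovers the target exactly. A trained model only approximates the oracle, which is where the number of inference steps becomes a trade-off between cost and quality.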

Benchmark comparison with SOTA methods

  • DiffusionDepth outperforms current SOTA models on KITTI offline Eigen split and official offline split
  • DiffusionDepth slightly underperforms SOTA models on the KITTI online evaluation
  • DiffusionDepth brings improved visual quality and clear shapes for practical utilization
  • On the NYU-Depth-V2 dataset (indoor), DiffusionDepth shows a larger improvement than in the outdoor scenarios
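
Benchmark comparisons of this kind are usually reported with error and accuracy metrics computed on valid ground-truth pixels only. The summary does not list the metrics, so the sketch below assumes three that are common in depth estimation (absolute relative error, RMSE, and the delta < 1.25 accuracy) and uses the 0-80 m range mentioned for KITTI. `depth_metrics` is a hypothetical helper.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute common depth metrics over pixels with valid ground truth."""
    valid = (gt > min_depth) & (gt < max_depth)   # skip missing or out-of-range pixels
    pred = np.clip(pred[valid], min_depth, max_depth)
    gt = gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                # fraction of pixels within 25% of GT
    return {"abs_rel": float(abs_rel), "rmse": float(rmse), "delta1": float(delta1)}

# Toy example: the third pixel has no ground truth (0) and is ignored.
gt = np.array([10.0, 20.0, 0.0])
pred = np.array([11.0, 30.0, 5.0])
metrics = depth_metrics(pred, gt)
```

Lower is better for `abs_rel` and `rmse`, higher is better for `delta1`. For an indoor dataset the `max_depth` cap would be set to a smaller value.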

Ablation study

  • Denoising process refines depth prediction step by step
  • Denoising process initializes shapes and edges from random depth distribution
  • Denoising process corrects distance relations with visual clues
  • Ablation study conducted on NYU-Depth-V2 dataset to reveal properties of different inference settings

Conclusion

  • Reformulated monocular depth estimation as a diffusion-denoising approach
  • Proposed model reaches state-of-the-art performance
  • Verified feasibility of introducing diffusion-denoising model into 3D perception tasks
  • Iterative refinement of depth latent generates accurate and detailed depth maps
  • Ablation study provided to give insights for follow-up works
  • Diffusion process refines shapes and basic structure of depth map