Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Monocular depth estimation is a challenging task that predicts the pixel-wise depth from a single 2D image.
- DiffusionDepth is a new approach that reformulates monocular depth estimation as a denoising diffusion process.
- The model learns to reverse the process of diffusing the refined depth of itself into random depth distribution.
- DiffusionDepth is superior for generating accurate and highly detailed depth maps.
Paper Content
Introduction
- Monocular depth estimation has numerous applications
- Mainstream methods employ CNNs for dense per-pixel regression
- Follow-up approaches modify the backbone structure for better visual feature
- Transformer structures are introduced for higher performance
- Pure regression methods suffer from overfitting and unsatisfactory object details
- Additional constraints such as uncertainty and planarity prior are used to increase robustness
- NewCRFs introduces window-separated CRF to enhance local space relation
- DORN and Soft Ordinary discretize depth into intervals and reformulate as classification problem
- Follow-up methods merge regression and classification
- This paper proposes DiffusionDepth, a novel framework for monocular depth estimation
- DiffusionDepth performs iterative denoising process to capture coarse and fine details
- Self-diffusion process is used to address sparse ground truth depth problem
- DiffusionDepth achieves SOTA performance on KITTI and NYU-Depth-V2 datasets
Related works
- Monocular Depth Estimation is a task in computer vision to estimate depth map from a single RGB image
- Early approaches used Markov random field, more recent approaches use deep convolutional neural networks
- Formulated as a dense per-pixel regression problem
- Followup approaches focus on modifying the backbone structure to enhance visual features
- Transformer structures have been introduced to improve performance
- Additional constraints such as uncertainty and piecewise planarity prior have been introduced
- Diffusion model has been introduced to refine depth prediction
Methodology
Task reformulation
- Diffusion models are a class of latent variable models used for generative tasks.
- Neural networks are trained to denoise images blurred with Gaussian noise by learning to reverse the diffusion process.
- Depth estimation is reformulated as a visual-condition guided denoising process which refines the depth distribution iteratively.
Network architecture
- Swin Transformer is used as an example to illustrate feature extraction
- Input image is patched and projected into visual tokens with position embedding
- Backbone extracts visual features at different scales
- Hierarchical aggregation and heterogeneous interaction is used to enhance features between scales
- Feature pyramid neck is used to aggregate features into monocular visual condition
- DiffusionDepth model is suitable for most visual backbones
Monocular conditioned denoising block
- Depth estimation is formulated as a denoising process.
- Neural network model takes visual condition and current depth latent to predict a distribution.
- Monocular Conditioned Denoising Block is used to achieve the denoising process.
- Visual condition is fused with depth latent through hierarchically.
Diffusion-denosing process
- Diffusion process and denoising process are defined in equations
- Trainable parameters are conditioned denoising model and visual feature extractors
- Model is trained by minimizing loss between diffusion results and denoising prediction
- Depth is calculated through equation
- Encoder and decoder are trained by minimizing pixel-wise depth loss
- Supervision is applied to both latent spaces through L2 loss
- Self-diffusion process is used to tackle sparse ground truth depth value problem
Experiment
- Experiments conducted on outdoor and indoor scenarios to evaluate proposed DiffusionDepth and its properties
- KITTI dataset used for outdoor evaluation, image resolution 1216 ร 352 pixels, sparse GT depth (density 3.75% to 5%)
- Evaluation metrics from corresponding original papers, evaluation range 0-80m
- NYU-Depth-v2 dataset used for indoor evaluation, resolution 640 ร 480 pixels, dense depth GT (density > 95%)
- Model implemented with Pytorch framework, trained with batch size 16 for 30 epochs iterations on 8 NVIDIA A100 40G GPUs
- Data augmentation used for training, random crop, color jitter, random scale, random flip
- Compatible with convolution-based ResNet and transformer-based Swin backbones
- Improved sampling process used with 1000 diffusion steps for training and 20 inference steps for inference
- Dimension d of encoded depth latent is 16
Benchmark comparison with sota methods
- DiffusionDepth outperforms current SOTA models on KITTI offline Eigen split and official offline split
- DiffusionDepth slightly underperformed compared to SOTA models on KITTI Online evaluation
- DiffusionDepth brings improved visual quality and clear shapes for practical utilization
- DiffusionDepth has higher improvement than outdoor scenarios on NYU-Depth-V2 dataset
Ablation study
- Denoising process refines depth prediction step by step
- Denoising process initializes shapes and edges from random depth distribution
- Denoising process corrects distance relations with visual clues
- Ablation study conducted on NYU-Depth-V2 dataset to reveal properties of different inference settings
Conclusion
- Reformulated monocular depth estimation as a diffusion-denoising approach
- Proposed model reaches state-of-the-art performance
- Verified feasibility of introducing diffusion-denoising model into 3D perception tasks
- Iterative refinement of depth latent generates accurate and detailed depth maps
- Ablation study provided to give insights for follow-up works
- Diffusion process refines shapes and basic structure of depth map