Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Monocular depth estimation is formulated using denoising diffusion models.
- Innovations are introduced to address problems arising from noisy, incomplete depth maps.
- Pre-training is leveraged to cope with limited availability of data.
- DepthGen model achieves SOTA performance on the indoor NYU dataset and near SOTA results on the outdoor KITTI dataset.
- DepthGen naturally represents depth ambiguity and has zero-shot performance combined with depth imputation.
Paper Content
Introduction
- Diffusion probabilistic models are powerful for image synthesis
- Adapted to monocular depth estimation
- Training data is limited and noisy
- Self-supervised pre-training used to capture image structure
- Multi-task self-supervised pre-training followed by supervised fine-tuning
- Outperforms SOTA baselines on NYU and competitive on KITTI
- L1 loss, depth infilling, and step-unrolled denoising diffusion improve performance
- Multi-modal inference to resolve depth ambiguities
- Depth imputation for text to 3D generation and novel view synthesis
Related work
- Monocular depth estimation is essential for many vision applications
- Recent progress in specialized loss functions and architectures has been impressive
- This paper builds on this literature with a simple, generic architecture
- Self-supervised tasks like colorization can be used for pre-training
- Masked prediction has been found to be effective for self-supervised training
- Large-scale in-domain pre-training has been effective for depth estimation
- Diffusion models have been used for image generation and image enhancement
Background
- Diffusion models are generative models that transform Gaussian noise into data.
- The model has a forward process that gradually adds noise and a reverse process that removes it step by step, restoring structure.
- The model is conditioned on an RGB image and the target is a conditional distribution over depth maps.
- The model uses a denoising network to predict a less noisy sample.
- The training objective is a sum of non-linear regression losses.
- For inference, a random noise sample is drawn and the denoising network is applied iteratively to estimate and remove the noise.
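The forward process and training loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear beta schedule, the noise-prediction parameterization, and the names `make_alpha_bar`, `add_noise`, and `noise_regression_loss` are all assumptions made for the example.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(depth, t, alpha_bar, rng):
    """Forward process: mix a clean depth map with Gaussian noise at step t."""
    eps = rng.standard_normal(depth.shape)
    noisy = np.sqrt(alpha_bar[t]) * depth + np.sqrt(1.0 - alpha_bar[t]) * eps
    return noisy, eps

def noise_regression_loss(eps_pred, eps, order=1):
    """Regression loss between predicted and true noise (L1 if order == 1, else L2)."""
    diff = np.abs(eps_pred - eps)
    return float(np.mean(diff if order == 1 else diff ** 2))
```

In training, a network conditioned on the RGB image would take `noisy` and `t` and produce `eps_pred`; summing the loss over sampled steps gives an objective of the kind the summary describes.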
Self-supervised pre-training
- DepthGen training uses self-supervised pre-training and supervised training on RGB-D data
- Pre-trained model is a self-supervised multi-task diffusion model
- Palette model is trained from scratch on four image-to-image translation tasks
Supervised training with noisy, incomplete depth
- Training datasets for depth estimation present challenges due to noisy and incomplete depth maps
- Diffusion models are particularly affected by incomplete depth maps
- To reduce distribution shift between training and inference, missing depth values are imputed using nearest neighbor interpolation
- Sky regions are set to the maximum modeled depth
- Step-unrolled denoising diffusion is used during fine-tuning
- L1 loss is used to increase robustness to noise
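As a rough sketch of the infilling step, the function below fills missing depth with the nearest valid value and sets sky pixels to a maximum depth. It uses a brute-force nearest-neighbor search that only suits small maps. The names `infill_depth` and `masked_l1`, the externally supplied `sky_mask`, and the choice to evaluate the L1 loss only on originally valid pixels are assumptions for illustration, not details taken from the summary.

```python
import numpy as np

def infill_depth(depth, valid, sky_mask=None, max_depth=None):
    """Fill invalid pixels with the nearest valid depth; optionally set sky to max_depth."""
    filled = depth.astype(np.float64).copy()
    vy, vx = np.nonzero(valid)    # coordinates of pixels with known depth
    my, mx = np.nonzero(~valid)   # coordinates of pixels with missing depth
    if vy.size and my.size:
        # Squared distance from every missing pixel to every valid pixel.
        d2 = (my[:, None] - vy[None, :]) ** 2 + (mx[:, None] - vx[None, :]) ** 2
        nearest = np.argmin(d2, axis=1)
        filled[my, mx] = filled[vy[nearest], vx[nearest]]
    if sky_mask is not None:
        filled[sky_mask] = max_depth  # sky mask is assumed to come from elsewhere
    return filled

def masked_l1(pred, target, valid):
    """Mean absolute error restricted to pixels with known ground truth (an assumption)."""
    return float(np.abs(pred - target)[valid].mean())
```

Infilling keeps the noisy training inputs dense, as they are at inference time, which is the distribution shift the section refers to.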
Experiments
Datasets
- Used ImageNet-1K and Places365 datasets for unsupervised pre-training
- Used ScanNet and SceneNet RGBD datasets for supervised image-to-depth pre-training of indoor model
- Used Waymo Open Dataset for outdoor model training
- Used NYU depth v2 and KITTI datasets for fine-tuning and evaluation
- Used random horizontal flip data augmentation for supervised depth training
Architecture
- U-Net architecture is the predominant architecture for diffusion models
- Efficient U-Net architecture is more efficient than U-Nets used in prior work
- Efficient U-Net has six input and three output channels
- For depth models, architecture is modified to have four input channels and one output channel
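One plausible reading of the four input channels is that the RGB conditioning image and the noisy depth map are concatenated along the channel axis, with the single output channel being the network's prediction for the depth map. The sketch below shows only that concatenation; the channel-last layout and the helper name `build_denoiser_input` are assumptions.

```python
import numpy as np

def build_denoiser_input(rgb, noisy_depth):
    """Stack an (H, W, 3) RGB image and an (H, W) noisy depth map into (H, W, 4)."""
    if noisy_depth.ndim == 2:
        noisy_depth = noisy_depth[..., None]  # add a trailing channel axis
    return np.concatenate([rgb, noisy_depth], axis=-1)
```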
Hyper-parameters
- Trained self-supervised model with L2 loss and mini-batch size of 512
- Trained depth models with L1 loss and mini-batch size of 64
- Used constant learning rate of 1e-4 during supervised depth pretraining and 3e-5 during fine-tuning
- Used learning rate warm-up over 10k steps for all models
- Trained indoor depth model on mix of ScanNet and SceneNet RGBD for 2M steps and fine-tuned on NYU for 50k steps
- Trained outdoor depth model on Waymo for 0.9M steps and fine-tuned on KITTI for 50k steps
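The learning-rate settings above could be expressed as a small schedule function. The summary states only that warm-up lasts 10k steps; the linear shape of the ramp and the function name `learning_rate` are assumptions.

```python
def learning_rate(step, base_lr, warmup_steps=10_000):
    """Linear warm-up (assumed shape) to base_lr, then constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

For example, `learning_rate(step, 1e-4)` would correspond to supervised depth pre-training and `learning_rate(step, 3e-5)` to fine-tuning.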
Sampler
- Used DDPM ancestral sampler with 128 denoising steps
- Increasing denoising steps did not improve performance
- Progressive distillation has not yet been explored
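A generic DDPM ancestral sampling loop with 128 steps might look like the sketch below. The linear beta schedule, the noise-prediction interface `denoise_fn(rgb, x, t)`, and the variance choice `sigma_t^2 = beta_t` are assumptions made for illustration; the summary does not specify them.

```python
import numpy as np

def ddpm_sample(denoise_fn, rgb, shape, num_steps=128, seed=0,
                beta_start=1e-4, beta_end=0.1):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, num_steps)  # assumed schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)
    for t in range(num_steps - 1, -1, -1):
        eps_hat = denoise_fn(rgb, x, t)  # network's estimate of the noise in x
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean  # no noise is added at the final step
    return x
```

Because each call draws fresh noise when the seed changes, repeated calls give different depth samples for the same image.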
Evaluation metrics
- Standard evaluation protocol used in prior work (Li et al., 2022)
- Reported metrics: absolute relative error (REL), root mean squared error (RMS), accuracy metrics (δ_i < 1.25^i for i ∈ {1, 2, 3}), absolute error of log depths (log10), squared relative error (Sq-rel), root mean squared error of log depths (RMS log)
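The metrics listed above are commonly computed as in the sketch below, using the definitions usual in the depth-estimation literature. The function name `depth_metrics`, the dictionary keys, and the convention that ground-truth pixels with value 0 are invalid are assumptions; predictions are assumed to be strictly positive.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumed convention: 0 marks missing ground truth
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "REL": float(np.mean(np.abs(pred - gt) / gt)),
        "Sq-rel": float(np.mean((pred - gt) ** 2 / gt)),
        "RMS": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "RMS log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "log10": float(np.mean(np.abs(np.log10(pred) - np.log10(gt)))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```

Lower is better for the error metrics, and higher is better for the three delta accuracies.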
Results
- State-of-the-art absolute relative error of 0.074 on NYU depth v2
- Competitive performance on KITTI
- Predictions are obtained by averaging depth maps from one or more samples
- Pre-training and accounting for missing depth are crucial to model performance
- Self-supervised pre-training and supervised depth pre-training are important
- Depth infilling is important for outdoor KITTI dataset
- L1 loss yields better performance than L2
- Diffusion models can capture complex multimodal distributions
- Zero-shot imputation of one part of an image conditioned on the rest
- Text-to-3D scene generation pipeline
- Self-supervised image-to-image pre-training
- Supervised depth data training
- Multimodal capability of diffusion models
- Imputation during iterative refinement in diffusion models
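Imputation during iterative refinement is often done with a replacement step: at each denoising step, the pixels whose depth is already known are overwritten with a suitably noised copy of the known values, so the model only has to fill in the rest. The sketch below shows one such step. The summary does not confirm that this exact variant is the one used, and the name `replace_known_region` and its arguments are hypothetical.

```python
import numpy as np

def replace_known_region(x_t, known_depth, known_mask, alpha_bar_t, rng):
    """Overwrite known pixels of the current sample with a noised copy of known depth."""
    noise = rng.standard_normal(x_t.shape)
    noised_known = (np.sqrt(alpha_bar_t) * known_depth
                    + np.sqrt(1.0 - alpha_bar_t) * noise)
    # Known pixels follow the forward process; unknown pixels keep the model's sample.
    return np.where(known_mask, noised_known, x_t)
```

Calling this once per denoising step, with `alpha_bar_t` matching the current noise level, keeps the known region consistent with the given depth while the remaining pixels are generated conditioned on it.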