Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Monocular depth estimation is formulated using denoising diffusion models.
Innovations are introduced to address problems arising from noisy, incomplete depth maps.
Pre-training is leveraged to cope with limited availability of data.
DepthGen model achieves SOTA performance on the indoor NYU dataset and near SOTA results on the outdoor KITTI dataset.
DepthGen naturally represents depth ambiguity and has zero-shot performance combined with depth imputation.

Paper Content

Introduction

Diffusion probabilistic models are powerful for image synthesis
Adapted to monocular depth estimation
Training data is limited and noisy
Self-supervised pre-training used to capture image structure
Multi-task self-supervised pre-training followed by supervised fine-tuning
Outperforms SOTA baselines on NYU and competitive on KITTI
L 1 loss, depth infilling, and step-unrolled denoising diffusion improve performance
Multi-modal inference to resolve depth ambiguities
Depth imputation for text to 3D generation and novel view synthesis

Monocular depth estimation is essential for many vision applications
Recent progress in specialized loss functions and architectures has been impressive
This paper builds on this literature with a simple, generic architecture
Self-supervised tasks like colorization can be used for pre-training
Masked prediction has been found to be effective for self-supervised training
Large-scale in-domain pre-training has been effective for depth estimation
Diffusion models have been used for image generation and image enhancement

Background

Diffusion models are generative models that transform Gaussian noise into data.
The model has a forward process that adds noise and a reverse process that adds structure.
The model is conditioned on an RGB image and the target is a conditional distribution over depth maps.
The model uses a denoising network to predict a less noisy sample.
The training objective is a sum of non-linear regression losses.
For inference, a random noise sample is drawn and the denoising network is used to estimate the noise.

Self-supervised pre-training

DepthGen training uses self-supervised pre-training and supervised training on RGB-D data
Pre-trained model is a self-supervised multi-task diffusion model
Palette model is trained from scratch on four image-to-image translation tasks

Supervised training with noisy, incomplete depth

Training datasets for depth estimation present challenges due to noisy and incomplete depth maps
Diffusion models are particularly affected by incomplete depth maps
To reduce distribution shift between training and inference, missing depth values are imputed using nearest neighbor interpolation
Sky regions are set to the maximum modeled depth
Step-unrolled denoising diffusion is used during fine-tuning
L1 loss is used to increase robustness to noise

Experiments

Datasets

Used ImageNet-1K and Places365 datasets for unsupervised pre-training
Used ScanNet and ShapeNet datasets for supervised image-to-depth pre-training of indoor model
Used Waymo Open Dataset for outdoor model training
Used NYU depth v2 and KITTI datasets for fine-tuning and evaluation
Used random horizontal flip data augmentation for supervised depth training

Architecture

U-Net architecture is the predominant architecture for diffusion models
Efficient U-Net architecture is more efficient than U-Nets used in prior work
Efficient U-Net has six input and three output channels
For depth models, architecture is modified to have four input channels and one output channel

Hyper-parameters

Trained self-supervised model with L2 loss and mini-batch size of 512
Trained depth models with L1 loss and mini-batch size of 64
Used constant learning rate of 1e-4 during supervised depth pretraining and 3e-5 during fine-tuning
Used learning rate warm-up over 10k steps for all models
Trained indoor depth model on mix of ScanNet and SceneNet RGBD for 2M steps and fine-tuned on NYU for 50k steps
Trained outdoor depth model on Waymo for 0.9M steps and fine-tuned on KITTI for 50k steps

Sampler

Used DDPM ancestral sampler with 128 denoising steps
Increasing denoising steps did not improve performance
Not explored progressive distillation yet

Evaluation metrics

Standard evaluation protocol used in prior work (Li et al., 2022)
Reported metrics: absolute relative error (REL), root mean squared error (RMS), accuracy metrics (δ i < 1.25 i for i ∈ 1, 2, 3), absolute error of log depths (log 10 ), squared relative error (Sq-rel), root mean squared error of log depths (RMS log)

Results

State-of-the-art absolute relative error of 0.074 on NYU depth v2
Competitive performance on KITTI
Averaging depth maps from one or more samples
Pre-training and accounting for missing depth are crucial to model performance
Self-supervised pre-training and supervised depth pre-training are important
Depth infilling is important for outdoor KITTI dataset
L1 loss yields better performance than L2
Diffusion models can capture complex multimodal distributions
Zero-shot imputation of one part of an image conditioned on the rest
Text-to-3D scene generation pipeline
Self-supervised image-to-image pre-training
Supervised depth data training
Multimodal capability of diffusion models
Imputation during iterative refinement in diffusion models

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Background#

Self-supervised pre-training#

Supervised training with noisy, incomplete depth#

Experiments#

Datasets#

Architecture#

Hyper-parameters#

Sampler#

Evaluation metrics#

Results#

Link to paper

Abstract

Paper Content

Introduction

Related work

Background

Self-supervised pre-training

Supervised training with noisy, incomplete depth

Experiments

Datasets

Architecture

Hyper-parameters

Sampler

Evaluation metrics

Results