Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Monocular depth estimation is formulated using denoising diffusion models.
  • Innovations are introduced to address problems arising from noisy, incomplete depth maps.
  • Pre-training is leveraged to cope with limited availability of data.
  • DepthGen model achieves SOTA performance on the indoor NYU dataset and near SOTA results on the outdoor KITTI dataset.
  • DepthGen naturally represents depth ambiguity, and its zero-shot performance combined with depth imputation enables a text-to-3D pipeline.

Paper Content

Introduction

  • Diffusion probabilistic models are powerful for image synthesis
  • Adapted to monocular depth estimation
  • Training data is limited and noisy
  • Self-supervised pre-training used to capture image structure
  • Multi-task self-supervised pre-training followed by supervised fine-tuning
  • Outperforms SOTA baselines on NYU and is competitive on KITTI
  • L1 loss, depth infilling, and step-unrolled denoising diffusion improve performance
  • Multi-modal inference to resolve depth ambiguities
  • Depth imputation for text to 3D generation and novel view synthesis
  • Monocular depth estimation is essential for many vision applications
  • Recent progress in specialized loss functions and architectures has been impressive
  • This paper builds on this literature with a simple, generic architecture
  • Self-supervised tasks like colorization can be used for pre-training
  • Masked prediction has been found to be effective for self-supervised training
  • Large-scale in-domain pre-training has been effective for depth estimation
  • Diffusion models have been used for image generation and image enhancement

Background

  • Diffusion models are generative models that transform Gaussian noise into data.
  • The model has a forward process that adds noise and a reverse process that adds structure.
  • The model is conditioned on an RGB image and the target is a conditional distribution over depth maps.
  • The model uses a denoising network to predict a less noisy sample.
  • The training objective is a sum of non-linear regression losses.
  • For inference, a random noise sample is drawn and refined iteratively, with the denoising network estimating the noise at each step.
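
The forward process and regression objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the linear beta schedule, the noise-prediction parameterization, and the squared-error loss are assumptions, and `denoise_fn` is a hypothetical stand-in for the denoising network.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the summary does not specify one)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    return betas, alphas_cumprod

def forward_noise(depth, t, alphas_cumprod, rng):
    """Draw a noisy depth map z_t from the forward process in closed form."""
    eps = rng.standard_normal(depth.shape)
    a_bar = alphas_cumprod[t]
    z_t = np.sqrt(a_bar) * depth + np.sqrt(1.0 - a_bar) * eps
    return z_t, eps

def training_loss(denoise_fn, rgb, depth, t, alphas_cumprod, rng):
    """Regression loss between the true and predicted noise at step t."""
    z_t, eps = forward_noise(depth, t, alphas_cumprod, rng)
    eps_hat = denoise_fn(rgb, z_t, t)  # hypothetical network, conditioned on RGB
    return np.mean((eps_hat - eps) ** 2)
```

Summing this loss over randomly drawn steps `t` gives an objective of the "sum of regression losses" form mentioned above.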

Self-supervised pre-training

  • DepthGen training uses self-supervised pre-training and supervised training on RGB-D data
  • Pre-trained model is a self-supervised multi-task diffusion model
  • Palette model is trained from scratch on four image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG artifact removal)
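
To make the idea of self-supervised image-to-image tasks concrete, the sketch below builds (source, target) training pairs for two such tasks, colorization and masked prediction, from an unlabeled RGB image. The luma weights, the rectangular mask, and filling the masked region with Gaussian noise are illustrative assumptions, not details taken from this summary.

```python
import numpy as np

def colorization_pair(rgb):
    """Source is a grayscale copy of the image; target is the original RGB."""
    gray = rgb @ np.array([0.299, 0.587, 0.114])  # (H, W) luma, assumed weights
    source = np.repeat(gray[..., None], 3, axis=-1)
    return source, rgb

def inpainting_pair(rgb, top, left, height, width, rng):
    """Source has a rectangle replaced by noise; target is the original RGB."""
    source = rgb.copy()
    source[top:top + height, left:left + width] = rng.standard_normal((height, width, 3))
    return source, rgb
```

In both cases the target is the image itself, so no human annotation is required.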

Supervised training with noisy, incomplete depth

  • Training datasets for depth estimation present challenges due to noisy and incomplete depth maps
  • Diffusion models are particularly affected by incomplete depth maps
  • To reduce distribution shift between training and inference, missing depth values are imputed using nearest neighbor interpolation
  • Sky regions are set to the maximum modeled depth
  • Step-unrolled denoising diffusion is used during fine-tuning
  • L1 loss is used to increase robustness to noise
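
The infilling steps above can be sketched as follows. This is a brute-force illustration for small arrays, assuming missing depth is marked by a boolean `valid_mask` and that a `sky_mask` is supplied by some external segmenter (hypothetical inputs). The optional mask in `l1_loss` is also an assumption, since this summary does not say which pixels contribute to the loss.

```python
import numpy as np

def infill_nearest(depth, valid_mask):
    """Fill each invalid pixel with the value of its nearest valid pixel."""
    valid_coords = np.argwhere(valid_mask)  # (K, 2), row-major order
    if valid_coords.size == 0:
        raise ValueError("depth map has no valid values")
    valid_vals = depth[valid_mask]          # (K,), same row-major order
    filled = depth.copy()
    for r, c in np.argwhere(~valid_mask):
        d2 = (valid_coords[:, 0] - r) ** 2 + (valid_coords[:, 1] - c) ** 2
        filled[r, c] = valid_vals[np.argmin(d2)]
    return filled

def prepare_target(depth, valid_mask, sky_mask, max_depth):
    """Infill missing depth, then set sky pixels to the maximum modeled depth."""
    filled = infill_nearest(depth, valid_mask)
    filled[sky_mask] = max_depth
    return filled

def l1_loss(pred, target, mask=None):
    """Mean absolute error, optionally restricted to masked pixels."""
    err = np.abs(pred - target)
    return err.mean() if mask is None else err[mask].mean()
```

The per-pixel loop costs O(missing × valid) and is only meant to show the logic; a distance-transform-based lookup would be the practical choice for full-resolution depth maps.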

Experiments

Datasets

  • Used ImageNet-1K and Places365 datasets for unsupervised pre-training
  • Used ScanNet and SceneNet RGBD datasets for supervised image-to-depth pre-training of indoor model
  • Used Waymo Open Dataset for outdoor model training
  • Used NYU depth v2 and KITTI datasets for fine-tuning and evaluation
  • Used random horizontal flip data augmentation for supervised depth training
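
For depth training, the flip has to be applied to the image and its depth map together so that they stay pixel-aligned. A minimal sketch, assuming `(H, W, 3)` RGB arrays, `(H, W)` depth arrays, and a flip probability of 0.5 (the probability is not given in this summary):

```python
import numpy as np

def random_hflip(rgb, depth, rng, p=0.5):
    """Flip the RGB image and depth map left-to-right with probability p."""
    if rng.random() < p:
        return rgb[:, ::-1].copy(), depth[:, ::-1].copy()
    return rgb, depth
```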

Architecture

  • U-Net architecture is the predominant architecture for diffusion models
  • Efficient U-Net architecture is more efficient than U-Nets used in prior work
  • Efficient U-Net has six input and three output channels
  • For depth models, architecture is modified to have four input channels and one output channel
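
One plausible reading of these channel counts is that the network input concatenates the conditioning RGB image with the noisy target: 3 + 3 channels for image-to-image tasks, and 3 + 1 channels for depth. The sketch below shows the depth case; the channel-last layout and the function name are assumptions.

```python
import numpy as np

def build_network_input(rgb, noisy_depth):
    """Stack an (H, W, 3) RGB image and an (H, W) noisy depth map into (H, W, 4)."""
    return np.concatenate([rgb, noisy_depth[..., None]], axis=-1)
```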

Hyper-parameters

  • Trained self-supervised model with L2 loss and mini-batch size of 512
  • Trained depth models with L1 loss and mini-batch size of 64
  • Used constant learning rate of 1e-4 during supervised depth pre-training and 3e-5 during fine-tuning
  • Used learning rate warm-up over 10k steps for all models
  • Trained indoor depth model on mix of ScanNet and SceneNet RGBD for 2M steps and fine-tuned on NYU for 50k steps
  • Trained outdoor depth model on Waymo for 0.9M steps and fine-tuned on KITTI for 50k steps
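
The learning-rate settings above amount to a warm-up followed by a constant rate. A small sketch, assuming the warm-up is linear from zero (the shape of the warm-up is not stated in this summary):

```python
def learning_rate(step, base_lr, warmup_steps=10_000):
    """Linear warm-up to base_lr over warmup_steps, then constant."""
    return base_lr * min(1.0, step / warmup_steps)

# Rates quoted in the list above.
PRETRAIN_LR = 1e-4
FINETUNE_LR = 3e-5
```

For example, `learning_rate(5_000, PRETRAIN_LR)` is half the pre-training rate, and any step at or beyond 10k returns the full rate.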

Sampler

  • Used DDPM ancestral sampler with 128 denoising steps
  • Increasing denoising steps did not improve performance
  • Progressive distillation has not yet been explored
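
For reference, a DDPM ancestral sampling loop looks roughly like the sketch below. It assumes a noise-predicting network (`denoise_fn`, hypothetical), a caller-supplied beta schedule with one entry per denoising step, and the common choice of setting the sampling variance to `beta_t`; none of these details come from this summary.

```python
import numpy as np

def ddpm_sample(denoise_fn, rgb, shape, betas, rng):
    """Ancestral sampling: start from Gaussian noise and denoise step by step."""
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)
    z = rng.standard_normal(shape)
    for t in reversed(range(len(betas))):
        eps_hat = denoise_fn(rgb, z, t)  # predicted noise, conditioned on RGB
        mean = (z - betas[t] / np.sqrt(1.0 - alphas_cumprod[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            z = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            z = mean  # no noise is added at the final step
    return z
```

With `betas` of length 128, the loop calls the network 128 times, matching the step count quoted above.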

Evaluation metrics

  • Standard evaluation protocol used in prior work (Li et al., 2022)
  • Reported metrics: absolute relative error (REL), root mean squared error (RMS), accuracy metrics ($\delta_i < 1.25^i$ for $i \in \{1, 2, 3\}$), absolute error of log depths ($\log_{10}$), squared relative error (Sq-rel), root mean squared error of log depths (RMS log)
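
The metrics listed above have standard definitions in the depth-estimation literature, which the following sketch implements. It assumes that pixels with ground-truth depth of zero or less are invalid and excluded, and that predictions are strictly positive; dataset-specific crops and depth caps are not handled.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rms": np.sqrt(np.mean((pred - gt) ** 2)),
        "rms_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```

Lower is better for the error metrics, while the three accuracy values are fractions of pixels and higher is better.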

Results

  • State-of-the-art absolute relative error of 0.074 on NYU depth v2
  • Competitive performance on KITTI
  • Predictions are obtained by averaging depth maps from one or more samples
  • Pre-training and accounting for missing depth are crucial to model performance
  • Self-supervised pre-training and supervised depth pre-training are important
  • Depth infilling is important for outdoor KITTI dataset
  • L1 loss yields better performance than L2
  • Diffusion models can capture complex multimodal distributions
  • Zero-shot imputation of one part of an image conditioned on the rest
  • Text-to-3D scene generation pipeline
  • Self-supervised image-to-image pre-training
  • Supervised depth data training
  • Multimodal capability of diffusion models
  • Imputation during iterative refinement in diffusion models
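
One common way to impute during iterative refinement is the "replacement" heuristic: at every denoising step, the known region of the current sample is overwritten with a suitably noised copy of the known values, so that only the unknown region is actually generated. The sketch below shows a single such step. Whether DepthGen uses exactly this scheme is an assumption here, and `a_bar_t` denotes the cumulative signal level at step `t` in a standard DDPM parameterization.

```python
import numpy as np

def impute_step(z_t, known_depth, known_mask, a_bar_t, rng):
    """Overwrite the known region of z_t with known depth noised to level t."""
    noise = rng.standard_normal(known_depth.shape)
    noised_known = np.sqrt(a_bar_t) * known_depth + np.sqrt(1.0 - a_bar_t) * noise
    return np.where(known_mask, noised_known, z_t)
```

Calling this before each denoising update keeps the known depth fixed in the final output while the model fills in the rest.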