Abstract

  • Capturing and merging frames to enhance an image disregards the 3D nature of the scene
  • A two-second sequence of 42 RAW frames, each 12 megapixels, can recover high-quality scene depth
  • Test-time optimization approach fits neural RGB-D representation to long-burst data
  • Plane plus depth model is trained end-to-end and performs coarse-to-fine refinement
  • Geometrically accurate depth reconstructions with no additional hardware or separate steps

Paper Content

Introduction

  • Rise and fall of film and DSLR photography
  • Cellphones offer high megapixel image streams
  • On-board motion measurement devices
  • Integrated active depth sensors
  • Depth imaging and 3D reconstruction
  • Depth can help with object understanding
  • Depth can help compensate for non-ideal camera hardware
  • Depth can be used for augmented reality and interactive experiences
  • Many ways to produce depth
  • Apple iPhone 12-14 Pro devices use depth derived from RGB
  • Existing passive monocular depth estimation methods
  • Multiview depth estimation methods
  • Neural radiance field approaches
  • Millimeter-scale view variation from natural hand tremor
  • Unsupervised end-to-end approach to estimate depth and camera motion
  • Jointly distill relative depth and pose estimates
  • Evaluations demonstrate approach outperforms existing methods
  • Depth estimation can be divided into active and passive methods
  • Active methods use controlled illumination to infer object shape or improve stereo feature matching
  • Time-of-flight (ToF) depth sensors use round trip time of photons to infer depth
  • Passive methods use correlation between visual and geometric features to estimate 3D structure
  • Multiview and structure from motion works leverage epipolar geometry to extract 3D information from multiple images
  • Neural scene representation works learn an implicit representation of a 3D scene by fitting a multi-layer perceptron to a set of input images
  • Our work uses a neural representation of RGB to distill high quality continuous representations of both depth and camera poses
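
To give a sense of scale for the millimeter-scale view variation from hand tremor noted above, the standard pinhole relation disparity = f · B / Z can be evaluated for a tremor-sized baseline. This is a rough sketch only: the focal length in pixels, the 2 mm baseline, and the depths are illustrative assumptions, not values taken from the paper.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Pixel disparity of a point at depth_m for a sideways camera shift of baseline_m."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


# Assumed, illustrative values (not from the paper).
FOCAL_PX = 3000.0    # focal length expressed in pixels
BASELINE_M = 0.002   # 2 mm of lateral hand motion

for depth_m in (0.5, 1.0, 5.0):
    shift = disparity_px(FOCAL_PX, BASELINE_M, depth_m)
    print(f"depth {depth_m} m -> disparity {shift:.2f} px")
```

Under these assumptions the parallax shrinks in inverse proportion to depth, which is consistent with the later observation that distant scenes lack parallax information.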

Long-burst photography

  • Burst photography is a type of imaging where multiple frames are taken in rapid succession
  • Parameters such as ISO and exposure time can be varied during capture
  • Burst imaging pipelines investigate how these frames can be merged into a single higher-fidelity image
  • Video processing literature operates on sequences hundreds of frames in length and/or large camera motion
  • Long-burst photography is several seconds of continuous capture with small view variation
  • Data collection tool was designed to record two-second, 42-frame long-bursts
  • RAW images are preserved with 14-bit color depth and minimal processing
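
As a minimal sketch of what "minimal processing" of RAW data can look like, sensor values are commonly mapped to the range [0, 1] using the black and white levels recorded with each frame. The function name and the specific level values below are hypothetical and chosen only to match a 14-bit range; the paper's own pipeline may differ.

```python
def normalize_raw(values, black_level, white_level):
    """Map raw sensor counts to [0, 1] using the black and white levels."""
    span = white_level - black_level
    if span <= 0:
        raise ValueError("white_level must exceed black_level")
    # Clamp so that counts below black or above white stay inside [0, 1].
    return [min(max((v - black_level) / span, 0.0), 1.0) for v in values]


# Hypothetical levels for a 14-bit sensor (maximum count 2**14 - 1 = 16383).
BLACK_LEVEL = 512
WHITE_LEVEL = 16383

print(normalize_raw([0, 512, 8000, 16383], BLACK_LEVEL, WHITE_LEVEL))
```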

Unsupervised depth estimation

  • Proposed a method for depth estimation from long-burst data

Assessment

  • Comparing approach to BARF, SfSM, iPhone 14 Pro, MiDaS, RCVD, and HNDR
  • All baselines run on processed RGB data except HNDR
  • Assessing absolute performance and geometric consistency with 3D objects scanned by a commercial high-precision turntable structured light scanner
  • Outperforming existing learned, mixed, and multi-view only methods
  • Reconstructing small features such as Dragon’s tail, Harold’s scarf, and the ear of the Tiger statue
  • Plane plus depth offset approach avoids spurious depth solutions in low-parallax regions
  • Producing high-quality object reconstructions
  • Quantitative depth metrics outperform all comparison methods
  • RAW long-burst data can improve depth reconstruction compared to 8-bit RGB

Discussion and future work

  • It is possible to recover high-quality, geometrically-accurate object depth from a stack of images acquired during long-burst photography.
  • Approach could potentially be modified to include differentiable models of object motion, deformation, or occlusion.

Supplemental document

  • Additional results, implementation details, and ablation experiments in support of main text findings
  • Section A: Data, training, and evaluation details
  • Section B: Ablation studies on network and encoding parameters
  • Section C: Additional reconstruction results and examination of challenging imaging settings
  • Section D: Validation on simulated data with evaluation of motion estimates
  • Section E: Applications to depth and image matting

A. Implementation details

  • Acquire long-burst data through custom app built on AVFoundation framework for iOS 16
  • Limited to four-frame sequences with overhead between captures
  • Unable to stream RAW captures from multiple cameras simultaneously
  • Record Bayer CFA RAWs, processed RGB images, depth maps, frame timestamps, ISO, exposure time, brightness estimates, black level, white level, camera intrinsics, lens distortion tables, device acceleration estimates, device rotation estimates, and motion data timestamps
  • Estimate shade map with diffuser and uniform light source
  • Training with Adam optimizer, 256 iterations per epoch, 100 epochs
  • Evaluation with relative absolute error (L1-rel) and scale-invariant error (sc-inv) metrics
  • Avoid photometric loss or reprojection error as comparison metrics

B. Additional ablation experiments

  • Multiresolution hash encoding $\gamma_D$ is used to control the scale of depth features reconstructed
  • Increasing the number of levels $L^{\gamma_D}$ and effective max resolution $N^{\gamma_D}_{\max}$ increases the spatial frequency of reconstructed depth features
  • Plane regularization weight $\alpha_P$ affects depth reconstruction bidirectionally
  • Bézier curve model is used to represent translation between frames
  • Number of control points $N_c$ affects smoothness constraints on motion
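
To illustrate how a Bézier curve can represent camera translation over a burst, the sketch below evaluates a curve from its control points using Bernstein polynomials and samples one translation per frame. The control point values and function names are hypothetical; only the general idea (a Bézier translation model whose smoothness depends on the number of control points) comes from the text above.

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] via Bernstein polynomials."""
    n = len(control_points) - 1
    dims = len(control_points[0])
    point = [0.0] * dims
    for i, cp in enumerate(control_points):
        weight = comb(n, i) * (1 - t) ** (n - i) * t ** i
        for d in range(dims):
            point[d] += weight * cp[d]
    return tuple(point)


def sample_translations(control_points, num_frames):
    """One (x, y, z) translation per frame, evenly spaced along the curve."""
    return [bezier_point(control_points, k / (num_frames - 1))
            for k in range(num_frames)]


# Hypothetical control points in meters (millimeter-scale motion).
controls = [(0.0, 0.0, 0.0), (0.001, 0.002, 0.0),
            (0.003, -0.001, 0.0005), (0.002, 0.0, 0.0)]
path = sample_translations(controls, 42)
print(path[0], path[-1])
```

With fewer control points the curve cannot wiggle as much, so the control point count acts as an implicit smoothness constraint on the estimated motion.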

C. Additional reconstruction results

  • Multi-view depth estimation is performed through ray reprojection
  • Different scenes present different challenges
  • Dynamic scenes are difficult to reconstruct due to deformation
  • Textureless and distant scenes lack parallax information
  • Thin structures scene fails to reconstruct due to occlusions
  • Very high dynamic range scene needs HDR image volume and depth reconstruction
  • Lens blur scene needs depth-from-defocus cues
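
The ray reprojection mentioned in the first bullet can be sketched with a plain pinhole camera model: a pixel is lifted to a 3D point using its depth, moved by the relative camera pose, and projected back to pixel coordinates. The intrinsics, pose, and function name below are illustrative assumptions, and lens distortion is ignored.

```python
def reproject(u, v, depth, intrinsics, rotation, translation):
    """Reproject pixel (u, v) with known depth into a second camera view."""
    fx, fy, cx, cy = intrinsics
    # Unproject the pixel to a 3D point in the reference camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform (3x3 rotation, then translation).
    moved = [sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
             for r in range(3)]
    # Project back onto the image plane of the second view.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = (3000.0, 3000.0, 2000.0, 1500.0)  # assumed fx, fy, cx, cy

# A 2 mm sideways shift observed at 1 m depth.
print(reproject(100.0, 50.0, 1.0, INTRINSICS, IDENTITY, (0.002, 0.0, 0.0)))
```

If the estimated depth and pose are correct, the color sampled at the reprojected location should match the color at the original pixel, which is what makes reprojection usable as a self-supervised signal.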

D. Synthetic evaluation

  • Acquired high-fidelity structured light object scans for quantitative evaluation
  • Generated simulated long-burst captures
  • Applied Voronoi color texture to surface of meshes
  • Placed in front of tilted background plane with outdoor image texture
  • Added depth-of-field effects and matched camera intrinsics
  • Rendered frames at 16-bit color depth with Blender’s Eevee engine
  • Able to recover nearly ground truth reconstructions of both objects and background planes
  • High-contrast cues allow method to reconstruct tiny features
  • Camera motion estimates converge to nearly ground truth
  • Model distills high quality depth and camera motion models from long-burst image stack
  • Able to edit scenes by thresholding depth offset component
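
The scene editing in the last bullet can be illustrated with a toy version of a plane plus depth offset decomposition: total depth is a background plane plus a per-pixel offset, and pixels whose offset magnitude exceeds a threshold are treated as foreground. The plane parametrization, the threshold, and the sample values are assumptions made for illustration, not the paper's implementation.

```python
def plane_depth(u, v, plane):
    """Depth of a background plane a*u + b*v + c at pixel (u, v)."""
    a, b, c = plane
    return a * u + b * v + c


def total_depth(offsets, plane):
    """Add the per-pixel offset to the plane depth at every pixel."""
    return [[plane_depth(u, v, plane) + off for u, off in enumerate(row)]
            for v, row in enumerate(offsets)]


def foreground_mask(offsets, threshold):
    """Mark pixels whose offset from the plane exceeds the threshold."""
    return [[abs(off) > threshold for off in row] for row in offsets]


# Hypothetical 2x2 offset map in meters; an object sits in the right column.
offsets = [[0.0, -0.3], [0.01, -0.25]]
plane = (0.0, 0.0, 1.0)  # assumed fronto-parallel plane at 1 m

print(total_depth(offsets, plane))
print(foreground_mask(offsets, threshold=0.1))
```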