Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Capturing and merging frames to enhance an image disregards 3D nature of scene
42 12-megapixel RAW frames captured in 2-second sequence can recover high-quality scene depth
Test-time optimization approach fits neural RGB-D representation to long-burst data
Plane plus depth model is trained end-to-end and performs coarse-to-fine refinement
Geometrically accurate depth reconstructions with no additional hardware or separate steps

Paper Content

Introduction

Rise and fall of film and DSLR photography
Cellphones offer high megapixel image streams
On-board motion measurement devices
Integrated active depth sensors
Depth imaging and 3D reconstruction
Depth can help with object understanding
Depth can help compensate for non-ideal camera hardware
Depth can be used for augmented reality and interactive experiences
Many ways to produce depth
Apple iPhone 12-14 Pro devices use depth derived from RGB
Existing passive monocular depth estimation methods
Multiview depth estimation methods
Neural radiance field approaches
Millimeter-scale view variation from natural hand tremor
Unsupervised end-to-end approach to estimate depth and camera motion
Jointly distill relative depth and pose estimates
Evaluations demonstrate approach outperforms existing methods

Depth estimation can be divided into active and passive methods
Active methods use controlled illumination to infer object shape or improve stereo feature matching
Time-of-flight (ToF) depth sensors use round trip time of photons to infer depth
Passive methods use correlation between visual and geometric features to estimate 3D structure
Multiview and structure from motion works leverage epipolar geometry to extract 3D information from multiple images
Neural scene representation works learn an implicit representation of a 3D scene by fitting a multi-layer perceptron to a set of input images
Our work uses a neural representation of RGB to distill high quality continuous representations of both depth and camera poses

Long-burst photography

Burst photography is a type of imaging where multiple frames are taken in rapid succession
Parameters such as ISO and exposure time can be varied during capture
Burst imaging pipelines investigate how these frames can be merged into a single higher-fidelity image
Video processing literature operates on sequences hundreds of frames in length and/or large camera motion
Long-burst photography is several seconds of continuous capture with small view variation
Data collection tool was designed to record two-second, 42 frame long-bursts
RAW images are preserved with 14-bit color depth and minimal processing

Unsupervised depth estimation

Proposed a method for depth estimation from long-burst data

Assessment

Comparing approach to BARF, SfSM, iPhone 14 Pro, MiDaS, RCVD, and HNDR
All baselines run on processed RGB data except HNDR
Assessing absolute performance and geometric consistency with 3D objects scanned by a commercial high-precision turntable structured light scanner
Outperforming existing learned, mixed, and multi-view only methods
Reconstructing small features such as Dragon’s tail, Harold’s scarf, and the ear of the Tiger statue
Plane plus depth offset approach avoids spurious depth solutions in low-parallax regions
Producing high-quality object reconstructions
Quantitative depth metrics outperform all comparison methods
RAW long-burst data can improve depth reconstruction compared to 8-bit RGB

Discussion and future work

It is possible to recover high-quality, geometrically-accurate object depth from a stack of images acquired during long-burst photography.
Approach could potentially be modified to include differentiable models of object motion, deformation, or occlusion.

Supplemental document

Additional results, implementation details, and ablation experiments in support of main text findings
Section A: Data, training, and evaluation details
Section B: Ablation studies on network and encoding parameters
Section C: Additional reconstruction results and examination of challenging imaging settings
Section D: Validation on simulated data with evaluation of motion estimates
Section E: Applications to depth and image matting

A. implementation details

Acquire long-burst data through custom app built on AVFoundation framework for iOS 16
Limited to four frame sequences with overhead between captures
Unable to stream RAW captures from multiple cameras simultaneously
Record Bayer CFA RAWs, processed RGB images, depth maps, frame timestamps, ISO, exposure time, brightness estimates, black level, white level, camera intrinsics, lens distortion tables, device acceleration estimates, device rotation estimates, and motion data timestamps
Estimate shade map with diffuser and uniform light source
Training with Adam optimizer, 256 iterations per epoch, 100 epochs
Evaluation with relative absolute error L1-rel and scale invariant error sc-inv metrics
Avoid photometric loss or reprojection error as comparison metrics

B. additional ablation experiments

Multiresolution hash encoding γD is used to control the scale of depth features reconstructed
Increasing the number of levels L γD and effective max resolution N γD max increases the spatial frequency of reconstructed depth features
Plane regularization weight α P affects depth reconstruction bidirectionally
Bézier curve model is used to represent translation between frames
Number of control points N c affects smoothness constraints on motion

C. additional reconstruction results

Multi-view depth estimation is performed through ray reprojection
Different scenes present different challenges
Dynamic scenes are difficult to reconstruct due to deformation
Textureless and distant scenes lack parallax information
Thin structures scene fails to reconstruct due to occlusions
Very high dynamic range scene needs HDR image volume and depth reconstruction
Lens blur scene needs depth-from-defocus cues

D. synthetic evaluation

Acquired high-fidelity structured light object scans for quantitative evaluation
Generated simulated long-burst captures
Applied Voronoi color texture to surface of meshes
Placed in front of tilted background plane with outdoor image texture
Added depth-of-field effects and matched camera intrinsics
Rendered frames at 16-bit color depth with Blender’s Eevee engine
Able to recover nearly ground truth reconstructions of both objects and background planes
High-contrast cues allow method to reconstruct tiny features
Camera motion estimates converge to nearly ground truth
Model distills high quality depth and camera motion models from long-burst image stack
Able to edit scenes by thresholding depth offset component

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Long-burst photography#

Unsupervised depth estimation#

Assessment#

Discussion and future work#

Supplemental document#

A. implementation details#

B. additional ablation experiments#

C. additional reconstruction results#

D. synthetic evaluation#

Link to paper

Abstract

Paper Content

Introduction

Related work

Long-burst photography

Unsupervised depth estimation

Assessment

Discussion and future work

Supplemental document

A. implementation details

B. additional ablation experiments

C. additional reconstruction results

D. synthetic evaluation