Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Presents I$^2$-SDF, a method for intrinsic indoor scene reconstruction and editing
  • Uses differentiable Monte Carlo raytracing on neural signed distance fields
  • Jointly recovers shapes, incident radiance and materials from multi-view images
  • Introduces a novel bubble loss and error-guided adaptive sampling scheme
  • Decomposes neural radiance field into spatially-varying material of the scene
  • Demonstrates superior quality on indoor scene reconstruction, novel view synthesis, and scene editing

Paper Content

Introduction

  • Reconstructing 3D scenes from multi-view images is a fundamental task in computer vision and graphics
  • Neural Radiance Field (NeRF) uses MLPs to approximate the underlying geometry and appearance of a 3D scene
  • Novel view synthesis is insufficient for scene editing applications
  • Inverse rendering or intrinsic decomposition reconstructs and decomposes the scene into shape, shading and surface reflectance
  • Complex indoor scenes are difficult to reconstruct
  • I$^2$-SDF is a new method to decompose a 3D scene into its underlying shape, material, and incident radiance components
  • Bubble loss and error-guided adaptive sampling improve reconstruction quality on small objects
  • I$^2$-SDF enables photorealistic indoor scene relighting and editing
  • A high-quality synthetic multi-view indoor scene dataset is provided
  • Neural implicit scene representations are used to represent 3D geometry and radiance information
  • Neural radiance field (NeRF) uses a single MLP to encode a scene as a continuous volumetric field
  • Follow-up works accelerate reconstruction speed using voxels, hashgrids or deep image features
  • Neural fields can also be applied to represent 3D geometric functions
  • These representations have difficulty handling shape-radiance ambiguity on texture-less surfaces
  • Traditional multi-view stereo methods struggle with texture-less regions
  • Learning-based MVS methods fall into two categories: depth-based and TSDF-based
  • Neural implicit SDF methods used to tackle texture-less regions
  • Inverse rendering attempts to reconstruct and factorize the scene with geometry, material and lighting
  • Neural implicit representations used to estimate BRDF and lighting from image collections
  • Recent methods mainly focus on single object reconstruction and do not handle spatially-varying lighting conditions

Overview

  • Goal is to decompose the shape, radiance and material of an indoor scene from multi-view input images
  • Implicit representations used to model geometry, radiance and material
  • Pipeline consists of neural SDF field, neural radiance field, neural material fields and emission field
  • Two-stage training scheme used to avoid training ambiguities

Implicit neural surface representation and volume rendering

  • Represent scene geometry as an implicit signed distance function (SDF)
  • SDF maps 3D point to closest distance to surface
  • Parameterize SDF and scene appearance as MLP
  • Use differentiable volume rendering to learn scene implicit representation from images
  • Color, depth, and normal of surface can be accumulated
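
The accumulation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a VolSDF-style conversion from SDF to density through a Laplace CDF (this summary does not say which conversion is used), replaces the MLP with an analytic sphere SDF, and accumulates only depth. The names `sphere_sdf`, `laplace_density`, `render_depth` and the value of `beta` are hypothetical.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: positive outside, negative inside.
    Stands in for the learned SDF network (assumption for this sketch)."""
    return math.dist(p, center) - radius

def laplace_density(sdf, beta=0.05):
    """Assumed VolSDF-style conversion: density = (1 / beta) * LaplaceCDF(-sdf)."""
    alpha = 1.0 / beta
    if sdf >= 0.0:
        return alpha * 0.5 * math.exp(-sdf / beta)
    return alpha * (1.0 - 0.5 * math.exp(sdf / beta))

def render_depth(origin, direction, near=0.0, far=4.0, n_samples=512):
    """Accumulate expected depth along one ray with quadrature weights
    w_i = T_i * (1 - exp(-sigma_i * dt)). Color and normals would be
    accumulated with the same weights."""
    dt = (far - near) / n_samples
    transmittance = 1.0
    depth = 0.0
    weight_sum = 0.0
    for i in range(n_samples):
        t = near + (i + 0.5) * dt
        p = tuple(o + t * d for o, d in zip(origin, direction))
        sigma = laplace_density(sphere_sdf(p))
        alpha_i = 1.0 - math.exp(-sigma * dt)
        w = transmittance * alpha_i
        depth += w * t
        weight_sum += w
        transmittance *= 1.0 - alpha_i
    return depth, weight_sum

# A ray starting at z = -3 and pointing along +z meets the unit sphere at t = 2.
depth, weight_sum = render_depth((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
```

With a small `beta` the weights concentrate near the zero level set, so the accumulated depth lands close to the true hit distance.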

Intrinsics decomposition

  • Indoor scenes contain objects of different scales and visibility levels
  • Existing indoor reconstruction methods often fail to recognize and reconstruct thin or suspended objects
  • Neural networks tend to converge faster on low-frequency information than high-frequency information
  • Gradients for small objects can vanish due to the nature of neural networks
  • To address this problem, “bubbles” are inserted to create gradients for SDF near small or thin objects
  • Bubble loss is used to minimize the absolute SDF value of surface points
  • Importance sampling algorithm is used to filter out large planar areas and preserve small-object areas
  • Geometry loss is used to approximate the geometry field
  • Depth and normal priors are used to handle shape-radiance ambiguity
  • Smoothness loss is used to encourage smooth surface reconstruction
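
As a rough sketch of the two ideas above, the bubble loss can be written as the mean absolute SDF over points known to lie on a surface, and error-guided sampling as drawing pixels with probability proportional to their current error. Both are simplifications under stated assumptions: the surface points are taken as given (for example, back-projected from depth), the proportional sampling rule is an assumption rather than the paper's exact scheme, and all names and numbers below are hypothetical.

```python
import random

def floor_sdf(p):
    """Toy scene estimate that only contains a floor at y = 0,
    i.e. a thin object standing on the floor has been missed."""
    return p[1]

def bubble_loss(sdf, bubble_points):
    """Mean absolute SDF value over points that should lie on a surface."""
    return sum(abs(sdf(p)) for p in bubble_points) / len(bubble_points)

def sample_pixels_by_error(error_map, n_samples, rng):
    """Draw pixel indices with probability proportional to per-pixel error
    (assumed sampling rule for this sketch)."""
    return rng.choices(range(len(error_map)), weights=error_map, k=n_samples)

# Hypothetical surface points on a thin lamp pole that the SDF does not yet model.
pole_points = [(0.5, 0.2 * k, 0.5) for k in range(1, 6)]
loss = bubble_loss(floor_sdf, pole_points)

# Hypothetical flattened error map: pixels 2 and 3 cover the poorly fitted pole.
error_map = [0.01, 0.01, 0.9, 0.8, 0.01, 0.01]
picked = sample_pixels_by_error(error_map, 1000, random.Random(0))
```

The loss is nonzero exactly where the current SDF disagrees with the known surface points, which gives the network a gradient signal near the missed object, while the sampler spends most rays on the high-error pixels instead of large planar regions.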

Emitter semantic field

  • Radiance field $F_c$ is trained from LDR images, causing under-estimation of light intensity from emitters
  • Neural emitter semantic field $F_e$ is introduced to optimize radiance value emitted from light sources
  • K-Means algorithm is used to cluster emitter points and an array $L[\cdot]$ is defined to model HDR emissions
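
The clustering step can be illustrated with plain Lloyd iterations. This is only a sketch under assumptions: the emitter points, the initial centroids, and the per-cluster values stored in `L` are hypothetical, and the summary does not say how the array is indexed or optimized. Here one scalar per cluster is assumed.

```python
import math

def kmeans(points, init_centroids, n_iters=10):
    """Plain Lloyd iterations; returns (centroids, one label per point)."""
    centroids = list(init_centroids)
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical 3D points classified as emitters.
emitter_points = [
    (0.0, 2.5, 0.0), (0.1, 2.5, 0.0), (0.0, 2.5, 0.1),  # ceiling lamp
    (3.0, 1.5, 0.0), (3.0, 1.6, 0.1), (3.0, 1.4, 0.0),  # window
]
_, labels = kmeans(emitter_points, [emitter_points[0], emitter_points[-1]])

# Assumed layout: one HDR emission value per cluster (placeholders, not results).
L = [12.0, 5.0]
emission = [L[label] for label in labels]
```

Sharing one entry of `L` across a cluster means every point of the same light source receives the same HDR intensity, which is not bounded by the LDR range of the training images.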

Material field

  • Parameterize spatially-varying material of scene as neural field
  • Use GGX microfacet BRDF model to represent scene material
  • Model albedo and roughness of scene with two MLPs
  • Estimate material parameter associated with ray using volumetric accumulation
  • Enforce physical correctness for predicted material parameters with regularizations
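
For reference, a GGX microfacet BRDF driven by albedo and roughness can be evaluated as below. This is a textbook formulation and not necessarily the exact variant used in the paper: the Lambertian diffuse term, the `alpha = roughness ** 2` remapping, the separable Smith masking term, the Schlick Fresnel approximation, the fixed `f0 = 0.04`, and the scalar albedo are all assumptions made for this sketch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)

def ggx_ndf(n_dot_h, alpha):
    """GGX (Trowbridge-Reitz) normal distribution function."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

def smith_g1(n_dot_x, alpha):
    """Smith masking term for GGX, for one direction."""
    a2 = alpha * alpha
    return 2.0 * n_dot_x / (n_dot_x + math.sqrt(a2 + (1.0 - a2) * n_dot_x * n_dot_x))

def fresnel_schlick(v_dot_h, f0):
    """Schlick approximation of the Fresnel reflectance."""
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

def ggx_brdf(n, v, l, albedo, roughness, f0=0.04):
    """Diffuse plus GGX specular BRDF for unit vectors n (normal), v (view), l (light)."""
    n_dot_v, n_dot_l = dot(n, v), dot(n, l)
    if n_dot_v <= 0.0 or n_dot_l <= 0.0:
        return 0.0  # view or light direction below the surface
    h = normalize(tuple(a + b for a, b in zip(v, l)))
    alpha = roughness * roughness  # assumed roughness remapping
    d = ggx_ndf(dot(n, h), alpha)
    g = smith_g1(n_dot_v, alpha) * smith_g1(n_dot_l, alpha)
    f = fresnel_schlick(dot(v, h), f0)
    specular = d * g * f / (4.0 * n_dot_v * n_dot_l)
    return albedo / math.pi + specular

n = (0.0, 0.0, 1.0)
v = normalize((0.3, 0.0, 1.0))
l = normalize((-0.5, 0.2, 1.0))
value = ggx_brdf(n, v, l, albedo=0.6, roughness=0.4)
```

In a material field, `albedo` and `roughness` would come from the two MLPs evaluated at the surface point rather than from constants.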

Differentiable Monte Carlo raytracing

  • Scene appearance can be re-rendered using surface rendering algorithms
  • Raytracing is performed in a 3D volumetric space
  • Monte Carlo rendering technique is used to perform the scene re-rendering
  • Surface color is rendered by Monte Carlo integration
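
The Monte Carlo integration of surface color can be sketched as an estimator of the reflected-radiance integral, averaging `brdf * incident_radiance * cos(theta) / pdf` over sampled directions. The example below is a simplified, non-differentiable illustration under assumptions: a Lambertian surface with normal +z, cosine-weighted hemisphere sampling, and a hypothetical analytic light in place of the learned radiance field. It shows only the estimator, not how gradients flow through it.

```python
import math
import random

def sample_cosine_hemisphere(rng):
    """Cosine-weighted direction around the +z normal, with its pdf."""
    u1, u2 = rng.random(), rng.random()
    r, phi = math.sqrt(u1), 2.0 * math.pi * u2
    direction = (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1))
    pdf = direction[2] / math.pi
    return direction, pdf

def incident_radiance(direction):
    """Hypothetical lighting: constant radiance 2.0 inside the cone cos(theta) > 0.5."""
    return 2.0 if direction[2] > 0.5 else 0.0

def shade_lambertian(albedo, n_samples, rng):
    """Monte Carlo estimate of outgoing radiance for a Lambertian surface."""
    brdf = albedo / math.pi
    total = 0.0
    for _ in range(n_samples):
        wi, pdf = sample_cosine_hemisphere(rng)
        cos_theta = wi[2]
        total += brdf * incident_radiance(wi) * cos_theta / pdf
    return total / n_samples

estimate = shade_lambertian(albedo=0.5, n_samples=20000, rng=random.Random(0))

# Closed form for this toy setup: albedo * radiance * (1 - 0.5 ** 2).
analytic = 0.5 * 2.0 * 0.75
```

For this toy lighting the integral has a closed form, so the estimate can be checked against `analytic`; in the actual pipeline the incident radiance would be queried from the radiance field along each sampled direction.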

Training

  • Training of geometry and radiance fields is done in an end-to-end manner
  • Loss is calculated between 3D neural representation and 2D images
  • Training is done in 3 steps: warm-up, bubble, and smooth
  • Training of material and emission fields is done after geometry and radiance networks are pretrained
  • Material network is weakly supervised by re-rendering results

Experiments

  • Analyze and compare method with state-of-the-art methods
  • Demonstrate qualitative scene editing and relighting results
  • Perform ablation studies to prove effectiveness
  • Propose new synthetic multi-view indoor scene dataset
  • Compare on mesh-based metrics and image-space geometry errors
  • Compare against state-of-the-art neural reconstruction, multi-view stereo and novel view synthesis methods
  • The proposed method outperforms all compared baselines

Scene editing

  • Decomposition results enable photo-realistic scene editing tasks such as material editing and relighting
  • Physically-based rendering algorithm produces photo-realistic lighting effects such as specular reflections
  • The method is robust to inaccurate depth information
  • The adaptive sampling strategy is shown to be effective
  • Proposes I$^2$-SDF to reconstruct an intrinsic neural scene from multi-view images
  • Novel bubbling strategy recovers small objects in large-scale scenes
  • Limitations include MLP-based network backbone and time-consuming MC raytracing
  • Calculate PDF value corresponding to view and lighting direction
  • Visualization of error-guided sampling map during training
  • Outperforms baselines in novel view synthesis
  • Raytracing algorithm casts shadows of inserted object
  • Quantitative comparisons of novel view synthesis and geometric reconstruction results