Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.


  • Proposes a new challenge for synthesizing a novel view in a practical environment with limited input images and significant illumination variations
  • Suggests ExtremeNeRF to address the problem, which utilizes occlusion-aware multiview albedo consistency, supported by geometric alignment and depth consistency
  • Extracts intrinsic image components that are illumination-invariant across different views
  • Provides extensive experimental results for an evaluation of the task using the newly built NeRF Extreme benchmark

Paper Content


  • Neural radiance fields (NeRF) have had significant impacts on 3D scene reconstruction and novel view synthesis
  • Various aspects of NeRF have been improved, such as generalization ability, representation ability, and practicality
  • ExtremeNeRF provides reliable novel view synthesis results compared to the state-of-the-art method
  • NeRF often struggles to render large patches because of the computational cost
  • ExtremeNeRF enforces consistency among intrinsic components between input and rendered views
  • A new NeRF Extreme dataset has been proposed, which is the first in-the-wild multi-view dataset with both indoor and outdoor scenes taken under varying illumination
  • ExtremeNeRF provides plausible few-shot view synthesis and video rendering results with fewer inputs and varying illumination
  • Few-shot NeRF leverages knowledge priors to improve performance
  • Depth and geometry priors are used
  • NeRF-W enables view synthesis of internet photos taken under varying illumination
  • Tancik et al. presented a way to synthesize a large-scale city scene captured under varying lighting conditions
  • Existing datasets focus on varying single attributes like viewing direction or illumination
  • Li et al. disentangled intrinsic components without supervision
  • Ye et al. suggested a NeRF framework for scene editing


  • Neural radiance fields (NeRF) is a volume rendering-based view-synthesis framework
  • NeRF maps 5D inputs to color and volume density
  • Vanilla NeRF performs view synthesis by minimizing the mean squared error of the synthesized color
  • RegNeRF uses a depth smoothness regularization for few-shot view synthesis
  • RegNeRF relies on the assumption that input images should share consistent illumination conditions
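The volume-rendering step and the color loss described above can be sketched for a single ray as follows. This is a minimal NumPy illustration of the standard NeRF quadrature, not the authors' JAX implementation; the function names `render_ray` and `mse_loss` and the argument layout are assumptions made for this example.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colors along one ray.

    sigmas: (N,) non-negative volume densities predicted by the network
    colors: (N, 3) RGB values predicted by the network
    t_vals: (N + 1,) boundaries of the N segments along the ray
    """
    deltas = np.diff(t_vals)                    # (N,) segment lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)     # opacity of each segment
    # Transmittance: fraction of light that reaches segment i unblocked.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas                    # contribution of each segment
    rgb = (weights[:, None] * colors).sum(axis=0)
    t_mid = 0.5 * (t_vals[:-1] + t_vals[1:])    # segment midpoints
    depth = (weights * t_mid).sum()             # expected ray termination depth
    return rgb, depth, weights

def mse_loss(pred_rgb, gt_rgb):
    """Mean squared error between rendered and ground-truth colors."""
    return np.mean((pred_rgb - gt_rgb) ** 2)
```

The same weights yield both the rendered color and an expected depth, which is why depth regularizers such as the one in RegNeRF can be applied without extra network outputs.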



  • Objective: Build illumination-robust few-shot view synthesis framework
  • Challenges: Geometric alignment, occluded regions, global context
  • Solution: Offline intrinsic decomposition network, pseudo-albedo ground truth

Intrinsic consistency regularization

  • Two images are randomly selected from a set of inputs and novel views for each iteration
  • A projective transformation is used to get a pixel correspondence between the two images
  • Albedo consistency is imposed between inputs and novel views
  • Occlusion handling is used to minimize projection errors
  • Depth consistency loss is used to regularize the scene geometry and reduce floating artifacts
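The correspondence, occlusion, and albedo-consistency steps listed above can be sketched roughly as follows. This is a simplified NumPy illustration under stated assumptions: a pinhole camera with shared intrinsics `K`, a known relative pose `(R, t)` from view i to view j, nearest-neighbor sampling, and a relative depth tolerance `tol` for the occlusion test. The function names, the tolerance, and the L1 form of the loss are hypothetical; the paper's exact formulation may differ.

```python
import numpy as np

def warp_pixels(depth_i, K, R, t):
    """Project every pixel of view i into view j using its depth.

    Returns (H, W, 2) pixel coordinates (u, v) in view j and the (H, W)
    depth of each projected point in camera j.
    """
    H, W = depth_i.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # back-project pixels to rays
    pts_i = rays * depth_i[..., None]      # 3D points in camera i
    pts_j = pts_i @ R.T + t                # the same points in camera j
    proj = pts_j @ K.T
    uv_j = proj[..., :2] / np.clip(proj[..., 2:3], 1e-8, None)
    return uv_j, pts_j[..., 2]

def visibility_mask(uv_j, z_proj, depth_j, tol=0.05):
    """Mark pixels that land inside view j and agree with its depth map."""
    H, W = depth_j.shape
    u = np.rint(uv_j[..., 0]).astype(int)
    v = np.rint(uv_j[..., 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z_proj > 0)
    u_c, v_c = np.clip(u, 0, W - 1), np.clip(v, 0, H - 1)
    # A large depth disagreement suggests the point is occluded in view j.
    agree = np.abs(depth_j[v_c, u_c] - z_proj) < tol * z_proj
    return inside & agree, u_c, v_c

def albedo_consistency_loss(albedo_i, albedo_j, mask, u_c, v_c):
    """Masked L1 difference between corresponding albedo values."""
    diff = np.abs(albedo_i - albedo_j[v_c, u_c]).sum(axis=-1)
    return (diff * mask).sum() / max(mask.sum(), 1)
```

Masking before averaging keeps occluded or out-of-frame pixels from contributing spurious penalties, which is the role the occlusion handling plays in the list above.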

Albedo estimation

  • Intrinsic decomposition requires global context of a scene
  • NeRF struggles with high-resolution inputs
  • Intrinsic rendering methods cannot be used due to lack of input data
  • Proposed two-stage intrinsic decomposition pipeline: FIDNet and PIDNet
  • FIDNet extracts intrinsic components of input images offline
  • PIDNet extracts patch-wise intrinsic components of synthesized color patch at novel view

Total loss functions

  • Edge-preserving loss ensures gradients of the novel view are the same as the input view
  • Intrinsic smoothness loss and depth consistency loss are formulated in the same way
  • Chromaticity consistency loss ensures the chromaticity of the input patch and the extracted albedo are the same
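To make the loss terms above more concrete, here is a small NumPy sketch of a chromaticity comparison and a gradient-matching term. It assumes one common definition of chromaticity (each channel divided by the sum of the channels) and simple finite-difference gradients; the function names and the L1 penalties are assumptions for illustration, not the paper's exact losses.

```python
import numpy as np

def chromaticity(rgb, eps=1e-6):
    """Normalize each pixel by its channel sum to remove intensity."""
    return rgb / (rgb.sum(axis=-1, keepdims=True) + eps)

def chromaticity_loss(patch_rgb, albedo):
    """L1 difference between the chromaticity of a color patch and its albedo."""
    return np.mean(np.abs(chromaticity(patch_rgb) - chromaticity(albedo)))

def image_gradients(img):
    """Horizontal and vertical finite differences of an (H, W, C) image."""
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    return dx, dy

def gradient_matching_loss(img_a, img_b):
    """L1 difference between the gradients of two equally sized images."""
    ax, ay = image_gradients(img_a)
    bx, by = image_gradients(img_b)
    return np.mean(np.abs(ax - bx)) + np.mean(np.abs(ay - by))
```

Under this definition, scaling a pixel by a gray shading value leaves its chromaticity nearly unchanged, so the chromaticity term ties the albedo's hue to the observed patch while ignoring brightness.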


  • Introducing NeRF Extreme, a multi-view dataset with varying illumination
  • Scenes are not limited to object-centric ones
  • Details of dataset statistics and experimental settings in supplementary material

NeRF Extreme dataset

  • Collected multi-view images with a variety of light sources
  • Varied illumination by turning light sources on and off and by opening and closing curtains
  • Captured outdoor scenes at different times of day under different sunlight conditions
  • Images taken in the wild using an off-the-shelf mobile phone camera
  • Camera poses obtained using COLMAP structure-from-motion framework
  • Depth maps obtained by multi-view stereo method

Experimental settings

  • Framework based on JAX and RegNeRF
  • Used IIDWW code and model without fine-tuning
  • Experiments on DTU, NeRD, LLFF, and NeRF Extreme datasets
  • Compared to mip-NeRF and RegNeRF for few-shot view synthesis
  • Weak comparisons with NeRF-W, NeROIC, and other works

Evaluation metrics

  • Evaluating novel view synthesis under varying illumination is an ill-posed problem
  • Synthesizing the exact scene appearance is impossible from the input information alone
  • Evaluation methods beyond PSNR and SSIM are used to compare the underlying characteristics of a scene regardless of illumination
  • A color-consistency metric measures how well the synthesized image maintains colors consistent with the inputs
  • Absolute Relative Error (Abs Rel) is used to compare the quality of synthesized depth
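For reference, two of the metrics named above have standard definitions that can be sketched in a few lines of NumPy. The function names, the `max_val` parameter, and the convention of skipping pixels with non-positive ground-truth depth are assumptions made for this example; the color-consistency metric is specific to the paper and is not reproduced here.

```python
import numpy as np

def abs_rel(pred_depth, gt_depth):
    """Absolute Relative Error: mean of |pred - gt| / gt over valid pixels."""
    valid = gt_depth > 0                      # skip pixels without ground truth
    return np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid])

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Abs Rel is scale-relative, so a 10% depth error counts the same for near and far surfaces, whereas PSNR compares colors directly and is therefore sensitive to illumination differences between the rendering and the reference.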

Experimental results

  • Proposed ExtremeNeRF outperforms other few-shot view synthesis baselines on NeRF Extreme and light-varying DTU datasets
  • ExtremeNeRF eliminates distortions from varying illumination inputs
  • Cross-view consistency regularization helps maintain underlying color consistency and depth map
  • Albedo consistency regularization helps regularize depth
  • Occlusion handling prevents consistency regularization between pixels with occlusions
  • Results indicate that ExtremeNeRF can maintain the physical properties of the target scene given sparse inputs and varying illumination
  • Quantitative comparison shows ExtremeNeRF has lowest CCRD and Abs Rel
  • Ablation studies show depth consistency term is beneficial for color and depth map
  • NeROIC-Geom and NeROIC-Full show over-smoothed or diverged results on NeRF Extreme dataset
  • Relighting results achieved by replacing shading image of synthesized scene
  • Tent scene results show difficulties in regularizing intrinsic components
  • Additional qualitative comparisons show ExtremeNeRF performs better than other methods in synthesizing depth
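The relighting result mentioned in the list above follows from the usual intrinsic image model, in which an image is the per-pixel product of albedo and shading. The sketch below assumes a single-channel (grayscale) shading map and images in [0, 1]; the function names are hypothetical and this is an illustration of the idea, not the authors' procedure.

```python
import numpy as np

def compose(albedo, shading):
    """Recombine intrinsic components: image = albedo * shading."""
    return np.clip(albedo * shading[..., None], 0.0, 1.0)

def relight(albedo, new_shading):
    """Keep the illumination-invariant albedo and swap in a new shading map."""
    return compose(albedo, new_shading)
```

Because the albedo is held fixed, only the illumination changes between the original and the relit image.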