Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Volumetric scene representations enable photorealistic view synthesis for static scenes.
  • Existing methods fail to simultaneously achieve real-time performance, small memory footprint, and high-quality rendering for challenging real-world scenes.
  • HyperReel is a novel 6-DoF video representation.
  • HyperReel has two core components: a ray-conditioned sample prediction network and a compact, memory-efficient dynamic volume representation.
  • HyperReel achieves the best visual quality among prior and contemporary approaches while keeping memory requirements small.
  • HyperReel renders up to 18 frames-per-second at megapixel resolution without any custom CUDA code.

Paper Content

Introduction

  • 6-DoF videos allow for free exploration of an environment
  • View synthesis is the process of rendering new views from posed images or videos
  • Recent works have made strides towards photorealistic view synthesis for static scenes
  • Dynamic view synthesis is a challenging task
  • Existing approaches take a long time to render a single image
  • HyperReel is a novel 6-DoF video representation that is memory efficient and real-time renderable
  • HyperReel uses a sample prediction network and a memory-efficient dynamic volume representation
  • HyperReel outperforms existing works and provides high-quality renderings for non-Lambertian scenes
  • HyperReel renders at up to 18 frames-per-second at megapixel resolution
  • Novel view synthesis is the process of creating new views of a scene from a set of posed images
  • Deep learning and neural fields are used to improve image-based rendering
  • 3D scene representations augmented with appearance information can be used instead of image-based rendering
  • NeRFs are a 3D scene representation for view synthesis
  • NeRFs enable high-quality view synthesis but do not lend themselves to real-time rendering
  • Recent works improve NeRFs by accounting for finite pixels and apertures, enabling application to unbounded scenes, and modifying the representation to allow for better reproduction of challenging view-dependent appearances
  • Adaptive sampling for neural volume rendering can improve the speed of volumetric representations
  • 6-DoF video allows users to explore new views within videos
  • 6-DoF video can be reconstructed from single-view RGB sequences
  • 6-DoF video from multi-view camera rigs is a challenging task that requires high visual quality, rendering speed, and memory efficiency

Method

  • The starting point is the problem of optimizing a volumetric representation for static view synthesis
  • Volume representations like NeRF model the density and appearance of a static scene at every point in 3D space
  • A function F_θ maps a position x and direction ω along a ray to a color L_e(x, ω) and a density σ(x)
  • The parameters θ may be neural network weights, N-dimensional array entries, or a combination of both
  • Render new views of a static scene with numerical quadrature by taking many sample points along a given ray
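
The quadrature step above can be illustrated with a minimal sketch. It assumes the standard NeRF-style compositing rule (per-segment opacity from density, accumulated transmittance); the function name `render_ray` and its argument layout are hypothetical and not taken from the paper.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray (assumed NeRF-style quadrature).

    densities: density sigma at each sample point (non-negative floats)
    colors:    (r, g, b) tuple at each sample point
    deltas:    length of the ray segment covered by each sample
    Returns the rendered color and the per-sample weights w_k.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha               # contribution weight w_k
        weights.append(w)
        for c in range(3):
            pixel[c] += w * color[c]
        transmittance *= 1.0 - alpha
    return tuple(pixel), weights

# Two empty samples, then an opaque red surface, then a hidden green one.
pixel, weights = render_ray(
    densities=[0.0, 0.0, 1e3, 1e3],
    colors=[(0, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0)],
    deltas=[0.5, 0.5, 0.5, 0.5],
)
```

In this example only the third sample receives a non-zero weight, which is why predicting where such samples lie can save most of the queries along a ray.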

Sample networks for volume rendering

  • Most scenes consist of solid objects whose surfaces lie on a 2D manifold within the 3D scene volume
  • To accelerate volume rendering, only points with non-zero quadrature weights w_k need to be queried
  • A feed-forward network is used to predict a set of sample locations x_k
  • The Plücker parameterization is used to represent the ray
  • The sample prediction network maps a ray to the sample points x_k
  • Per-sample-point appearance features are converted to colors
  • Per-sample-point opacities are also extracted
  • A set of Tanh-activated per-sample-point offsets and scalar values are predicted to better represent challenging view-dependent appearance

Keyframe-based dynamic volumes

  • We use Tensorial Radiance Fields (TensoRF) to represent a 3D scene volume in the static case.
  • We extend TensoRF to a keyframe-based dynamic volume representation for the dynamic case.
  • TensoRF factorizes a 3D volume as a set of outer products between functions of one or more spatial dimensions.
  • We extend the sample prediction network to take the current time as input and to output per-sample velocities.
  • We use a forward-Euler step to advect the sample points into the nearest keyframe.
  • We query the keyframe-based volume with sample points and render the volume at time τ.
  • HyperReel is our 6-DoF video representation.
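
The forward-Euler advection step can be sketched as follows. This is a minimal illustration that assumes each sample point moves linearly with its predicted velocity over the time gap to the nearest keyframe; the function names and the list-of-tuples layout are hypothetical.

```python
def nearest_keyframe(tau, keyframe_times):
    """Return the keyframe time closest to the query time tau."""
    return min(keyframe_times, key=lambda t: abs(t - tau))

def advect_to_keyframe(points, velocities, tau, keyframe_times):
    """Move sample points from time tau into the nearest keyframe.

    Applies one forward-Euler step, x' = x + v * (tau_i - tau), where
    tau_i is the nearest keyframe time. Returns the moved points and tau_i.
    """
    tau_i = nearest_keyframe(tau, keyframe_times)
    dt = tau_i - tau
    moved = [tuple(x + v * dt for x, v in zip(point, velocity))
             for point, velocity in zip(points, velocities)]
    return moved, tau_i

# A point observed at time 5 is carried back to the keyframe at time 4.
moved, tau_i = advect_to_keyframe(
    points=[(1.0, 0.0, 0.0)],
    velocities=[(0.5, 0.0, 0.0)],
    tau=5.0,
    keyframe_times=[0.0, 4.0, 8.0],
)
```

The advected points would then be used to query the volume stored at keyframe `tau_i`, so only the keyframes need to be kept in memory rather than one volume per video frame.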

Optimization

  • Optimize representation using only training images
  • Apply total variation and sparsity regularization to tensor components
  • Sum loss over training rays and times
  • Use subset of training rays to make optimization tractable
  • Alternate between using all training rays and downsampled images for frame numbers divisible by 4
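
To make the regularizers concrete, here is a minimal sketch of a total variation term and a sparsity term on a single 2D tensor component stored as a nested list. Using an L1 penalty for sparsity, the squared-difference form of total variation, and the weight values are all assumptions for illustration, not settings reported in the paper.

```python
def total_variation(plane):
    """Mean squared difference between neighbouring entries of a 2D grid."""
    rows, cols = len(plane), len(plane[0])
    total, count = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:  # vertical neighbour
                total += (plane[i + 1][j] - plane[i][j]) ** 2
                count += 1
            if j + 1 < cols:  # horizontal neighbour
                total += (plane[i][j + 1] - plane[i][j]) ** 2
                count += 1
    return total / count if count else 0.0

def l1_sparsity(plane):
    """Mean absolute value of the entries (assumed L1 sparsity penalty)."""
    values = [abs(v) for row in plane for v in row]
    return sum(values) / len(values)

def regularized_loss(data_loss, components, tv_weight=0.05, l1_weight=1e-4):
    """Add weighted regularizers to a photometric loss (weights hypothetical)."""
    tv = sum(total_variation(p) for p in components)
    l1 = sum(l1_sparsity(p) for p in components)
    return data_loss + tv_weight * tv + l1_weight * l1

# One small component with a sharp horizontal edge.
loss = regularized_loss(1.0, [[[0.0, 1.0], [0.0, 1.0]]],
                        tv_weight=0.1, l1_weight=0.01)
```

The total variation term penalizes abrupt changes between neighbouring entries, while the sparsity term pushes unused entries toward zero.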

Experiments

  • Implemented in PyTorch
  • Single NVIDIA RTX 3090 GPU with 24 GB RAM
  • 6-layer, 256-hidden unit MLP with Leaky ReLU activations
  • 32 z-planes predicted for forward-facing scenes
  • Radii of 32 spherical shells predicted for other settings
  • Batch size of 16,384 rays for training
  • Initial learning rate of 0.02 for keyframe-based volume
  • Initial learning rate of 0.0075 for sample prediction network
  • Trained for 1.5 hours each
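
The z-plane and spherical-shell settings above suggest that sample points come from intersecting each ray with predicted primitives. The sketch below shows that geometry under stated assumptions: planes are axis-aligned at a given z, shells are centered at the origin, and the far intersection is used for a camera inside the shell. The function names are hypothetical.

```python
import math

def intersect_z_plane(origin, direction, z):
    """Point where a ray meets the plane at depth z, or None if it misses."""
    if abs(direction[2]) < 1e-12:
        return None  # ray is parallel to the plane
    t = (z - origin[2]) / direction[2]
    if t <= 0:
        return None  # plane lies behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))

def intersect_sphere(origin, direction, radius):
    """Far intersection of a ray with an origin-centered sphere, or None."""
    a = sum(d * d for d in direction)
    b = 2.0 * sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # ray misses the sphere
    t = (-b + math.sqrt(disc)) / (2.0 * a)  # larger root = exit point
    if t <= 0:
        return None
    return tuple(o + t * d for o, d in zip(origin, direction))

# One sample per predicted primitive; here two planes and one shell.
plane_samples = [intersect_z_plane((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), z)
                 for z in (-1.0, -2.0)]
shell_sample = intersect_sphere((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 3.0)
```

With 32 predicted planes or shell radii per ray, a scheme like this would produce 32 sample points per ray, far fewer than dense uniform sampling typically uses.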

Comparisons on static scenes

  • DoNeRF dataset contains 6 synthetic sequences with 800x800 pixel resolution
  • Our approach outperforms existing methods for static view synthesis
  • Our model renders 800x800 pixel images at 6.5 FPS on a single RTX 3090 GPU
  • Our approach outperforms R2L on the downsampled 400x400 resolution DoNeRF dataset
  • LLFF dataset contains 8 real-world sequences with 1008x756 pixel images
  • Our approach outperforms DoNeRF, AdaNeRF, TermiNeRF, and InstantNGP but achieves slightly worse quality than NeRF
  • Our approach performs slightly worse than R2L on the downsampled 504x378 LLFF dataset

Comparisons on dynamic scenes

  • Technicolor Dataset contains videos of varied indoor environments captured by a 4x4 camera rig
  • Neural 3D Video dataset contains six indoor multi-view video sequences captured by 20 cameras
  • Compared HyperReel to existing 3D video methods on three light-field/multi-view video datasets
  • Google Immersive dataset contains light field videos of various indoor and outdoor environments captured by a rig of 46 fisheye cameras
  • Compared HyperReel to DeepView on the static Spaces dataset
  • HyperReel outperforms all baseline approaches on all datasets
  • Ablated HyperReel on the Technicolor light field dataset with different numbers of keyframes
  • Compared HyperReel with and without sample prediction network and point offset
  • HyperReel achieves a balance between high rendering quality, speed, and memory efficiency
  • Limitations and future work discussed