Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Volumetric scene representations enable photorealistic view synthesis for static scenes.
- Existing methods fail to simultaneously achieve real-time performance, small memory footprint, and high-quality rendering for challenging real-world scenes.
- HyperReel is a novel 6-DoF video representation.
- HyperReel has two core components: a ray-conditioned sample prediction network and a compact and memory efficient dynamic volume representation.
- HyperReel achieves the best visual quality among prior and contemporary approaches while keeping memory requirements small.
- HyperReel renders up to 18 frames-per-second at megapixel resolution without any custom CUDA code.
Paper Content
Introduction
- 6-DoF videos allow for free exploration of an environment
- View synthesis is the process of rendering new views from posed images or videos
- Recent works have made strides towards photorealistic view synthesis for static scenes
- Dynamic view synthesis is a challenging task
- Existing approaches take a long time to render a single image
- HyperReel is a novel 6-DoF video representation that is memory efficient and real-time renderable
- HyperReel uses a sample prediction network and a memory-efficient dynamic volume representation
- HyperReel outperforms existing works and provides high-quality renderings for non-Lambertian scenes
- HyperReel renders at up to 18 frames-per-second at megapixel resolution
Related work
- Novel view synthesis is the process of creating new views of a scene from a set of posed images
- Deep learning and neural fields are used to improve image-based rendering
- 3D scene representations augmented with appearance information can be used instead of image-based rendering
- NeRFs are a 3D scene representation for view synthesis
- NeRFs enable high-quality view synthesis but do not lend themselves to real-time rendering
- Recent works improve NeRFs by accounting for finite pixels and apertures, enabling application to unbounded scenes, and modifying the representation to allow for better reproduction of challenging view-dependent appearances
- Adaptive sampling for neural volume rendering can improve the speed of volumetric representations
- 6-DoF video allows users to explore new views within videos
- 6-DoF video can be reconstructed from single-view RGB sequences
- 6-DoF video from multi-view camera rigs is a challenging task that requires high visual quality, rendering speed, and memory efficiency
Method
- The starting point is the problem of optimizing a volumetric representation for static view synthesis
- Volume representations like NeRF model the density and appearance of a static scene at every point in 3D space
- A function F_θ maps position x and direction ω along a ray to a color L_e(x, ω) and density σ(x)
- Parameters θ may be neural network weights, N-dimensional array entries, or a combination of both
- Render new views of a static scene with numerical quadrature by taking many sample points along a given ray
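The quadrature step can be illustrated with the standard NeRF-style compositing rule, where each sample k contributes to the pixel with a weight w_k that depends on its own opacity and on how much light survives the samples in front of it. This is a minimal sketch in plain Python rather than the paper's PyTorch implementation; the function name `render_ray` and the sample values are hypothetical.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite per-sample densities and RGB colors along one ray.

    densities: sigma_k >= 0 for each sample
    colors:    (r, g, b) tuple for each sample
    deltas:    length of the ray segment covered by each sample
    Returns the pixel color and the per-sample weights w_k.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        for c in range(3):
            pixel[c] += w * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel, weights

# Hypothetical ray: empty space, then a nearly opaque red surface,
# then a green surface hidden behind it.
pixel, weights = render_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[0.5, 0.5, 0.5],
)
print(pixel, weights)
```

In this toy ray almost all of the weight falls on the single sample at the first surface, which is why evaluating many samples in empty or occluded space is wasted work.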
Sample networks for volume rendering
- Most scenes consist of solid objects whose surfaces lie on a 2D manifold within the 3D scene volume
- To accelerate volume rendering, color and opacity should only be queried at points with non-zero weight w_k
- A feed-forward network is used to predict a set of sample locations x_k
- The Plücker parameterization is used to represent the ray
- The sample prediction network maps a ray to the sample points x_k
- Per-sample-point appearance features are converted to colors
- Per-sample-point opacities are also extracted
- A set of Tanh-activated per-sample-point offsets and scalar values are predicted to better represent challenging view-dependent appearance
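The Plücker parameterization mentioned above can be sketched as follows. It describes a ray by its normalized direction together with a moment vector, so that the result does not depend on which point of the ray is used as the origin. The ordering of the six coordinates and the sign convention of the moment (origin × direction here) are assumptions of this sketch and may differ from the paper's implementation.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def plucker(origin, direction):
    """Return 6-D Plücker coordinates (unit direction, moment) of a ray."""
    norm = math.sqrt(sum(c * c for c in direction))
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)  # assumed convention: origin x direction
    return d + moment

# Two different origins on the same ray yield the same coordinates.
o1 = (0.0, 1.0, 2.0)
d = (1.0, 2.0, 2.0)
o2 = tuple(o + 3.5 * c for o, c in zip(o1, d))  # slide along the ray
r1 = plucker(o1, d)
r2 = plucker(o2, d)
print(r1)
print(r2)
```

Because the encoding is the same for every origin along the ray, a network that consumes it sees one consistent input per ray regardless of where the camera sits on that ray.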
Keyframe-based dynamic volumes
- We use Tensorial Radiance Fields (TensoRF) to represent a 3D scene volume in the static case.
- We extend TensoRF to a keyframe-based dynamic volume representation for the dynamic case.
- TensoRF factorizes a 3D volume as a set of outer products between functions of one or more spatial dimensions.
- We extend the sample prediction network to take the current time as input and to output a velocity for each sample point.
- We use a forward-Euler step to advect the sample points into the nearest keyframe.
- We query the keyframe-based volume with sample points and render the volume at time τ.
- HyperReel is our 6-DoF video representation.
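The advection step can be sketched as a single forward-Euler update: each sample point is moved along its velocity by the time difference between the current frame and the nearest keyframe, and the moved point is then used to query that keyframe's volume. In the paper the velocities come from the sample prediction network; here they are hand-picked, and the function names and time units are hypothetical.

```python
def nearest_keyframe(t, keyframe_times):
    """Return the keyframe time closest to t."""
    return min(keyframe_times, key=lambda k: abs(k - t))

def advect_to_keyframe(points, velocities, t, keyframe_times):
    """Move sample points from time t into the nearest keyframe.

    Uses one forward-Euler step: x_key = x + v * (t_key - t).
    """
    t_key = nearest_keyframe(t, keyframe_times)
    dt = t_key - t
    moved = [
        tuple(p + v * dt for p, v in zip(point, velocity))
        for point, velocity in zip(points, velocities)
    ]
    return moved, t_key

# Hypothetical setup: keyframes every 4 frames, query at frame 5.
moved, t_key = advect_to_keyframe(
    points=[(1.0, 2.0, 3.0)],
    velocities=[(0.5, 0.0, -1.0)],
    t=5.0,
    keyframe_times=[0.0, 4.0, 8.0],
)
print(moved, t_key)
```

Storing volumes only at keyframes and warping samples into them is what keeps the dynamic representation compact: frames between keyframes reuse the keyframe data instead of needing their own volume.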
Optimization
- Optimize representation using only training images
- Apply total variation and sparsity regularization to tensor components
- Sum loss over training rays and times
- Use subset of training rays to make optimization tractable
- Use all training rays for frames whose number is divisible by 4, and rays from downsampled images for the remaining frames
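The two regularizers named above can be illustrated on a small 2D grid standing in for one tensor component. Total variation penalizes differences between neighboring entries, and the sparsity term penalizes the magnitude of every entry. This is a sketch under assumptions: the squared-difference form of total variation, the weights `tv_weight` and `l1_weight`, and the function names are illustrative and not taken from the paper.

```python
def total_variation(grid):
    """Sum of squared differences between vertically and horizontally adjacent cells."""
    rows, cols = len(grid), len(grid[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:
                tv += (grid[i + 1][j] - grid[i][j]) ** 2
            if j + 1 < cols:
                tv += (grid[i][j + 1] - grid[i][j]) ** 2
    return tv

def l1_sparsity(grid):
    """Sum of absolute values of all cells."""
    return sum(abs(v) for row in grid for v in row)

def regularizer(grid, tv_weight=0.05, l1_weight=0.01):
    """Weighted sum of both penalties (weights are hypothetical)."""
    return tv_weight * total_variation(grid) + l1_weight * l1_sparsity(grid)

# A single spike is penalized by both terms; a constant grid only by sparsity.
spike = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
print(total_variation(spike), l1_sparsity(spike), regularizer(spike))
print(total_variation(flat), l1_sparsity(flat))
```

During optimization a term like `regularizer` would be added to the photometric loss that is summed over training rays and times.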
Experiments
- Implemented in PyTorch
- Single NVIDIA RTX 3090 GPU with 24 GB RAM
- 6-layer, 256-hidden unit MLP with Leaky ReLU activations
- 32 z-planes predicted for forward-facing scenes
- Radii of 32 spherical shells predicted for other settings
- Batch size of 16,384 rays for training
- Initial learning rate of 0.02 for keyframe-based volume
- Initial learning rate of 0.0075 for sample prediction network
- Trained for 1.5 hours each
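The z-planes and spherical shells listed above become sample points by intersecting each ray with them. The sketch below shows both intersections in plain Python. In the paper the plane depths and shell radii are predicted per ray by the network; here they are hand-picked, the spheres are assumed to be centered at the world origin, and the function names are hypothetical.

```python
import math

def intersect_z_plane(origin, direction, z):
    """Point where the ray hits the plane at depth z, or None if it never does."""
    if abs(direction[2]) < 1e-12:
        return None  # ray is parallel to the plane
    t = (z - origin[2]) / direction[2]
    if t < 0:
        return None  # plane lies behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))

def intersect_sphere(origin, direction, radius):
    """Far intersection of the ray with an origin-centered sphere, or None."""
    a = sum(d * d for d in direction)
    b = 2.0 * sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # ray misses the sphere
    t = (-b + math.sqrt(disc)) / (2.0 * a)  # larger root: exit point
    if t < 0:
        return None  # sphere lies behind the ray origin
    return tuple(o + t * d for o, d in zip(origin, direction))

# Forward-facing case: one sample point per z-plane.
plane_points = [
    intersect_z_plane((0.0, 0.0, 0.0), (0.5, 0.0, 1.0), z) for z in (1.0, 2.0, 4.0)
]
# Other settings: one sample point per spherical shell.
shell_point = intersect_sphere((0.5, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0)
print(plane_points, shell_point)
```

With 32 planes or shells per ray, each ray yields 32 sample points, far fewer than the dense sampling that uniform quadrature along the ray would need.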
Comparisons on static scenes
- DoNeRF dataset contains 6 synthetic sequences with 800x800 pixel resolution
- Our approach outperforms existing methods for static view synthesis
- Our model renders 800x800 pixel images at 6.5 FPS on a single RTX 3090 GPU
- Our approach outperforms R2L on the downsampled 400x400 resolution DoNeRF dataset
- LLFF dataset contains 8 real-world sequences with 1008x756 pixel images
- Our approach outperforms DoNeRF, AdaNeRF, TermiNeRF, and InstantNGP but achieves slightly worse quality than NeRF
- Our approach performs slightly worse than R2L on the downsampled 504x378 LLFF dataset
Comparisons on dynamic scenes
- Technicolor Dataset contains videos of varied indoor environments captured by a 4x4 camera rig
- Neural 3D Video dataset contains six indoor multi-view video sequences captured by 20 cameras
- Compared HyperReel to existing 3D video methods on three light-field/multi-view video datasets
- Google Immersive dataset contains light field videos of various indoor and outdoor environments captured by a rig of 46 fisheye cameras
- Compared HyperReel to DeepView on the static Spaces dataset
- HyperReel outperforms all baseline approaches on all datasets
- Ablated HyperReel on the Technicolor light field dataset with different numbers of keyframes
- Compared HyperReel with and without sample prediction network and point offset
- HyperReel achieves a balance between high rendering quality, speed, and memory efficiency
- Limitations and future work discussed