Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Neural graphics primitives can be costly to train and evaluate.
  • A new input encoding is used to reduce cost without sacrificing quality.
  • The encoding uses a small neural network and a multiresolution hash table.
  • The system is implemented using fully-fused CUDA kernels.
  • Training is done in a matter of seconds and rendering in tens of milliseconds.

Paper Content

Introduction

  • Computer graphics primitives are represented by mathematical functions
  • Representations need to be fast and compact while capturing high-frequency detail
  • MLPs have been used as neural graphics primitives
  • Encoding maps neural network inputs to higher-dimensional space
  • Trainable, task-specific data structures used to enable smaller, more efficient MLPs
  • Multiresolution hash encoding is adaptive and efficient, independent of task
  • Hash tables prioritize sparse areas with most important fine scale detail
  • Hash table lookups are O(1) and don’t require control flow
  • One-hot encoding and kernel trick can be used to make complex data arrangements linearly separable
  • Frequency encodings have been used in computer graphics tasks such as approximating the visibility function
  • One-blob encoding can achieve more accurate results than frequency encodings in bounded domains

Parametric encodings.

  • Recently, state-of-the-art results have been achieved by parametric encodings which blur the line between classical data structures and neural approaches
  • Parameters are arranged in an auxiliary data structure, such as a grid or a tree
  • Computational cost is reduced as only a small number of parameters need to be updated for each sample
  • Parametric models can be trained to convergence faster without sacrificing accuracy
  • Dense grids of trainable features consume more memory than neural network weights
  • Frequency encoding allows a moderately sized network to represent a scene more accurately
  • Natural scenes exhibit smoothness, motivating the use of a multi-resolution decomposition
  • Sparse parametric encodings reduce waste by allocating fewer features to areas of empty space
  • Spatial hash table is used to reduce parameters while maintaining reconstruction quality
  • Neural network is used to disambiguate hash collisions
  • Hash tables have predictable memory layout and can be fine-tuned for low-level architectural details

Multiresolution hash encoding

  • Fully connected neural network with trainable weight and encoding parameters
  • Encoding parameters arranged into levels, each containing up to feature vectors with dimensionality
  • Hyperparameters typically shown in Table 1
  • Steps performed in multiresolution hash encoding illustrated in Figure 3
  • Feature vectors at each corner -linearly interpolated according to relative position of x within its hypercube
  • Performance vs. quality trade-off determined by hash table size
  • Implicit hash collision resolution due to different resolution levels having different strengths
  • Recommended = 2, = 16 as default
  • Interpolation ensures encoding and composition with neural network are continuous
  • Higher-order smoothness can be achieved with low-cost approach in Appendix A

Implementation

  • Implemented multiresolution hash encoding in CUDA
  • Source code released as an update to Müller [2021]
  • Hash table entries stored at half precision (2 bytes per entry)
  • Master copy of parameters in full precision for stable mixed-precision parameter updates
  • Computation scheduled to look up hash tables level by level
  • Performance remains roughly constant with hash table size ≤ 2^19
  • MLP with two hidden layers of width 64 neurons
  • Weights initialized according to Glorot and Bengio [2010]
  • Hash table entries initialized using uniform distribution U (−10^-4, 10^-4)
  • Jointly train neural network weights and hash table entries using Adam
  • L2 loss for gigapixel images and NeRF, MAPE for signed distance functions, luminance-relative L2 loss for neural radiance caching
  • Learning rate of 10^-4 for signed distance functions and 10^-2 otherwise
  • Skip Adam steps for hash table entries whose gradient is 0

Experiments

  • Encoding is versatile and of high quality
  • Compared to previous encodings
  • Used in four distinct computer graphics primitives
  • Primitives benefit from encoding spatial coordinates

Gigapixel image approximation

  • Learning the 2D to RGB mapping of image coordinates to colors is a popular benchmark for testing a model’s ability to represent high-frequency detail.
  • Adaptive coordinate networks (ACORN) have shown impressive results when fitting very large images.
  • Multiresolution hash encoding can achieve high-fidelity images in seconds to minutes.

Signed distance functions

  • Signed distance functions (SDFs) are used in many applications
  • DeepSDF uses a large MLP to represent one or more SDFs
  • Spatially learned encoding can be used with a smaller MLP
  • NGLOD uses a lookup from an octree of trainable feature vectors
  • Frequency encoding is used as a baseline
  • NGLOD achieves highest visual reconstruction quality
  • Our encoding approaches similar fidelity to NGLOD with similar performance and memory cost

Neural radiance caching

  • Neural Radiance Caching (NRC) is a task of predicting photorealistic pixel colors from feature buffers
  • MLP is run independently for each pixel, so feature buffers can be treated as per-pixel feature vectors
  • Multiresolution hash encoding is applied to 3D coordinate x and additional features are treated as auxiliary encoded dimensions
  • Visualizing the NRC at the first path vertex shows that the multiresolution hash encoding results in sharper reconstruction
  • Performance overhead of 0.7 ms reduces frame rate from 147 to 133 FPS at a resolution of 1920 × 1080px

Neural radiance and density fields (nerf)

  • NeRF setting represents a volumetric shape with a 3D density function and a 5D emission function
  • Model architecture consists of two concatenated MLPs: a density MLP and a color MLP
  • Ray marching is used to place samples near surfaces
  • Training and rendering can be done in seconds and at 60 FPS
  • Results compared to NeRF, mip-NeRF, and NSVF
  • Speedup attributed to hash encoding
  • Ablation shows MLP better at capturing specular effects and resolving hash collisions
  • MLP only 15% more expensive than linear layer

Mic

  • Our NeRF implementation with multiresolution hash encoding is competitive with NeRF, mip-NeRF, and NSVF after just 15 seconds of training.
  • Our method performs best on scenes with high geometric detail.
  • mip-NeRF and NSVF outperform our method on scenes with complex, view-dependent reflections.
  • Our hash encoding and smaller MLP cause a 20-60x improvement over competing implementations.
  • Our hash encoding allows for independent, fully parallel processing of each resolution.
  • We chose our hash function for its good mixture of efficiency, look-up coherence, and uniform coverage.
  • We prefer concatenation over reduction for the encoded result.
  • We experimented with three other hash functions, but none yielded a higher-quality reconstruction.
  • We believe the key to achieving state-of-the-art quality on SDFs with our encoding is to find a way to overcome microstructure caused by hash collisions.
  • We are interested in applying the multiresolution hash encoding to other low-dimensional tasks.
  • We have included a preliminary implementation that fits a radiance and density field directly from the noisy output of a volumetric path tracer.

Conclusion

  • Many graphics problems rely on task specific data structures
  • Our multi-resolution hash encoding provides a learning-based alternative
  • Low overhead allows it to be used in time-constrained settings
  • Speeds up NeRF by several orders of magnitude
  • Single-GPU training times measured in seconds
  • Smoothstep function used to reduce discontinuities
  • Offset each level by half of its voxel size to prevent zero derivatives
  • Octree sampling for NGLOD
  • Signed distance computed by triangle BVH and 32 stab rays
  • MLP prefixed by frequency encoding used as baseline
  • Ray marching step size set proportional to distance along ray
  • Skipping of empty space and occluded regions
  • Compaction of samples into dense buffers
  • Exponential stepping for large scenes
  • Warping of 3D domain not used
  • Occupancy grids used to skip ray marching steps