Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Neural graphics primitives can be costly to train and evaluate.
- A new input encoding is used to reduce cost without sacrificing quality.
- The encoding uses a small neural network and a multiresolution hash table.
- The system is implemented using fully-fused CUDA kernels.
- Training is done in a matter of seconds and rendering in tens of milliseconds.
Paper Content
Introduction
- Computer graphics primitives are represented by mathematical functions
- Representations need to be fast and compact while capturing high-frequency detail
- MLPs have been used as neural graphics primitives
- Encoding maps neural network inputs to higher-dimensional space
- Trainable, task-specific data structures used to enable smaller, more efficient MLPs
- Multiresolution hash encoding is adaptive and efficient, independent of task
- Hash tables prioritize sparse areas with most important fine scale detail
- Hash table lookups are O(1) and don’t require control flow
Background and related work
- One-hot encoding and kernel trick can be used to make complex data arrangements linearly separable
- Frequency encodings have been used in computer graphics tasks such as approximating the visibility function
- One-blob encoding can achieve more accurate results than frequency encodings in bounded domains
Parametric encodings.
- Recently, state-of-the-art results have been achieved by parametric encodings which blur the line between classical data structures and neural approaches
- Parameters are arranged in an auxiliary data structure, such as a grid or a tree
- Computational cost is reduced as only a small number of parameters need to be updated for each sample
- Parametric models can be trained to convergence faster without sacrificing accuracy
- Dense grids of trainable features consume more memory than neural network weights
- Frequency encoding allows a moderately sized network to represent a scene more accurately
- Natural scenes exhibit smoothness, motivating the use of a multi-resolution decomposition
- Sparse parametric encodings reduce waste by allocating fewer features to areas of empty space
- Spatial hash table is used to reduce parameters while maintaining reconstruction quality
- Neural network is used to disambiguate hash collisions
- Hash tables have predictable memory layout and can be fine-tuned for low-level architectural details
Multiresolution hash encoding
- Fully connected neural network with trainable weight and encoding parameters
- Encoding parameters arranged into levels, each containing up to feature vectors with dimensionality
- Hyperparameters typically shown in Table 1
- Steps performed in multiresolution hash encoding illustrated in Figure 3
- Feature vectors at each corner -linearly interpolated according to relative position of x within its hypercube
- Performance vs. quality trade-off determined by hash table size
- Implicit hash collision resolution due to different resolution levels having different strengths
- Recommended = 2, = 16 as default
- Interpolation ensures encoding and composition with neural network are continuous
- Higher-order smoothness can be achieved with low-cost approach in Appendix A
Implementation
- Implemented multiresolution hash encoding in CUDA
- Source code released as an update to Müller [2021]
- Hash table entries stored at half precision (2 bytes per entry)
- Master copy of parameters in full precision for stable mixed-precision parameter updates
- Computation scheduled to look up hash tables level by level
- Performance remains roughly constant with hash table size ≤ 2^19
- MLP with two hidden layers of width 64 neurons
- Weights initialized according to Glorot and Bengio [2010]
- Hash table entries initialized using uniform distribution U (−10^-4, 10^-4)
- Jointly train neural network weights and hash table entries using Adam
- L2 loss for gigapixel images and NeRF, MAPE for signed distance functions, luminance-relative L2 loss for neural radiance caching
- Learning rate of 10^-4 for signed distance functions and 10^-2 otherwise
- Skip Adam steps for hash table entries whose gradient is 0
Experiments
- Encoding is versatile and of high quality
- Compared to previous encodings
- Used in four distinct computer graphics primitives
- Primitives benefit from encoding spatial coordinates
Gigapixel image approximation
- Learning the 2D to RGB mapping of image coordinates to colors is a popular benchmark for testing a model’s ability to represent high-frequency detail.
- Adaptive coordinate networks (ACORN) have shown impressive results when fitting very large images.
- Multiresolution hash encoding can achieve high-fidelity images in seconds to minutes.
Signed distance functions
- Signed distance functions (SDFs) are used in many applications
- DeepSDF uses a large MLP to represent one or more SDFs
- Spatially learned encoding can be used with a smaller MLP
- NGLOD uses a lookup from an octree of trainable feature vectors
- Frequency encoding is used as a baseline
- NGLOD achieves highest visual reconstruction quality
- Our encoding approaches similar fidelity to NGLOD with similar performance and memory cost
Neural radiance caching
- Neural Radiance Caching (NRC) is a task of predicting photorealistic pixel colors from feature buffers
- MLP is run independently for each pixel, so feature buffers can be treated as per-pixel feature vectors
- Multiresolution hash encoding is applied to 3D coordinate x and additional features are treated as auxiliary encoded dimensions
- Visualizing the NRC at the first path vertex shows that the multiresolution hash encoding results in sharper reconstruction
- Performance overhead of 0.7 ms reduces frame rate from 147 to 133 FPS at a resolution of 1920 × 1080px
Neural radiance and density fields (nerf)
- NeRF setting represents a volumetric shape with a 3D density function and a 5D emission function
- Model architecture consists of two concatenated MLPs: a density MLP and a color MLP
- Ray marching is used to place samples near surfaces
- Training and rendering can be done in seconds and at 60 FPS
- Results compared to NeRF, mip-NeRF, and NSVF
- Speedup attributed to hash encoding
- Ablation shows MLP better at capturing specular effects and resolving hash collisions
- MLP only 15% more expensive than linear layer
Mic
- Our NeRF implementation with multiresolution hash encoding is competitive with NeRF, mip-NeRF, and NSVF after just 15 seconds of training.
- Our method performs best on scenes with high geometric detail.
- mip-NeRF and NSVF outperform our method on scenes with complex, view-dependent reflections.
- Our hash encoding and smaller MLP cause a 20-60x improvement over competing implementations.
- Our hash encoding allows for independent, fully parallel processing of each resolution.
- We chose our hash function for its good mixture of efficiency, look-up coherence, and uniform coverage.
- We prefer concatenation over reduction for the encoded result.
- We experimented with three other hash functions, but none yielded a higher-quality reconstruction.
- We believe the key to achieving state-of-the-art quality on SDFs with our encoding is to find a way to overcome microstructure caused by hash collisions.
- We are interested in applying the multiresolution hash encoding to other low-dimensional tasks.
- We have included a preliminary implementation that fits a radiance and density field directly from the noisy output of a volumetric path tracer.
Conclusion
- Many graphics problems rely on task specific data structures
- Our multi-resolution hash encoding provides a learning-based alternative
- Low overhead allows it to be used in time-constrained settings
- Speeds up NeRF by several orders of magnitude
- Single-GPU training times measured in seconds
- Smoothstep function used to reduce discontinuities
- Offset each level by half of its voxel size to prevent zero derivatives
- Octree sampling for NGLOD
- Signed distance computed by triangle BVH and 32 stab rays
- MLP prefixed by frequency encoding used as baseline
- Ray marching step size set proportional to distance along ray
- Skipping of empty space and occluded regions
- Compaction of samples into dense buffers
- Exponential stepping for large scenes
- Warping of 3D domain not used
- Occupancy grids used to skip ray marching steps