Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recovery of scene geometry from multiview images is a challenge in computer vision research.
- Recent methods leverage neural implicit surface learning and differentiable volume rendering.
- Traditional multi-view stereo can recover geometry of scenes with rich textures.
- HelixSurf intertwines regularization from two strategies during learning process.
- HelixSurf is efficient and faster than existing methods.
Paper Content
Introduction
- Challenge in computer vision research: surface reconstruction from multi-view images
- Different paradigms of methods exist to address the challenge
- Multi-view stereo (MVS) methods recover properties of surface points by optimizing local pixel-wise correspondences
- Differentiable volume rendering connects observed images with neural modeling of implicit surface and radiance field
- HelixSurf combines MVS and neural implicit surface learning to regularize learning/optimization of one strategy using the other
- Regularizes learning on textureless surface areas by leveraging region-wise homogeneity of superpixels
- Adaptive point sampling along rays improves efficiency of differentiable volume rendering
Related works
PatchMatch based multi-view stereo
- 3D reconstruction from posed multi-view images is a challenging task in computer vision
- PatchMatch based Multi-view Stereo (PM-MVS) is traditionally the most explored technique
- PM-MVS methods represent the geometry with depth and/or normal maps
- Depth and/or normal of each pixel is estimated by exploiting inter-image photometric and geometric consistency
- Filtering operations are used to fuse all the depth maps into a global point cloud
- Meshing algorithms are used to recover complete surface
- Traditional methods have achieved great success but can produce artifacts and missing parts in textureless areas
- Deep learning-based MVS methods have demonstrated promising performance but rely on ground-truth 3D data for supervision
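To make the photometric-consistency idea concrete, here is a minimal sketch of a patch similarity score. It uses normalized cross-correlation (NCC), a common choice in PatchMatch-style MVS; whether the paper's MVS backbone uses exactly this score is an assumption, and the function name `ncc` is hypothetical.

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation between two equally sized patches, in [-1, 1]."""
    a = np.asarray(patch_a, dtype=np.float64).ravel()
    b = np.asarray(patch_b, dtype=np.float64).ravel()
    a = a - a.mean()  # remove brightness offset
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return float(a @ b / denom)

rng = np.random.default_rng(0)
textured = rng.random((7, 7))
brighter = 0.5 * textured + 0.2   # same texture under an affine brightness change
flat = np.full((7, 7), 0.5)       # textureless patch

score_textured = ncc(textured, brighter)  # close to 1: strong, reliable match
score_flat = ncc(flat, flat)              # close to 0: no usable signal
```

The flat patch scores near zero even against itself, which illustrates why purely photometric matching becomes unreliable in textureless areas.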
Neural implicit surface
- Recent works use neural networks to implicitly represent surfaces
- Neural networks output signed/unsigned distance fields or occupancy fields
- Surface rendering is used to reconstruct 3D shapes from 2D images
- Differentiable volume rendering techniques are used to eliminate the need for masks
- Follow-up works improve geometry quality with fine-grained surface details
- Deep networks suffer from a smoothness bias, which makes it hard for them to recover fine details
- Recent works incorporate geometric cues from pre-trained models to mitigate this problem
- HelixSurf integrates traditional PM-MVS and neural implicit surface learning for better results
Preliminary
- Neural Implicit Surface Representation is a way to encode a continuous surface as the zero-level set of a signed distance field
- Following DeepSDF, the signed distance field is parameterized by an MLP
- Differentiable volume rendering is used to synthesize novel views
- NeRF models a continuous scene space as a neural radiance field
- Volume density is modeled as a transformed function of the implicit SDF function
- Multi-View Stereo with PatchMatch is used to recover the scene geometry
- PatchMatch is used to establish pixel-wise correspondences across multiview images
- Color similarity and forward-backward reprojection error are used to evaluate the geometry
- Probability is used for view selection and Monte-Carlo view sampling is used to draw samples
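The link between the SDF and volume rendering can be sketched as follows. The summary does not state which transform the paper uses, so this sketch assumes a VolSDF-style Laplace CDF with a hypothetical scale `beta`; the function names are illustrative only.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.05):
    """Map signed distance to volume density via a Laplace CDF (assumed, VolSDF-style)."""
    sdf = np.asarray(sdf, dtype=np.float64)
    half_exp = 0.5 * np.exp(-np.abs(sdf) / beta)
    # Density is low outside the surface (sdf > 0) and saturates at 1/beta inside.
    cdf = np.where(sdf >= 0, half_exp, 1.0 - half_exp)
    return cdf / beta

def render_weights(density, t):
    """Quadrature weights w_i = T_i * (1 - exp(-sigma_i * delta_i)) along one ray."""
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))
    alpha = 1.0 - np.exp(-density * delta)
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    return transmittance * alpha

# A ray travelling straight toward a plane at depth 2.0, so sdf(t) = 2.0 - t.
t = np.linspace(0.5, 3.5, 256)
weights = render_weights(sdf_to_density(2.0 - t), t)
rendered_depth = float(np.sum(weights * t))  # lands near the true depth of 2.0
```

The same weights can be used to accumulate color, depth, or normals along the ray, which is what lets image supervision reach the SDF.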
HelixSurf for intertwined regularization of neural implicit surface learning
- Task is to reconstruct scene geometry with fine details
- MLP based radiance field function F connects scene geometry with image observations
- SDF-induced volume density used to reconstruct surface
- MLP based function has deep priors that induce a continuous, piecewise smooth surface
- PatchMatch based MVS methods couple predictions of {d_l, n_l} for individual pixels in a probabilistic framework
- Propose an integrated solution that takes advantage of both strategies
- Iterative intertwined regularization used during learning process
Regularization of neural implicit surface learning from MVS predictions
- Neural implicit surface learning samples rays in 3D space
- Rays emanate from camera center and pass through a pixel in an image
- A color-based loss supervises learning by comparing the color rendered along each ray with the observed pixel color
- Depth and surface normal can be computed from MVS prediction
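A minimal sketch of the two ingredients above: casting a ray through a pixel and penalizing disagreement with MVS depth and normals. It assumes a pinhole camera with intrinsics `K` and a camera-to-world matrix. The L1 depth term and the cosine normal term are illustrative choices, not the paper's confirmed loss.

```python
import numpy as np

def pixel_to_ray(u, v, K, cam_to_world):
    """Return (origin, unit direction) of the world-space ray through pixel (u, v)."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_world = cam_to_world[:3, :3] @ d_cam
    return cam_to_world[:3, 3].copy(), d_world / np.linalg.norm(d_world)

def mvs_regularization(depth_rendered, normal_rendered, depth_mvs, normal_mvs):
    """L1 depth term and (1 - cosine) normal term; the exact form is an assumption."""
    depth_term = float(abs(depth_rendered - depth_mvs))
    n_r = normal_rendered / np.linalg.norm(normal_rendered)
    n_m = normal_mvs / np.linalg.norm(normal_mvs)
    return depth_term, 1.0 - float(n_r @ n_m)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
origin, direction = pixel_to_ray(320, 240, K, np.eye(4))  # ray through the principal point

depth_term, normal_term = mvs_regularization(
    2.0, np.array([0.0, 0.0, -1.0]),   # rendered depth and normal
    2.0, np.array([0.0, 0.0, -2.0]),   # MVS depth and (unnormalized) normal
)
```

In a full pipeline the rendered depth and normal would come from volume rendering along the ray, and the terms would only be applied at pixels where the MVS prediction is considered reliable.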
Handling of textureless surface areas
- PatchMatch based MVS methods are reliable on texture-rich surface areas.
- Other sources are used to regularize neural implicit learning for textureless surface areas.
- Textureless surface areas tend to be homogeneous in color and geometrically smooth.
- Superpixels are used to further regularize neural implicit surface learning.
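One way to picture the superpixel regularization is a penalty on how far each pixel's normal strays from the dominant normal of its superpixel. The sketch below assumes a precomputed superpixel label map (for example from an over-segmentation algorithm) and uses the mean normal per superpixel; the paper's actual formulation may differ.

```python
import numpy as np

def superpixel_normal_penalty(normals, labels):
    """Mean (1 - cosine) deviation of each pixel normal from its superpixel's mean normal."""
    normals = normals / np.linalg.norm(normals, axis=-1, keepdims=True)
    flat_n = normals.reshape(-1, 3)
    flat_l = labels.ravel()
    penalty = np.zeros(flat_l.shape[0])
    for sp in np.unique(flat_l):
        mask = flat_l == sp
        mean_n = flat_n[mask].mean(axis=0)
        mean_n = mean_n / (np.linalg.norm(mean_n) + 1e-12)
        penalty[mask] = 1.0 - flat_n[mask] @ mean_n
    return float(penalty.mean())

# Two superpixels: a floor-like region (left half) and a wall-like region (right half).
labels = np.zeros((4, 4), dtype=int)
labels[:, 2:] = 1
clean = np.zeros((4, 4, 3))
clean[:, :2] = [0.0, 0.0, 1.0]
clean[:, 2:] = [1.0, 0.0, 0.0]

rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.normal(size=clean.shape)

penalty_clean = superpixel_normal_penalty(clean, labels)
penalty_noisy = superpixel_normal_penalty(noisy, labels)
```

Normals that are consistent within each superpixel incur almost no penalty, while noisy normals are pulled toward the region's shared orientation.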
Regularization of multi-view stereo from neural implicit surface learning
- Equation 3 of MVS methods optimizes depth and normal predictions.
- The prior P(d, n) is usually set to a uniform distribution.
- HelixSurf uses depth and normal learned in current iteration as prior.
- MVS methods with a uniform prior tend to produce noisy results with outliers.
- The proposed formulation (Equation 8) gives better results.
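The effect of an informed prior can be illustrated with a toy experiment on depth hypotheses. This is not Equation 8 from the paper: it simply compares hypotheses drawn uniformly over a depth range with hypotheses drawn around a depth rendered from the current implicit surface. All numbers here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
true_depth, d_min, d_max = 2.0, 0.5, 5.0
rendered_depth = 2.05   # hypothetical depth from the current implicit surface
n_hyp = 16              # hypotheses tested per pixel

# Uniform prior: hypotheses spread over the whole depth range.
uniform_hyp = rng.uniform(d_min, d_max, n_hyp)
# Informed prior: hypotheses concentrated around the rendered depth.
informed_hyp = np.clip(rng.normal(rendered_depth, 0.1, n_hyp), d_min, d_max)

mean_uniform_err = float(np.mean(np.abs(uniform_hyp - true_depth)))
mean_informed_err = float(np.mean(np.abs(informed_hyp - true_depth)))
```

Because the informed hypotheses cluster near the surface, far fewer of them are gross outliers, which matches the intuition behind feeding learned depth and normals back into MVS.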
Improving the efficiency by establishing dynamic space occupancies
- Differentiable volume rendering is computationally expensive.
- A coarse-to-fine sampling strategy is used to reduce cost.
- Proposed sampling scheme uses dynamic occupancies in 3D scene space to guide point sampling.
- Occupancy grids of size 64^3 are used to partition 3D scene space.
- Exponential moving average is used to update occupancy of voxels.
- Non-occupied voxels are skipped when performing point sampling.
- Scheme improves training efficiency by orders of magnitude.
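The occupancy bookkeeping described above can be sketched with a dense grid. The class below is a simplified illustration: the decay rate, the threshold, the scene bound, and the use of density as the occupancy signal are all assumptions, and `OccupancyGrid` is a hypothetical name.

```python
import numpy as np

class OccupancyGrid:
    def __init__(self, resolution=64, bound=1.0, decay=0.95, threshold=0.01):
        self.res, self.bound = resolution, bound
        self.decay, self.threshold = decay, threshold
        self.ema = np.zeros((resolution,) * 3)

    def voxel_index(self, points):
        """Map points in [-bound, bound]^3 to integer voxel indices."""
        idx = ((points + self.bound) / (2 * self.bound) * self.res).astype(int)
        return np.clip(idx, 0, self.res - 1)

    def update(self, points, density):
        """Exponential moving average of the max density observed per voxel."""
        i, j, k = self.voxel_index(points).T
        observed = np.zeros_like(self.ema)
        np.maximum.at(observed, (i, j, k), density)
        self.ema = self.decay * self.ema + (1 - self.decay) * observed

    def keep_mask(self, points):
        """True for samples that fall in voxels currently marked as occupied."""
        i, j, k = self.voxel_index(points).T
        return self.ema[i, j, k] > self.threshold

grid = OccupancyGrid()
# Query density at every voxel center; here a thin spherical shell of radius 0.5.
centers = (np.indices((64,) * 3).reshape(3, -1).T + 0.5) / 64 * 2 - 1
shell_density = np.where(np.abs(np.linalg.norm(centers, axis=1) - 0.5) < 0.05, 10.0, 0.0)
for _ in range(3):
    grid.update(centers, shell_density)

# Samples along a ray through the scene center: only those near the shell survive.
z = np.linspace(-1, 1, 128)
ray_samples = np.stack([np.zeros(128), np.zeros(128), z], axis=1)
kept = grid.keep_mask(ray_samples)
```

Only the samples that pass `keep_mask` would be sent through the MLPs, which is where the savings come from.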
Training and inference
- During training, HelixSurf randomly samples pixels from the images.
- The camera rays passing through these pixels are divided into two sets: R and R.
- The MLP based functions f and c are optimized with a loss that includes an Eikonal term and is weighted by three hyperparameters.
- During inference, the marching cubes algorithm is used to extract the underlying surface from the learned SDF f.
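The Eikonal term encourages the learned function to behave like a true distance field, whose gradient has unit norm. The sketch below evaluates it with central differences on analytic fields; in an actual training loop the gradient would come from automatic differentiation, and the function names here are hypothetical.

```python
import numpy as np

def eikonal_loss(sdf_fn, points, h=1e-4):
    """Mean of (||grad f|| - 1)^2, with the gradient taken by central differences."""
    grads = np.zeros_like(points)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = h
        grads[:, axis] = (sdf_fn(points + offset) - sdf_fn(points - offset)) / (2 * h)
    return float(np.mean((np.linalg.norm(grads, axis=1) - 1.0) ** 2))

def sphere_sdf(p):
    """Exact signed distance to a sphere of radius 0.5."""
    return np.linalg.norm(p, axis=1) - 0.5

def scaled_field(p):
    """Same zero-level set as the sphere, but the gradient norm is 2, not 1."""
    return 2.0 * sphere_sdf(p)

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, (1024, 3))
loss_true = eikonal_loss(sphere_sdf, pts)      # close to 0
loss_scaled = eikonal_loss(scaled_field, pts)  # close to 1
```

Both fields describe the same surface, but only the first is a valid SDF, which is the property the Eikonal loss enforces. For mesh extraction, a library routine such as scikit-image's `measure.marching_cubes` can be run on the SDF sampled over a regular grid.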
Experiments
- Experiments conducted on ScanNet and Tanks and Temples datasets
- Implementation of HelixSurf in PyTorch framework with CUDA extensions
- Adam optimizer used with learning rate of 1e-3
- 5000 rays sampled for each iteration
- Evaluation metrics for 3D reconstruction and MVS predictions
Comparisons
- Reconstruction metrics comparisons on ScanNet [7]
- HelixSurf surpasses existing methods in almost every metric
- HelixSurf produces better details of objects than those methods using auxiliary training data
- HelixSurf improves learning efficiency by orders of magnitude
- HelixSurf optimized with iterative intertwined regularization
- Sampling guided by dynamic occupancy grids
- MVS predictions effectively promote surface learning
- Leverage homogeneity inside individual superpixels to handle textureless surface areas
- Adaptive K-means clustering algorithm to extract principal normals
- Mesh-guided consistency on clustered normal maps
- Adaptively guide point sampling along rays by maintaining dynamic occupancy grids
- Initialize textureless surface areas with normals generated with Manhattan assumption
- PatchMatch based multi-view stereo (PM-MVS) method
- Ray casting technique
- Textureless triangle faces pruning
- Evaluation metrics: Accuracy, Completeness, Precision, Recall, and F-score
- Evaluation metrics for depth and normal map
- Results on ScanNet, Tanks & Temples, and DTU datasets
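The reconstruction metrics listed above are commonly computed between points sampled from the predicted and ground-truth surfaces. The sketch below follows the usual definitions; the 0.05 distance threshold and the brute-force nearest-neighbour search are assumptions chosen for brevity, not the paper's confirmed protocol.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst (brute force)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def reconstruction_metrics(pred, gt, threshold=0.05):
    d_pred_to_gt = nearest_distances(pred, gt)
    d_gt_to_pred = nearest_distances(gt, pred)
    precision = float((d_pred_to_gt < threshold).mean())
    recall = float((d_gt_to_pred < threshold).mean())
    total = precision + recall
    return {
        "accuracy": float(d_pred_to_gt.mean()),      # how close predictions are to GT
        "completeness": float(d_gt_to_pred.mean()),  # how well GT is covered
        "precision": precision,
        "recall": recall,
        "f_score": 0.0 if total == 0 else 2 * precision * recall / total,
    }

# Ground truth: a 10 x 10 grid of points on the plane z = 0, spaced 0.1 apart.
xs, ys = np.meshgrid(np.arange(10) * 0.1, np.arange(10) * 0.1)
gt_points = np.stack([xs.ravel(), ys.ravel(), np.zeros(100)], axis=1)
# Prediction: the same plane, offset by 0.01 along z.
pred_points = gt_points + np.array([0.0, 0.0, 0.01])

metrics = reconstruction_metrics(pred_points, gt_points)
```

For real meshes with many points, a spatial index such as a k-d tree would replace the brute-force search.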