Abstract
- Multiplane Image (MPI) is an effective and efficient representation for view synthesis from sparse inputs.
- Structural MPI (S-MPI) approximates 3D scenes concisely and bridges view synthesis and 3D reconstruction.
- Challenges include high-fidelity approximation, multi-view consistency, non-planar regions modeling, and efficient rendering.
- Transformer-based network proposed to predict compact and expressive S-MPI layers.
- Experiments show method outperforms previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.
Paper Content
Introduction
- Aim to generate new images from transformed viewpoints
- Advance of neural networks drives progress
- NeRF-based methods have limitations
- MPI representation has superior abilities
- Neural networks used to construct MPI layers
- Novel views rendered in real-time
- Standard MPI has underlying limitations
- Sensitive to discretization and introduces redundancy
- Construct MPIs adaptive to 3D scenes
- End-to-end planar reconstruction from images
- Introduce Structural MPI representation
- End-to-end network to construct S-MPI
- Global proxy embeddings encode full 3D scene
Related works
- View synthesis with explicit representations
- Layered Depth Image (LDI) uses several layers of depth maps and associated color values
- Multiplane Image (MPI) is a popular variant of LDI
- Structural MPI overcomes the MPI’s drawbacks
- View synthesis with implicit representations
- NeRF encodes 3D objects and scenes in the weights of an MLP
- Planar 3D Approximation
- Piece-wise planar depth map reconstruction is a traditional research topic
- Detection-based framework generates plane segments and 3D plane parameters
Structural multiplane image
- Introduces structural multiplane image representation
- Geometry formulation based on standard MPI formulation
Geometry formulation
- MPI consists of a collection of planes parallel to the image plane
- Non-planar regions form their own structure class, separate from the posed planes that fit planar surfaces
- Fronto-parallel MPI is a special case of S-MPI where all planes have the same normal as the image plane
- Non-planar regions are distributed into nearby fronto-parallel planes
- S-MPI contains hybrid structures of geometrically faithful planes and depth-adaptive fronto-parallel planes
- Final S-MPI for the complete scene is extended with N_n elements with a fixed normal n_z
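The geometry above can be made concrete with a small sketch that computes the depth of one plane at every pixel. It assumes a pinhole camera with intrinsics K and a plane written as n · X = offset in camera coordinates; this sign convention, the function name `plane_depth_map`, and the toy intrinsics are illustrative assumptions, not the paper's notation. With n = (0, 0, 1) the plane is fronto-parallel and the depth is constant, which is the standard MPI special case.

```python
import numpy as np

def plane_depth_map(normal, offset, K, height, width):
    """Depth of the plane n . X = offset at every pixel of a pinhole camera.

    A pixel p = (u, v, 1) back-projects to X = z * K^{-1} p, so
    z = offset / (n . K^{-1} p). Convention and names are assumptions.
    """
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pixels @ np.linalg.inv(K).T           # K^{-1} p for each pixel
    denom = rays @ np.asarray(normal, dtype=np.float64)  # n . K^{-1} p, shape (H, W)
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = offset / denom
    # Rays that never hit the plane in front of the camera get infinite depth.
    return np.where(np.isfinite(depth) & (depth > 0), depth, np.inf)

# Toy intrinsics (hypothetical values): focal length 100, principal point (2, 2).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])

# Fronto-parallel plane: same normal as the image plane, constant depth.
front = plane_depth_map([0.0, 0.0, 1.0], 2.0, K, 4, 4)

# Slanted plane: depth varies across the image columns.
slanted = plane_depth_map(np.array([1.0, 0.0, 1.0]) / np.sqrt(2.0), np.sqrt(2.0), K, 4, 4)
```

The fronto-parallel case returns a constant map, while the slanted plane gets nearer toward the right of the image, which is why posed planes need per-pixel depth rather than a single depth per layer.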
Rendering formulation
- Standard MPI has a global back-to-front rendering order of planes for each pixel
- S-MPI has different rendering orders for pixels due to planes intersecting with each other
- Calculate depth values for each pixel on each plane
- Rearrange RGBα images with the depth order for each pixel
- Render novel views by transforming plane parameters from the source view
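The per-pixel ordering described above can be sketched as follows. This is a minimal illustration, assuming each plane already has an RGB image, an alpha map, and a per-pixel depth map in the target view; the function name `composite_per_pixel` and the array layout are assumptions for this example. It sorts layers near-to-far at every pixel and applies front-to-back "over" compositing, which gives the same result as back-to-front blending.

```python
import numpy as np

def composite_per_pixel(rgb, alpha, depth):
    """Alpha-composite N layers with an independent depth order per pixel.

    rgb:   (N, H, W, 3) colors of each layer
    alpha: (N, H, W)    opacities in [0, 1]
    depth: (N, H, W)    depth of each layer at each pixel
    Returns an (H, W, 3) image.
    """
    order = np.argsort(depth, axis=0)  # nearest layer first, per pixel
    rgb_sorted = np.take_along_axis(rgb, order[..., None], axis=0)
    alpha_sorted = np.take_along_axis(alpha, order, axis=0)
    # Transmittance of a layer = product of (1 - alpha) over all nearer layers.
    through = np.cumprod(1.0 - alpha_sorted, axis=0)
    through = np.concatenate([np.ones_like(through[:1]), through[:-1]], axis=0)
    weights = alpha_sorted * through
    return (weights[..., None] * rgb_sorted).sum(axis=0)

# Two opaque planes that intersect: plane 0 (red) is nearer at the left pixel,
# plane 1 (blue) is nearer at the right pixel.
rgb = np.zeros((2, 1, 2, 3))
rgb[0, ..., 0] = 1.0  # red layer
rgb[1, ..., 2] = 1.0  # blue layer
alpha = np.ones((2, 1, 2))
depth = np.array([[[1.0, 3.0]],
                  [[2.0, 2.0]]])
image = composite_per_pixel(rgb, alpha, depth)
```

A single global layer order would render one of the two pixels incorrectly here, which is the case the per-pixel rearrangement is meant to handle.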
Structural multiplane image transformer
- Goal is to construct S-MPI representation for novel view synthesis and planar reconstruction
- Model predicts S-MPI with appropriate number of posed planes and RGBα layers
- Model inspired by neural planar reconstruction methods
- Model differentiates non-planar instances according to their depth range
- Model predicts proxies with structure class and visible projected mask of the plane
Single-view network
- Model is built on a universal image segmentation network with two branches
- One branch upsamples features and generates high-resolution per-pixel embeddings
- The other branch generates N proxy embeddings at the instance level
- Loss function jointly optimizes view synthesis and planar estimation
- Loss function includes RGB L1 loss, SSIM loss, cross-entropy loss, focal loss, dice loss, and L1 loss
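Three of the listed loss terms can be sketched in a few lines. The forms below are the common textbook definitions of L1, dice, and focal loss; the smoothing constant, the focal parameters `gamma` and `alpha`, and how the terms are weighted against each other are assumptions here, since the summary does not state them. SSIM and cross-entropy are left out to keep the sketch short.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error, e.g. between rendered and ground-truth RGB."""
    return float(np.mean(np.abs(pred - target)))

def dice_loss(prob, target, smooth=1.0):
    """1 - dice coefficient between a soft mask and a binary mask."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(prob) + np.sum(target) + smooth))

def focal_loss(prob, target, gamma=2.0, alpha=0.25, tiny=1e-12):
    """Binary focal loss: cross-entropy down-weighted on easy pixels."""
    p_t = np.where(target == 1, prob, 1.0 - prob)    # probability of the true class
    a_t = np.where(target == 1, alpha, 1.0 - alpha)  # class-balancing weight
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t + tiny)))

gt_mask = np.array([1.0, 1.0, 0.0, 0.0])
good_mask = np.array([0.9, 0.8, 0.1, 0.2])  # mostly correct prediction
bad_mask = np.array([0.2, 0.1, 0.9, 0.8])   # mostly wrong prediction

rgb_term = l1_loss(np.array([0.0, 1.0]), np.array([0.0, 0.0]))
dice_perfect = dice_loss(gt_mask, gt_mask)
dice_good, dice_bad = dice_loss(good_mask, gt_mask), dice_loss(bad_mask, gt_mask)
focal_good, focal_bad = focal_loss(good_mask, gt_mask), focal_loss(bad_mask, gt_mask)
```

In a joint objective these terms would be summed with per-term weights, so that the rendering terms and the plane mask terms are optimized together.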
Multi-view network
- Multi-view input can enlarge the range of view synthesis
- Planar reconstructions generated by single-view methods are not aligned in 3D space
- Goal is to deliver multi-view consistent planar reconstruction
- Network design uses global proxy embeddings across views
- Multi-view alignment uses global plane poses to align instances in each view
- RGBα embeddings and segmentation embeddings are shared globally
- Image merging fuses rendered images with alpha weights as confidence maps
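The image merging step can be illustrated with a normalized weighted average. This is a sketch under assumptions: each source view has already been rendered into the target view along with an accumulated alpha map, and the fusion rule is a per-pixel average weighted by those alphas. The function name `merge_renderings` and the exact normalization are illustrative; the summary only states that alpha weights act as confidence maps.

```python
import numpy as np

def merge_renderings(rgb_views, alpha_views, eps=1e-8):
    """Fuse T renderings of the same target view.

    rgb_views:   (T, H, W, 3) images rendered from each source view
    alpha_views: (T, H, W)    accumulated alpha, used as per-pixel confidence
    Returns an (H, W, 3) image.
    """
    weights = alpha_views / (alpha_views.sum(axis=0, keepdims=True) + eps)
    return (weights[..., None] * rgb_views).sum(axis=0)

# Two views (T = 2), one row of two pixels: view 0 renders red, view 1 renders blue.
rgb_views = np.zeros((2, 1, 2, 3))
rgb_views[0, ..., 0] = 1.0
rgb_views[1, ..., 2] = 1.0
# View 0 is confident at the left pixel and has no coverage at the right pixel.
alpha_views = np.array([[[0.75, 0.0]],
                        [[0.25, 1.0]]])
merged = merge_renderings(rgb_views, alpha_views)
```

Pixels that one view cannot see (alpha near zero) are filled from the other view, which is how multi-view input can enlarge the renderable range.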
Experiments
- S-MPI is evaluated on view synthesis and planar reconstruction in both single-view and multi-view settings.
- S-MPI is tested on two datasets, NYUv2 and ScanNet, and outperforms the compared methods.
Implementation
- Ground truth labels of plane instance masks and plane parameters used from [25]
- S-MPI layers for non-planar regions sampled according to the depth range of the scene
- Images resized to 256x384 for training and evaluation
- ResNet50 used as backbone
- Adam Optimizer used with initial learning rate of 0.0001 and weight decay of 0.05
- Bootstrap training phase of 50k steps
- Model trained on 4 NVIDIA-V100 GPUs for 100k steps
- Multi-view input with T=2
Single-view evaluation
- Planar reconstruction methods are evaluated using planar estimation metrics and segmentation metrics
- Depth estimation methods are compared with planar reconstruction methods and MPI-based view synthesis methods
- View synthesis methods are compared with standard MPI-based methods
- Results show that the proposed method achieves better reconstruction and view synthesis performance
Multi-view evaluation
- Compared with PlaneMVS on plane detection and depth reconstruction accuracy
- Measured plane detection by Average Precision with IoU 0.5 and varying depth error
- Compared single-view planar segmentation results on ScanNet
- Followed data settings from DP-NeRF to sample images sparsely and generate dense depth maps
- Trained on two nearest images and compared results to MPI-based and NeRF-based methods
- Outperformed MPI-based methods and achieved comparable results to NeRF-based methods
- Compared rendering speed with NeRF-based method
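The plane detection criterion mentioned above (IoU 0.5 with varying depth error) can be sketched as a true-positive test for one predicted plane against one ground-truth plane. This is only the matching step that an Average Precision computation would build on, not the full metric; the function names, the use of mean absolute depth error over the overlap, and the example thresholds are assumptions for illustration.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_mask, pred_depth, gt_mask, gt_depth,
                     depth_thresh, iou_thresh=0.5):
    """A prediction counts if it overlaps enough and its depth is close enough."""
    if mask_iou(pred_mask, gt_mask) <= iou_thresh:
        return False
    overlap = np.logical_and(pred_mask, gt_mask)
    depth_err = np.abs(pred_depth[overlap] - gt_depth[overlap]).mean()
    return bool(depth_err < depth_thresh)

# Ground-truth plane covers the left three columns; the prediction covers all four.
gt_mask = np.zeros((2, 4), dtype=bool)
gt_mask[:, :3] = True
pred_mask = np.ones((2, 4), dtype=bool)
gt_depth = np.full((2, 4), 2.0)
pred_depth = np.full((2, 4), 2.1)  # 0.1 depth error everywhere

iou = mask_iou(pred_mask, gt_mask)
# Hypothetical depth thresholds; a stricter threshold rejects the same prediction.
hits = [is_true_positive(pred_mask, pred_depth, gt_mask, gt_depth, t)
        for t in (0.05, 0.2, 0.6)]
```

Sweeping the depth threshold while holding the IoU threshold fixed shows how the same detection can pass at loose tolerances and fail at strict ones.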
Ablation study
- Multi-view model trained on input image pairs fewer than 20 frames apart, with the view synthesis target fewer than 30 frames from the midpoint of the two input images
- Global proxy embedding strategy produces better aligned images than processing multi-view images one by one
- Method benefits from consistent multi-view geometry supervision and produces more accurate planar reconstruction than when given two identical images
- Performance drops if view gap is too large
- Method performs better in scenes with more planar coverage
Conclusion
- Introduction of Structural MPI (S-MPI) representation for neural view synthesis and 3D reconstruction
- End-to-end model proposed to construct S-MPI
- Global alignment scheme to generate aligned images for view synthesis
- Limitations and future work
- Need ground-truth plane segmentation and poses
- Capability to use multiple parallel layers to simulate non-Lambertian effects
- Figures 1-8 illustrate the S-MPI representation, its formulation, rendering, the transformer, embeddings, reconstruction, view synthesis, and the ablation study
- Quantitative view synthesis results on ScanNet