Abstract

  • Multiplane Image (MPI) is an effective and efficient representation for view synthesis from sparse inputs.
  • Structural MPI (S-MPI) approximates 3D scenes concisely and bridges view synthesis and 3D reconstruction.
  • Challenges include high-fidelity approximation, multi-view consistency, non-planar regions modeling, and efficient rendering.
  • Transformer-based network proposed to predict compact and expressive S-MPI layers.
  • Experiments show method outperforms previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods.

Paper Content

Introduction

  • Aim to generate new images from transformed viewpoints
  • Advance of neural networks drives progress
  • NeRF-based methods have limitations
  • MPI representation has superior abilities
  • Neural networks used to construct MPI layers
  • Novel views rendered in real-time
  • Standard MPI has underlying limitations
  • Sensitive to discretization and introduces redundancy
  • Construct MPIs adaptive to 3D scenes
  • End-to-end planar reconstruction from images
  • Introduce Structural MPI representation
  • End-to-end network to construct S-MPI
  • Global proxy embeddings encode full 3D scene
  • View synthesis with explicit representations
  • Layered Depth Image (LDI) uses several layers of depth maps and associated color values
  • Multiplane Image (MPI) is a popular variant of LDI
  • Structural MPI overcomes the MPI’s drawbacks
  • View synthesis with implicit representations
  • NeRF encodes 3D objects and scenes in the weights of an MLP
  • Planar 3D Approximation
  • Piece-wise planar depth map reconstruction is a traditional research topic
  • Detection-based framework generates plane segments and 3D plane parameters

Structural multiplane image

  • Introduces structural multiplane image representation
  • Geometry formulation based on standard MPI formulation

Geometry formulation

  • MPI consists of a collection of planes parallel to the image plane
  • Fronto-parallel MPI is a special case of S-MPI where all planes have the same normal as the image plane
  • Non-planar regions, which no single scene plane fits well, are distributed into nearby fronto-parallel planes
  • S-MPI contains hybrid structures of geometrically faithful planes and depth-adaptive fronto-parallel planes
  • The final S-MPI for the complete scene is extended with N_n elements that share a fixed normal n_z
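
The geometry above can be made concrete with a small ray-plane intersection. This is a minimal sketch, not the paper's code: it assumes a plane is written as normal . X = offset in camera coordinates, a pinhole camera looking along +z, and hypothetical intrinsics fx, fy, cx, cy. The function names are illustrative.

```python
def pixel_ray(u, v, fx, fy, cx, cy):
    """Back-project pixel (u, v) to the ray K^-1 [u, v, 1]^T (z-component is 1)."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def plane_depth(normal, offset, ray, eps=1e-8):
    """Depth z at which the point z * ray meets the plane normal . X = offset.

    Returns None when the ray is (nearly) parallel to the plane or the
    intersection lies behind the camera.
    """
    denom = sum(n * r for n, r in zip(normal, ray))
    if abs(denom) < eps:
        return None
    z = offset / denom
    return z if z > 0 else None


# Hypothetical intrinsics, chosen only for illustration.
FX, FY, CX, CY = 100.0, 100.0, 192.0, 128.0

# A fronto-parallel plane (normal along z) has the same depth at every pixel.
fronto = plane_depth((0.0, 0.0, 1.0), 2.0, pixel_ray(10.0, 20.0, FX, FY, CX, CY))

# A slanted plane has a depth that varies with the pixel.
slanted_left = plane_depth((0.6, 0.0, 0.8), 2.0, pixel_ray(92.0, 128.0, FX, FY, CX, CY))
slanted_right = plane_depth((0.6, 0.0, 0.8), 2.0, pixel_ray(292.0, 128.0, FX, FY, CX, CY))
```

With a normal of (0, 0, 1) the depth equals the plane offset everywhere, which is the fronto-parallel special case described in the list above.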

Rendering formulation

  • Standard MPI has a global back-to-front rendering order of planes for each pixel
  • S-MPI has different rendering orders for pixels due to planes intersecting with each other
  • Calculate depth values for each pixel on each plane
  • Rearrange RGBα images with the depth order for each pixel
  • Render novel views by transforming plane parameters from the source view
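
The rendering steps above can be sketched for a single pixel. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses the standard back-to-front "over" operator, takes per-plane depths at the pixel as given, and writes a plane as normal . X = offset with a rigid transform X' = R X + t between views. All function names are hypothetical.

```python
def composite_pixel(samples):
    """Over-composite one pixel from (depth, rgb, alpha) samples.

    Samples are sorted far-to-near for this pixel only, so two
    intersecting planes may swap order between neighbouring pixels.
    """
    color = [0.0, 0.0, 0.0]
    for _depth, rgb, alpha in sorted(samples, key=lambda s: s[0], reverse=True):
        color = [alpha * c + (1.0 - alpha) * acc for c, acc in zip(rgb, color)]
    return color


def transform_plane(normal, offset, R, t):
    """Map the plane normal . X = offset through X' = R X + t.

    The transformed plane is (R normal) . X' = offset + (R normal) . t.
    """
    new_normal = tuple(sum(R[i][j] * normal[j] for j in range(3)) for i in range(3))
    new_offset = offset + sum(n * ti for n, ti in zip(new_normal, t))
    return new_normal, new_offset


RED, BLUE = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)

# A half-transparent red sample in front of an opaque blue one; the input
# order does not matter because sorting happens per pixel.
mixed = composite_pixel([(2.0, BLUE, 1.0), (1.0, RED, 0.5)])

# Moving the camera frame 1 unit along z shifts a fronto-parallel plane's offset.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
moved_normal, moved_offset = transform_plane((0.0, 0.0, 1.0), 2.0, IDENTITY, (0.0, 0.0, 1.0))
```

Sorting inside `composite_pixel` rather than once per image is what distinguishes this from standard MPI rendering, where a single global plane order serves every pixel.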

Structural multiplane image transformer

  • Goal is to construct S-MPI representation for novel view synthesis and planar reconstruction
  • Model predicts S-MPI with appropriate number of posed planes and RGBα layers
  • Model inspired by neural planar reconstruction methods
  • Model differentiates non-planar instances according to their depth range
  • Model predicts proxies with structure class and visible projected mask of the plane

Single-view network

  • Model is built on a universal image segmentation network with two branches
  • One branch upsamples features and generates high-resolution per-pixel embeddings
  • The other branch generates N proxy embeddings at the instance level
  • Loss function jointly optimizes view synthesis and planar estimation
  • Loss function includes RGB L1 loss, SSIM loss, cross-entropy loss, focal loss, dice loss, and L1 loss
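
Two of the listed terms, focal loss and dice loss, can be sketched on a flattened binary mask. This is a hedged illustration only: the summary does not say which prediction each term supervises or how the terms are weighted, so applying them to a mask and the weights passed to the hypothetical `total_loss` helper are assumptions.

```python
import math


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma=0 reduces to binary cross-entropy."""
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        p_t = min(max(p_t, eps), 1.0 - eps)     # clamp to avoid log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - Dice coefficient between soft predictions and a binary target."""
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def total_loss(terms, weights):
    """Weighted sum of named loss terms (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in terms.items())


pred = [0.9, 0.2, 0.7, 0.1]
mask = [1, 0, 1, 0]
loss = total_loss(
    {"focal": focal_loss(pred, mask), "dice": dice_loss(pred, mask)},
    {"focal": 1.0, "dice": 1.0},
)
```

Focal loss down-weights pixels that are already classified well, while dice loss scores the overlap of the whole mask, which is one common reason to pair them for segmentation.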

Multi-view network

  • Multi-view input can enlarge the range of view synthesis
  • Single-view methods generating planar reconstructions are not aligned in 3D space
  • Goal is to deliver multi-view consistent planar reconstruction
  • Network design uses global proxy embeddings across views
  • Multi-view alignment uses global plane poses to align instances in each view
  • RGBα embeddings and segmentation embeddings are shared globally
  • Image merging fuses rendered images with alpha weights as confidence maps
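
The image merging step can be pictured as a per-pixel weighted average. The sketch below assumes the weights are normalized so they sum to one at each pixel; the summary only states that alpha weights act as confidence maps, so the normalization, the zero-confidence fallback, and the function names are assumptions.

```python
def merge_pixel(colors, alphas, eps=1e-8):
    """Blend the colors rendered from each source view at one pixel,
    using each view's alpha as its confidence."""
    total = sum(alphas)
    if total < eps:
        return [0.0, 0.0, 0.0]  # no view is confident here (assumed fallback)
    return [sum(a * c[k] for a, c in zip(alphas, colors)) / total for k in range(3)]


def merge_images(images, alpha_maps):
    """Merge same-sized images given as nested lists [row][col] -> (r, g, b)."""
    height, width = len(images[0]), len(images[0][0])
    return [
        [
            merge_pixel([img[y][x] for img in images], [a[y][x] for a in alpha_maps])
            for x in range(width)
        ]
        for y in range(height)
    ]


# Two 1x2 renderings of the same target view, one from each source image.
view_a = [[(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
view_b = [[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]]
alpha_a = [[0.75, 0.0]]
alpha_b = [[0.25, 1.0]]
merged = merge_images([view_a, view_b], [alpha_a, alpha_b])
```

At the second pixel the first view has zero confidence, so the merged color comes entirely from the second view.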

Experiments

  • S-MPI is effective for view synthesis and reconstruction in both single-view and multi-view settings.
  • S-MPI is evaluated on two datasets, NYUv2 and ScanNet, where it outperforms previous MPI-based view synthesis methods and planar reconstruction methods.

Implementation

  • Ground truth labels of plane instance masks and plane parameters used from [25]
  • For non-planar regions, S-MPI layers sampled according to the depth range of the scene
  • Images resized to 256x384 for training and evaluation
  • ResNet50 used as backbone
  • Adam Optimizer used with initial learning rate of 0.0001 and weight decay of 0.05
  • Bootstrap training phase of 50k steps
  • Model trained on 4 NVIDIA-V100 GPUs for 100k steps
  • Multi-view input with T=2
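
The settings listed above can be collected in one configuration object. The values come from the list; the class name, the field names, and the use of a dataclass are hypothetical, and the summary does not say whether the 50k bootstrap steps are counted within the 100k training steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as stated in the summary; field names are illustrative."""
    backbone: str = "resnet50"
    image_size: tuple = (256, 384)   # written as "256x384"; axis order not stated
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    weight_decay: float = 0.05
    bootstrap_steps: int = 50_000    # relation to total_steps not stated
    total_steps: int = 100_000
    num_gpus: int = 4                # NVIDIA V100
    num_views: int = 2               # T, the number of input views


cfg = TrainConfig()
```

Freezing the dataclass keeps the recorded settings from being changed by accident during a run.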

Single-view evaluation

  • Planar reconstruction methods are evaluated using planar estimation metrics and segmentation metrics
  • Depth estimation methods are compared with planar reconstruction methods and MPI-based view synthesis methods
  • View synthesis methods are compared with standard MPI-based methods
  • Results show that the proposed method achieves better reconstruction and view synthesis performance

Multi-view evaluation

  • Compared with PlaneMVS on plane detection and depth reconstruction accuracy
  • Measured plane detection by Average Precision with IoU 0.5 and varying depth error
  • Compared single-view planar segmentation results on ScanNet
  • Followed data settings from DP-NeRF to sample images sparsely and generate dense depth maps
  • Trained on two nearest images and compared results to MPI-based and NeRF-based methods
  • Outperformed MPI-based methods and achieved comparable results to NeRF-based methods
  • Compared rendering speed with NeRF-based method
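
The plane detection metric above combines a mask-overlap test with a depth-error test. The sketch below is a simplified illustration, not the benchmark's protocol: it assumes one scalar depth per plane, a strict IoU threshold of 0.5, and a non-interpolated Average Precision. All function names are hypothetical.

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def is_true_positive(pred_mask, gt_mask, pred_depth, gt_depth,
                     depth_thresh, iou_thresh=0.5):
    """A detection counts when the mask overlaps enough and the depth is close enough."""
    overlaps = mask_iou(pred_mask, gt_mask) > iou_thresh
    return overlaps and abs(pred_depth - gt_depth) < depth_thresh


def average_precision(flags, num_gt):
    """Non-interpolated AP from TP/FP flags sorted by descending confidence."""
    hits = 0
    precision_sum = 0.0
    for rank, flag in enumerate(flags, start=1):
        if flag:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_gt if num_gt else 0.0


pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 1], [1, 0]]

# The same detection passes a loose depth threshold and fails a tight one.
loose = is_true_positive(pred_mask, gt_mask, 2.05, 2.0, depth_thresh=0.1)
tight = is_true_positive(pred_mask, gt_mask, 2.05, 2.0, depth_thresh=0.01)
ap = average_precision([True, False, True], num_gt=2)
```

Sweeping `depth_thresh` while holding the IoU threshold fixed gives one AP value per depth tolerance, which matches the "varying depth error" wording in the list.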

Ablation study

  • Multi-view model trained on input image pairs less than 20 frames apart, with the view synthesis target less than 30 frames from the midpoint of the two input images
  • Global proxy embedding strategy produces better aligned images than processing multi-view images one by one
  • Method benefits from consistent multi-view geometry supervision and produces more accurate planar reconstruction than when given two identical images
  • Performance drops if view gap is too large
  • Method performs better in scenes with more planar coverage

Conclusion

  • Introduction of Structural MPI (S-MPI) representation for neural view synthesis and 3D reconstruction
  • End-to-end model proposed to construct S-MPI
  • Global alignment scheme to generate aligned images for view synthesis
  • Limitations and future work
  • Need ground-truth plane segmentation and poses
  • Capability to use multiple parallel layers to simulate non-Lambertian effects
  • Figures 1-8 illustrate S-MPI, its formulation, rendering, the transformer, embedding, reconstruction, synthesis, and the ablation study
  • Quantitative view synthesis results on ScanNet