Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Compositional representations of the world are a promising step towards enabling high-level scene understanding.
  • Learning such representations for complex scenes and tasks is an open challenge.
  • NRC is a method for learning object-centric representations through novel view reconstruction.
  • NRC representations transfer well to object navigation in THOR.
  • NRC performs unsupervised segmentation better than prior methods.
  • NRC improves on the task of depth ordering by 5.5% accuracy in THOR.

Paper Content

Introduction

  • Human perception and reasoning involves parsing the world into objects
  • Object-centric representations enable us to infer attributes such as geometry, affordances, and physical properties of objects
  • Learning models of the world without explicit supervision is an open challenge
  • Recent work focuses on reconstructing images from sparse encodings as an objective for learning object-centric representations
  • Neural rendering has enabled learning geometric representations of objects from 2D images
  • Recent work has leveraged scene reconstruction from different views as a source of supervision for learning object-centric representations
  • Neural Radiance Codebooks (NRC) learns a codebook of object categories which are composed to explain the appearance of 3D scenes from multiple views
  • NRC resolves the limitations of current 3D object-centric methods
  • Object-centric learning aims to build models of the world from building blocks
  • Prior works have demonstrated the potential to disentangle objects from images
  • Recent work has shown novel view reconstruction to be a promising approach for disentangling object representations
  • Neural rendering advances have enabled a host of new applications
  • Dictionary/codebook learning involves learning a specific set of atoms or codes
  • Object-centric codebooks help in semantic grounding for transfer between category instances

Method

  • Goal is to discover object categories without supervision
  • Learn priors over geometry and visual appearance
  • Model variation between instances belonging to each group
  • Explain all views of scene given set of object-codes
  • Use convolutional network to obtain spatial feature map
  • Project feature map to novel view using relative camera matrix
  • Assign feature vectors to categorical latent codes from codebook
  • Transform categorical codes to fine-grained instance codes
  • Render scene from novel view using volumetric renderer
  • Compare rendered image to ground truth using L2 pixel loss
  • Use nearest-neighbor assignment during inference
  • Use softmax relaxation of nearest-neighbors during backward pass
  • Use step functions with straight-through-estimator to add elements to codebook
  • Model intra-code variation in latent space using encoder
  • Render scene in novel view and compare to ground truth
  • Use MLP conditioned on instance codes to render scene
  • Evaluate learned representation on downstream tasks

Input image

Nrc ground truth

  • Unsupervised segmentation of real-world images
  • Vector assigned to nearest categorical code to obtain segmentation mask
  • NRC representation replaces pretrained network
  • Depth ordering task predicts which of two objects is closer to camera
  • MLP conditioned on instance codes to predict depth map

Experiments

  • NRC improves performance on unsupervised segmentation, object navigation, and depth ordering.
  • Prior works focused on unsupervised segmentation to measure decomposition quality.

Datasets

  • THOR consists of interactive home environments built in the Unity game engine
  • RoboTHOR is a variant of THOR used for sim2real benchmarking
  • Object navigation task consists of agent locating specified objects
  • RoboTHOR consists of 89 indoor scenes
  • ProcTHOR consists of procedurally generated indoor scenes
  • CLEVR-3D is a synthetic dataset used for unsupervised segmentation
  • ARI measures agreement between predicted and ground truth segmentations

Nyu depth

  • NYU Depth Dataset consists of images from real-world indoor scenes with depth and segmentation maps
  • Methods are trained on ProcThor dataset and evaluated on NYU Depth
  • NYU Depth chosen because it has similar object categories and scene layouts to THOR
  • Results reported using adjusted random index (ARI)

Results

  • NRC outperforms the best baseline by 3% in success rate and 20% in SPL
  • NRC has 18.8% improvement in SR and relative SPL over uORF and ObSuRF
  • NRC performs better than uORF and ObSuRF when navigating near furniture and other immovable objects

Object navigation

  • Experiments conducted in THOR to understand how well learned representations transfer from observational data to embodied navigation
  • Object navigation consists of an agent with the goal of moving through indoor scenes to specified objects
  • Dataset consists of 1.5 million video frames from 500 indoor scenes
  • Representations frozen following standard practice
  • Policy trained using DD-PPO for 200M steps
  • Success rate and success weighted by path length reported
  • Baselines include ObSuRF, uORF, Video MoCo, and EmbedCLIP

Depth ordering

  • Evaluated NRC on task of ordering objects based on depth from camera
  • Used ProcTHOR test dataset which provides dense depth and segmentation maps
  • Determined ground truth depth by computing mean depth of pixels associated with ground truth segmentation mask
  • Goal is to predict which object is closer
  • Evaluated 2,000 object pairs
  • Reported accuracy as number of correct orderings over total number of object pairs

Method

  • Depth ordering requires accurate segmentation and depth estimation
  • NRC has 5.5% and 10.3% higher depth ordering accuracy than ObSuRF and uORF respectively
  • NRC has better fine-grained localization of categorical latent codes which leads to better depth ordering
  • Ablation study shows that performance improves by โˆผ 9% when intra-class variation is modeled

Ablation study

  • Learning the number of codes improves performance
  • Performance is matched if number of codes is found via hyper-parameter tuning
  • Differentiably learning the codebook size avoids computationally expensive hyper-parameter tuning

Limitations

  • Novel view reconstruction requires camera pose, which is not always available.
  • Some datasets provide data from inertial measurement units, but this is prone to drift.
  • Most object-centric prior works assume scenes are static, but this is rarely the case.
  • NRC is relatively efficient compared to other NeRF based methods, but still compute and memory intensive.
  • Object-centric light fields can reduce memory and compute costs.
  • Finding appropriate corresponding frames of videos is a challenge.

Conclusion

  • Compositional, object-centric understanding of the world is a fundamental characteristic of human vision.
  • Neural Radiance Field Codebooks (NRC) is a new approach for learning geometry-aware, object-centric representations.
  • NRC finds reoccurring geometric and visual similarities to form objects categories.
  • NRC improves performance on object navigation and depth ordering compared to strong baselines.
  • NRC is capable of scaling to complex scenes with more objects and greater diversity.
  • NRC shows relative ARI improvement over baselines for unsupervised segmentation.
  • NRC performs slightly worse on MatterPort3D compared to on THOR.
  • NRC outperforms ObSuRF and ImageNet pretraining on the Habitat Point Navigation Challenge.
  • NRC is trained with a sliding window of 5 frames.
  • NRC is trained with a ResNet-50 initialized with a MoCo backbone pretrained on ImageNet.
  • NRC is evaluated on the first 500 scenes of the validation set of CLEVR-3D.
  • NRC is evaluated on scenes with 5 or more objects from NYU Depth.
  • NRC is trained with a 1-layer GRU with 512 hidden units.
  • NRC is trained with a ResNet-50 architecture.
  • NRC is trained from scratch and from the video MoCo model trained on ProcTHOR.
  • NRC is trained with 10 slots.