Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Compositional representations of the world are a promising step towards enabling high-level scene understanding.
Learning such representations for complex scenes and tasks is an open challenge.
NRC is a method for learning object-centric representations through novel view reconstruction.
NRC representations transfer well to object navigation in THOR.
NRC performs unsupervised segmentation better than prior methods.
NRC improves on the task of depth ordering by 5.5% accuracy in THOR.

Paper Content

Introduction

Human perception and reasoning involves parsing the world into objects
Object-centric representations enable us to infer attributes such as geometry, affordances, and physical properties of objects
Learning models of the world without explicit supervision is an open challenge
Recent work focuses on reconstructing images from sparse encodings as an objective for learning object-centric representations
Neural rendering has enabled learning geometric representations of objects from 2D images
Recent work has leveraged scene reconstruction from different views as a source of supervision for learning object-centric representations
Neural Radiance Codebooks (NRC) learns a codebook of object categories which are composed to explain the appearance of 3D scenes from multiple views
NRC resolves the limitations of current 3D object-centric methods

Object-centric learning aims to build models of the world from building blocks
Prior works have demonstrated the potential to disentangle objects from images
Recent work has shown novel view reconstruction to be a promising approach for disentangling object representations
Neural rendering advances have enabled a host of new applications
Dictionary/codebook learning involves learning a specific set of atoms or codes
Object-centric codebooks help in semantic grounding for transfer between category instances

Method

Goal is to discover object categories without supervision
Learn priors over geometry and visual appearance
Model variation between instances belonging to each group
Explain all views of scene given set of object-codes
Use convolutional network to obtain spatial feature map
Project feature map to novel view using relative camera matrix
Assign feature vectors to categorical latent codes from codebook
Transform categorical codes to fine-grained instance codes
Render scene from novel view using volumetric renderer
Compare rendered image to ground truth using L2 pixel loss
Use nearest-neighbor assignment during inference
Use softmax relaxation of nearest-neighbors during backward pass
Use step functions with straight-through-estimator to add elements to codebook
Model intra-code variation in latent space using encoder
Render scene in novel view and compare to ground truth
Use MLP conditioned on instance codes to render scene
Evaluate learned representation on downstream tasks

Input image

Nrc ground truth

Unsupervised segmentation of real-world images
Vector assigned to nearest categorical code to obtain segmentation mask
NRC representation replaces pretrained network
Depth ordering task predicts which of two objects is closer to camera
MLP conditioned on instance codes to predict depth map

Experiments

NRC improves performance on unsupervised segmentation, object navigation, and depth ordering.
Prior works focused on unsupervised segmentation to measure decomposition quality.

Datasets

THOR consists of interactive home environments built in the Unity game engine
RoboTHOR is a variant of THOR used for sim2real benchmarking
Object navigation task consists of agent locating specified objects
RoboTHOR consists of 89 indoor scenes
ProcTHOR consists of procedurally generated indoor scenes
CLEVR-3D is a synthetic dataset used for unsupervised segmentation
ARI measures agreement between predicted and ground truth segmentations

Nyu depth

NYU Depth Dataset consists of images from real-world indoor scenes with depth and segmentation maps
Methods are trained on ProcThor dataset and evaluated on NYU Depth
NYU Depth chosen because it has similar object categories and scene layouts to THOR
Results reported using adjusted random index (ARI)

Results

NRC outperforms the best baseline by 3% in success rate and 20% in SPL
NRC has 18.8% improvement in SR and relative SPL over uORF and ObSuRF
NRC performs better than uORF and ObSuRF when navigating near furniture and other immovable objects

Experiments conducted in THOR to understand how well learned representations transfer from observational data to embodied navigation
Object navigation consists of an agent with the goal of moving through indoor scenes to specified objects
Dataset consists of 1.5 million video frames from 500 indoor scenes
Representations frozen following standard practice
Policy trained using DD-PPO for 200M steps
Success rate and success weighted by path length reported
Baselines include ObSuRF, uORF, Video MoCo, and EmbedCLIP

Depth ordering

Evaluated NRC on task of ordering objects based on depth from camera
Used ProcTHOR test dataset which provides dense depth and segmentation maps
Determined ground truth depth by computing mean depth of pixels associated with ground truth segmentation mask
Goal is to predict which object is closer
Evaluated 2,000 object pairs
Reported accuracy as number of correct orderings over total number of object pairs

Method

Depth ordering requires accurate segmentation and depth estimation
NRC has 5.5% and 10.3% higher depth ordering accuracy than ObSuRF and uORF respectively
NRC has better fine-grained localization of categorical latent codes which leads to better depth ordering
Ablation study shows that performance improves by ∼ 9% when intra-class variation is modeled

Ablation study

Learning the number of codes improves performance
Performance is matched if number of codes is found via hyper-parameter tuning
Differentiably learning the codebook size avoids computationally expensive hyper-parameter tuning

Limitations

Novel view reconstruction requires camera pose, which is not always available.
Some datasets provide data from inertial measurement units, but this is prone to drift.
Most object-centric prior works assume scenes are static, but this is rarely the case.
NRC is relatively efficient compared to other NeRF based methods, but still compute and memory intensive.
Object-centric light fields can reduce memory and compute costs.
Finding appropriate corresponding frames of videos is a challenge.

Conclusion

Compositional, object-centric understanding of the world is a fundamental characteristic of human vision.
Neural Radiance Field Codebooks (NRC) is a new approach for learning geometry-aware, object-centric representations.
NRC finds reoccurring geometric and visual similarities to form objects categories.
NRC improves performance on object navigation and depth ordering compared to strong baselines.
NRC is capable of scaling to complex scenes with more objects and greater diversity.
NRC shows relative ARI improvement over baselines for unsupervised segmentation.
NRC performs slightly worse on MatterPort3D compared to on THOR.
NRC outperforms ObSuRF and ImageNet pretraining on the Habitat Point Navigation Challenge.
NRC is trained with a sliding window of 5 frames.
NRC is trained with a ResNet-50 initialized with a MoCo backbone pretrained on ImageNet.
NRC is evaluated on the first 500 scenes of the validation set of CLEVR-3D.
NRC is evaluated on scenes with 5 or more objects from NYU Depth.
NRC is trained with a 1-layer GRU with 512 hidden units.
NRC is trained with a ResNet-50 architecture.
NRC is trained from scratch and from the video MoCo model trained on ProcTHOR.
NRC is trained with 10 slots.

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Input image#

Nrc ground truth#

Experiments#

Datasets#

Nyu depth#

Results#

Object navigation#

Depth ordering#

Method#

Ablation study#

Limitations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Input image

Nrc ground truth

Experiments

Datasets

Nyu depth

Results

Object navigation

Depth ordering

Method

Ablation study

Limitations

Conclusion