Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Compositional representations of the world are a promising step towards enabling high-level scene understanding.
- Learning such representations for complex scenes and tasks is an open challenge.
- NRC is a method for learning object-centric representations through novel view reconstruction.
- NRC representations transfer well to object navigation in THOR.
- NRC performs unsupervised segmentation better than prior methods.
- NRC improves on the task of depth ordering by 5.5% accuracy in THOR.
Paper Content
Introduction
- Human perception and reasoning involves parsing the world into objects
- Object-centric representations enable us to infer attributes such as geometry, affordances, and physical properties of objects
- Learning models of the world without explicit supervision is an open challenge
- Recent work focuses on reconstructing images from sparse encodings as an objective for learning object-centric representations
- Neural rendering has enabled learning geometric representations of objects from 2D images
- Recent work has leveraged scene reconstruction from different views as a source of supervision for learning object-centric representations
- Neural Radiance Codebooks (NRC) learns a codebook of object categories which are composed to explain the appearance of 3D scenes from multiple views
- NRC resolves the limitations of current 3D object-centric methods
Related work
- Object-centric learning aims to build models of the world from building blocks
- Prior works have demonstrated the potential to disentangle objects from images
- Recent work has shown novel view reconstruction to be a promising approach for disentangling object representations
- Neural rendering advances have enabled a host of new applications
- Dictionary/codebook learning involves learning a specific set of atoms or codes
- Object-centric codebooks help in semantic grounding for transfer between category instances
Method
- Goal is to discover object categories without supervision
- Learn priors over geometry and visual appearance
- Model variation between instances belonging to each group
- Explain all views of scene given set of object-codes
- Use convolutional network to obtain spatial feature map
- Project feature map to novel view using relative camera matrix
- Assign feature vectors to categorical latent codes from codebook
- Transform categorical codes to fine-grained instance codes
- Render scene from novel view using volumetric renderer
- Compare rendered image to ground truth using L2 pixel loss
- Use nearest-neighbor assignment during inference
- Use softmax relaxation of nearest-neighbors during backward pass
- Use step functions with straight-through-estimator to add elements to codebook
- Model intra-code variation in latent space using encoder
- Render scene in novel view and compare to ground truth
- Use MLP conditioned on instance codes to render scene
- Evaluate learned representation on downstream tasks
Input image
Nrc ground truth
- Unsupervised segmentation of real-world images
- Vector assigned to nearest categorical code to obtain segmentation mask
- NRC representation replaces pretrained network
- Depth ordering task predicts which of two objects is closer to camera
- MLP conditioned on instance codes to predict depth map
Experiments
- NRC improves performance on unsupervised segmentation, object navigation, and depth ordering.
- Prior works focused on unsupervised segmentation to measure decomposition quality.
Datasets
- THOR consists of interactive home environments built in the Unity game engine
- RoboTHOR is a variant of THOR used for sim2real benchmarking
- Object navigation task consists of agent locating specified objects
- RoboTHOR consists of 89 indoor scenes
- ProcTHOR consists of procedurally generated indoor scenes
- CLEVR-3D is a synthetic dataset used for unsupervised segmentation
- ARI measures agreement between predicted and ground truth segmentations
Nyu depth
- NYU Depth Dataset consists of images from real-world indoor scenes with depth and segmentation maps
- Methods are trained on ProcThor dataset and evaluated on NYU Depth
- NYU Depth chosen because it has similar object categories and scene layouts to THOR
- Results reported using adjusted random index (ARI)
Results
- NRC outperforms the best baseline by 3% in success rate and 20% in SPL
- NRC has 18.8% improvement in SR and relative SPL over uORF and ObSuRF
- NRC performs better than uORF and ObSuRF when navigating near furniture and other immovable objects
Object navigation
- Experiments conducted in THOR to understand how well learned representations transfer from observational data to embodied navigation
- Object navigation consists of an agent with the goal of moving through indoor scenes to specified objects
- Dataset consists of 1.5 million video frames from 500 indoor scenes
- Representations frozen following standard practice
- Policy trained using DD-PPO for 200M steps
- Success rate and success weighted by path length reported
- Baselines include ObSuRF, uORF, Video MoCo, and EmbedCLIP
Depth ordering
- Evaluated NRC on task of ordering objects based on depth from camera
- Used ProcTHOR test dataset which provides dense depth and segmentation maps
- Determined ground truth depth by computing mean depth of pixels associated with ground truth segmentation mask
- Goal is to predict which object is closer
- Evaluated 2,000 object pairs
- Reported accuracy as number of correct orderings over total number of object pairs
Method
- Depth ordering requires accurate segmentation and depth estimation
- NRC has 5.5% and 10.3% higher depth ordering accuracy than ObSuRF and uORF respectively
- NRC has better fine-grained localization of categorical latent codes which leads to better depth ordering
- Ablation study shows that performance improves by โผ 9% when intra-class variation is modeled
Ablation study
- Learning the number of codes improves performance
- Performance is matched if number of codes is found via hyper-parameter tuning
- Differentiably learning the codebook size avoids computationally expensive hyper-parameter tuning
Limitations
- Novel view reconstruction requires camera pose, which is not always available.
- Some datasets provide data from inertial measurement units, but this is prone to drift.
- Most object-centric prior works assume scenes are static, but this is rarely the case.
- NRC is relatively efficient compared to other NeRF based methods, but still compute and memory intensive.
- Object-centric light fields can reduce memory and compute costs.
- Finding appropriate corresponding frames of videos is a challenge.
Conclusion
- Compositional, object-centric understanding of the world is a fundamental characteristic of human vision.
- Neural Radiance Field Codebooks (NRC) is a new approach for learning geometry-aware, object-centric representations.
- NRC finds reoccurring geometric and visual similarities to form objects categories.
- NRC improves performance on object navigation and depth ordering compared to strong baselines.
- NRC is capable of scaling to complex scenes with more objects and greater diversity.
- NRC shows relative ARI improvement over baselines for unsupervised segmentation.
- NRC performs slightly worse on MatterPort3D compared to on THOR.
- NRC outperforms ObSuRF and ImageNet pretraining on the Habitat Point Navigation Challenge.
- NRC is trained with a sliding window of 5 frames.
- NRC is trained with a ResNet-50 initialized with a MoCo backbone pretrained on ImageNet.
- NRC is evaluated on the first 500 scenes of the validation set of CLEVR-3D.
- NRC is evaluated on scenes with 5 or more objects from NYU Depth.
- NRC is trained with a 1-layer GRU with 512 hidden units.
- NRC is trained with a ResNet-50 architecture.
- NRC is trained from scratch and from the video MoCo model trained on ProcTHOR.
- NRC is trained with 10 slots.