Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Humans use natural language to refer to 3D locations
- Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings into NeRF
- LERF learns a dense, multi-scale language field inside NeRF
- LERF can extract 3D relevancy maps for language prompts in real-time
- LERF enables zero-shot queries on 3D CLIP embeddings without relying on region proposals or masks
Paper Content
Introduction
- Neural Radiance Fields (NeRFs) can capture photorealistic digital representations of 3D scenes
- Natural language is an intuitive interface for interacting with a 3D scene
- Language Embedded Radiance Fields (LERF) grounds language within NeRF by optimizing embeddings from a vision-language model
- LERF preserves the integrity of CLIP embeddings at multiple scales, allowing it to handle a broad range of language queries
- LERF utilizes self-supervised DINO features to regularize the optimized language field
- LERF can localize both fine-grained and abstract queries across in-the-wild scenes
- LERF has potential use cases in robotics, analyzing vision-language models, and interacting with 3D scenes
Related work
- Open-Vocabulary Object Detection approaches lie on a spectrum from zero-shot to fully trained on segmentation datasets
- LSeg trains a 2D image encoder on labeled segmentation datasets
- CRIS and CLIPSeg train a 2D image decoder to output a relevancy map
- A common approach for 2D images is a two-stage framework with class-agnostic region or mask proposals
- OpenSeg and ViLD use CLIP to classify 2D regions from class-agnostic mask proposal networks
- Detic builds on existing two-stage object detector approaches
- OWL-ViT attaches lightweight object classification and localization heads after a pre-trained 2D image encoder
- LERF avoids region proposals by incorporating language embeddings in a dense, 3D, multiscale field
- Grad-CAM and attention-based methods provide a relevancy mapping between 2D images and text
- NeRF has an attractive property of averaging information across multiple views
- Semantic NeRF and Panoptic Lifting embed semantic information from semantic segmentation networks into 3D
- Distilled Feature Fields and Neural Feature Fusion Fields explore embedding pixel-aligned feature vectors into NeRF
- LERF embeds feature vectors into NeRF without fine-tuning
- 3D Language Grounding has been explored in a wide range of contexts
- VL-Maps and OpenScene build a 3D volume of language features which can be queried
- CLIP-Fields and NLMaps-SayCan fuse CLIP embeddings of crops into pointclouds
- ConceptFusion fuses CLIP features more densely in RGBD pointclouds
- LERF provides a new dense, volumetric interface for 3D text queries
Multi-scale supervision
- Supervising language field outputs requires querying language embeddings over image patches, not pixels.
- Pre-computing an image pyramid with multiple image crop scales and storing the CLIP embeddings of each crop.
- Randomly sampling ray origins uniformly throughout input views and randomly selecting a crop size (scale) for each.
- Performing trilinear interpolation between the embeddings from the 4 nearest crops for the scale above and below.
- Minimizing a loss between rendered and ground truth embeddings to maximize cosine similarity between the two.
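The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes each pyramid level stores its crop embeddings on a regular grid over the image, uses random vectors in place of real CLIP embeddings, and takes negative cosine similarity as the loss. The names `bilinear`, `target_embedding`, `clip_loss`, `pyramid`, and `scales` are hypothetical.

```python
import numpy as np

def bilinear(grid, u, v):
    """Bilinearly interpolate a (H, W, D) grid at normalized coords u, v in [0, 1]."""
    h, w, _ = grid.shape
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * grid[y0, x0] + fx * grid[y0, x1]
    bottom = (1 - fx) * grid[y1, x0] + fx * grid[y1, x1]
    return (1 - fy) * top + fy * bottom

def target_embedding(pyramid, scales, u, v, s):
    """Trilinear lookup: bilinear over the 4 nearest crops at the scale below
    and the scale above, then linear interpolation across scale."""
    scales = np.asarray(scales)
    k = int(np.clip(np.searchsorted(scales, s) - 1, 0, len(scales) - 2))
    t = np.clip((s - scales[k]) / (scales[k + 1] - scales[k]), 0.0, 1.0)
    e = (1 - t) * bilinear(pyramid[k], u, v) + t * bilinear(pyramid[k + 1], u, v)
    return e / np.linalg.norm(e)

def clip_loss(rendered, target):
    """Negative cosine similarity; minimizing it maximizes the similarity."""
    rendered = rendered / np.linalg.norm(rendered)
    return -float(rendered @ target)

# Toy pyramid (assumed layout): 3 crop scales, coarser grids for larger crops.
rng = np.random.default_rng(0)
scales = [0.05, 0.2, 0.5]
pyramid = [rng.normal(size=(n, n, 8)) for n in (16, 8, 4)]

u, v = rng.uniform(size=2)               # random pixel location (ray origin)
s = rng.uniform(scales[0], scales[-1])   # random crop size for that ray
target = target_embedding(pyramid, scales, u, v, s)
loss = clip_loss(target, target)         # a perfect render reaches the minimum
```

In a real pipeline the `rendered` vector would come from volume-rendering the language field along the ray at scale `s`; here the target is reused only to show that the loss bottoms out when the two embeddings agree.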
DINO regularization
- Naively implementing LERF as described produces cohesive results, but the resulting relevancy maps can be patchy and contain outliers.
- To mitigate this, an additional field F_dino is trained which outputs a DINO feature at each point.
- DINO has been shown to exhibit emergent object decomposition properties and distills well into 3D fields.
- F_dino is supervised for each ray with the DINO feature of the pixel it corresponds to.
- DINO is not used explicitly during inference; it serves only as an extra regularizer during training.
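One way the DINO term might be combined with the CLIP term is sketched below. The mean-squared-error form of the DINO loss is an assumption (this summary does not state it), the function names are hypothetical, and the default weight `lam=0.01` is the λ value listed under Implementation details.

```python
import numpy as np

def dino_loss(rendered_dino, target_dino):
    """Mean squared error between rendered and target DINO features (assumed form)."""
    return float(np.mean((rendered_dino - target_dino) ** 2))

def total_feature_loss(rendered_clip, target_clip, rendered_dino, target_dino, lam=0.01):
    """CLIP term (negative cosine similarity) weighted by lam, plus the DINO term."""
    cos = rendered_clip @ target_clip / (
        np.linalg.norm(rendered_clip) * np.linalg.norm(target_clip))
    return -lam * float(cos) + dino_loss(rendered_dino, target_dino)
```

Because both terms are computed per ray, the DINO supervision acts on the same samples as the language supervision, which is how it can smooth the language field during training.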
Field architecture
- Intuitively, optimizing a language embedding in 3D should not influence the distribution of density in the underlying scene representation.
- Two separate networks are trained: one for feature vectors and the other for standard NeRF outputs.
- Gradients from language and feature vector networks do not affect NeRF outputs.
- Language and radiance fields are represented with a multi-resolution hashgrid.
- Querying LERF involves obtaining a relevancy score for a rendered embedding and automatically selecting a scale.
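The querying step can be illustrated with a small sketch. It assumes the relevancy score is a pairwise softmax between the query embedding and each of a few "canonical" phrase embeddings, taking the minimum over the canonical set, and that the scale is chosen as the one with the highest score. This summary does not spell out the formula, so treat both as assumptions; `relevancy`, `best_scale`, and `render_at_scale` are hypothetical names, and the candidate scales are arbitrary.

```python
import numpy as np

def relevancy(rendered, query, canon):
    """Min over canonical embeddings of the pairwise softmax of query vs. canonical."""
    q = np.exp(rendered @ query)
    c = np.exp(canon @ rendered)          # shape (n_canon,)
    return float(np.min(q / (q + c)))

def best_scale(render_at_scale, query, canon, scales):
    """Pick the scale whose rendered embedding gives the highest relevancy."""
    scores = [relevancy(render_at_scale(s), query, canon) for s in scales]
    i = int(np.argmax(scores))
    return scales[i], scores[i]

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = unit(rng.normal(size=8))
canon = unit(rng.normal(size=(4, 8)))     # stand-ins for canonical phrase embeddings
noise = unit(rng.normal(size=8))

def render_at_scale(s):
    """Toy stand-in for rendering the language field along one ray at scale s."""
    return unit(query + 3.0 * abs(s - 1.0) * noise)

scales = np.linspace(0.0, 2.0, 5)         # arbitrary candidate scales
scale, score = best_scale(render_at_scale, query, canon, scales)
```

With this formulation a score of 0.5 means the rendered embedding is no closer to the query than to the nearest canonical phrase.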
Implementation details
- Implemented LERF in Nerfstudio
- Reduced number of LERF samples from 48 to 24
- Used OpenCLIP ViT-B/16 model trained on LAION-2B dataset
- Hashgrid used for language features has 32 layers with resolutions from 16 to 512
- CLIP MLP used for F_lang has 3 hidden layers with width 256
- DINO MLP for F_dino has 1 hidden layer of dimension 256
- Adam optimizer used for proposal networks and fields with weight decay 10^-9
- Exponential learning rate scheduler from 10^-2 to 10^-3 over first 5000 training steps
- Trained on NVIDIA A100, takes roughly 20GB of memory
- λ used in weighting CLIP loss is 0.01
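As a small worked example of the learning rate schedule listed above, the sketch below interpolates log-linearly between the two rates. The exact functional form and the choice to hold the final rate constant after 5000 steps are assumptions; `exponential_lr` is a hypothetical helper, not a Nerfstudio API.

```python
def exponential_lr(step, lr_init=1e-2, lr_final=1e-3, max_steps=5000):
    """Log-linear interpolation from lr_init to lr_final, held constant afterwards."""
    t = min(max(step, 0), max_steps) / max_steps
    return lr_init * (lr_final / lr_init) ** t

# Halfway through the decay the rate is the geometric mean of the endpoints.
halfway = exponential_lr(2500)   # about 3.16e-3
```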
Experiments
- LERF can process a variety of natural language queries.
- Existing 3D scan datasets are limited in scope.
- 13 scenes were collected using the iPhone app Polycam.
Qualitative results
- Relevancy score is visualized by normalizing the colormap from 50% to the maximum relevancy.
- Visualizations of all scenes can be found in the Appendix and Fig. 3.
- LERF captures language features of a scene at different levels of detail.
- Objects can be relevant to multiple queries.
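The colormap normalization described above might look like the following sketch, which maps relevancy values from 0.5 up to the per-query maximum onto [0, 1]. Clipping values below 0.5 to zero is an assumption about how the lower end is handled, and `normalize_relevancy` is a hypothetical name.

```python
import numpy as np

def normalize_relevancy(rel, floor=0.5):
    """Map [floor, rel.max()] to [0, 1]; values below floor clip to 0."""
    top = float(rel.max())
    if top <= floor:
        return np.zeros_like(rel)      # nothing exceeds the floor
    return np.clip((rel - floor) / (top - floor), 0.0, 1.0)

rel = np.array([0.2, 0.5, 0.65, 0.8])
normalized = normalize_relevancy(rel)  # approximately [0, 0, 0.5, 1]
```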
Existence determination
- Evaluated if LERF can detect objects in a scene
- Labeled ground truth existence for 5 scenes
- Collected two sets of labels: COCO and own long-tail labels
- LERF determines if object exists by rendering pointcloud and returning “True” if any point has relevancy score over threshold
- Compared against distilling LSeg features into 3D
- Removed scale as a parameter to F_lang for the LSeg baseline
- Reported precision-recall curves over relevancy score thresholds
- LSeg only performs well on common objects in its training set
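The existence test and the precision-recall sweep described above can be sketched in plain Python. The function names and toy scores are hypothetical, and reporting precision as 1.0 when nothing is predicted positive is a convention chosen here, not something stated in the source.

```python
def exists(point_scores, threshold):
    """True if any point in the rendered pointcloud exceeds the threshold."""
    return any(s > threshold for s in point_scores)

def precision_recall(max_scores, labels, thresholds):
    """max_scores[i]: highest relevancy for query i; labels[i]: ground-truth existence."""
    curve = []
    for t in thresholds:
        pred = [s > t for s in max_scores]
        tp = sum(p and l for p, l in zip(pred, labels))
        fp = sum(p and not l for p, l in zip(pred, labels))
        fn = sum((not p) and l for p, l in zip(pred, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        curve.append((t, precision, recall))
    return curve

# Toy data: four queries, two of which truly exist in the scene.
max_scores = [0.9, 0.7, 0.6, 0.4]
labels = [True, True, False, False]
curve = precision_recall(max_scores, labels, [0.5, 0.65, 0.8])
```

Since `exists` only needs the maximum score over all points, the sweep can work from one precomputed maximum per query rather than re-rendering for each threshold.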
Localization
- Evaluated LERF to localize text prompts in a scene
- Rendered novel views and labeled bounding boxes for 72 objects across 5 scenes
- Compared against LSeg and OWL-ViT
- Results suggest LERF outperforms LSeg in 3D for localizing relevant parts of a scene
- OWL-ViT outperforms LSeg in 3D, but suffers compared to LERF on long-tail queries
- LERF struggles with visually similar objects and global/spatial reasoning
- Ablating multi-scale CLIP supervision significantly impairs LERF’s ability to handle queries of all scales
- Language queries from LERF often exhibit “bag-of-words” behavior
- LERF requires known calibrated camera matrices and NeRF-quality multi-view captures
- Relevancy maps across the scene can group similar regions to a query together, or provide too many relevant regions
- Visualized relevancy maps and RGB renders of the kitchen scene and figurines scene after 1k, 2k, 6k, and 30k steps
- Provided raw relevancy scores for the queries in Fig. 1 of the main text
- Language and visual ambiguities from CLIP can cause incorrect relevancy renders
- LERF can improve relevancy maps with more specific queries
- Lack of geometric separation can cause relevancy maps to blur into other surrounding objects
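For the bounding-box evaluation mentioned at the start of this section, one plausible success criterion is to check whether the highest-relevancy pixel of a rendered view falls inside the labeled box. This summary does not state the criterion, so it is an assumption here, and `localized` with its `(x_min, y_min, x_max, y_max)` box layout is hypothetical.

```python
import numpy as np

def localized(relevancy_map, box):
    """True if the highest-relevancy pixel falls inside box = (x_min, y_min, x_max, y_max)."""
    y, x = np.unravel_index(np.argmax(relevancy_map), relevancy_map.shape)
    x_min, y_min, x_max, y_max = box
    return bool(x_min <= x <= x_max and y_min <= y <= y_max)

rel = np.zeros((4, 6))
rel[2, 3] = 0.9                      # peak at row 2, column 3
hit = localized(rel, (2, 1, 4, 3))   # box covers the peak
miss = localized(rel, (0, 0, 1, 1))  # box does not
```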