Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Humans use natural language to refer to 3D locations
  • Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings into NeRF
  • LERF learns a dense, multi-scale language field inside NeRF
  • LERF can extract 3D relevancy maps for language prompts in real-time
  • LERF enables zero-shot queries on 3D CLIP embeddings without relying on region proposals or masks

Paper Content

Introduction

  • Neural Radiance Fields (NeRFs) can capture photorealistic digital representations of 3D scenes
  • Natural language is an intuitive interface for interacting with a 3D scene
  • Language Embedded Radiance Fields (LERF) grounds language within NeRF by optimizing embeddings from a vision-language model
  • LERF preserves the integrity of CLIP embeddings at multiple scales, allowing it to handle a broad range of language queries
  • LERF utilizes self-supervised DINO features to regularize the optimized language field
  • LERF can localize both fine-grained and abstract queries across in-the-wild scenes
  • LERF has potential use cases in robotics, analyzing vision-language models, and interacting with 3D scenes
  • Open-Vocabulary Object Detection approaches lie on a spectrum from zero-shot to fully trained on segmentation datasets
  • LSeg trains a 2D image encoder on labeled segmentation datasets
  • CRIS and CLIPSeg train a 2D image decoder to output a relevancy map
  • Common approach for 2D images is a two-stage framework with class-agnostic region or mask proposals
  • OpenSeg and ViLD use CLIP to classify 2D regions from class-agnostic mask proposal networks
  • Detic builds on existing two-stage object detector approaches
  • OWL-ViT attaches lightweight object classification and localization heads after a pre-trained 2D image encoder
  • LERF avoids region proposals by incorporating language embeddings in a dense, 3D, multiscale field
  • Grad-CAM and attention-based methods provide a relevancy mapping between 2D images and text
  • NeRF has an attractive property of averaging information across multiple views
  • Semantic NeRF and Panoptic Lifting embed semantic information from semantic segmentation networks into 3D
  • Distilled Feature Fields and Neural Feature Fusion Fields explore embedding pixel-aligned feature vectors into NeRF
  • LERF embeds feature vectors into NeRF without fine-tuning
  • 3D Language Grounding has been explored in a wide range of contexts
  • VL-Maps and Open-Scene build a 3D volume of language features which can be queried
  • CLIP-Fields and NLMaps-SayCan fuse CLIP embeddings of crops into pointclouds
  • ConceptFusion fuses CLIP features more densely in RGBD pointclouds
  • LERF provides a new dense, volumetric interface for 3D text queries

Multi-scale supervision

  • Supervising language field outputs requires querying language embeddings over image patches, not pixels.
  • Pre-computing an image pyramid with multiple image crop scales and storing the CLIP embeddings of each crop.
  • Randomly sampling ray origins uniformly throughout input views and randomly selecting a size for each.
  • Performing trilinear interpolation between the embeddings from the 4 nearest crops for the scale above and below.
  • Minimizing a loss between rendered and ground truth embeddings to maximize cosine similarity between the two.

Dino regularization

  • Naïvely implementing LERF produces cohesive results, but can be patchy and contain outliers.
  • To mitigate this, a field F dino is trained which outputs a DINO feature at each point.
  • DINO has been shown to exhibit emergent object decomposition properties and distills well into 3D fields.
  • F dino is supervised for each ray with the DINO feature it corresponds to.
  • DINO is used explicitly during inference and serves as an extra regularizer during training.

Field architecture

  • Intuitively, optimizing a language embedding in 3D should not influence the distribution of density in the underlying scene representation.
  • Two separate networks are trained: one for feature vectors and the other for standard NeRF outputs.
  • Gradients from language and feature vector networks do not affect NeRF outputs.
  • Language and radiance fields are represented with a multi-resolution hashgrid.
  • Querying LERF involves obtaining a relevancy score for a rendered embedding and automatically selecting a scale.

Implementation details

  • Implemented LERF in Nerfstudio
  • Reduced number of LERF samples from 48 to 24
  • Used OpenClip ViT-B/16 model trained on LAION-2B dataset
  • Hashgrid used for language features has 32 layers from 16 to 512
  • CLIP MLP used for F lang has 3 hidden layers with width 256
  • DINO MLP for F DINO has 1 hidden layer of dimension 256
  • Adam optimizer used for proposal networks and fields with weight decay 10-9
  • Exponential learning rate scheduler from 10-2 to 10-3 over first 5000 training steps
  • Trained on NVIDIA A100, takes roughly 20GB of memory
  • λ used in weighting CLIP loss is 0.01

Experiments

  • LERF can process a variety of natural language queries.
  • Existing 3D scan datasets are limited in scope.
  • 13 scenes were collected using the iPhone app Polycam.

Qualitative results

  • Relevancy score is visualized by normalizing the colormap from 50% to the maximum relevancy.
  • Visualizations of all scenes can be found in the Appendix and Fig. 3.
  • LERF captures language features of a scene at different levels of detail.
  • Objects can be relevant to multiple queries.

Existence determination

  • Evaluated if LERF can detect objects in a scene
  • Labeled ground truth existence for 5 scenes
  • Collected two sets of labels: COCO and own long-tail labels
  • LERF determines if object exists by rendering pointcloud and returning “True” if any point has relevancy score over threshold
  • Compared against distilling LSeg features into 3D
  • Removed scale as parameter to F lang for LSeg
  • Reported precision-recall curves over relevancy score thresholds
  • LSeg only performs well on common objects in its training set

Localization

  • Evaluated LERF to localize text prompts in a scene
  • Rendered novel views and labeled bounding boxes for 72 objects across 5 scenes
  • Compared against LSeg and OwL-ViT
  • Results suggest LERF outperforms LSeg in 3D for localizing relevant parts of a scene
  • OwL-ViT outperforms LSeg in 3D, but suffers compared to LERF on long-tail queries
  • LERF struggles with visually similar objects and global/spatial reasoning
  • Ablated multi-scale CLIP supervision and found it significantly impairs LERF’s ability to handle queries of all scales
  • Language queries from LERF often exhibit “bag-of-words” behavior
  • LERF requires known calibrated camera matrices and NeRF-quality multi-view captures
  • Relevancy maps across the scene can group similar regions to a query together, or provide too many relevant regions
  • Visualized relevancy maps and RGB renders of the kitchen scene and figurines scene after 1k, 2k, 6k, and 30k steps
  • Provided raw relevancy scores for the queries in Fig. 1 of the main text
  • Language and visual ambiguities from CLIP can cause incorrect relevancy renders
  • LERF can improve relevancy maps with more specific queries
  • Lack of geometric separation can cause relevancy maps to blur into other surrounding objects