Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Humans use natural language to refer to 3D locations
Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings into NeRF
LERF learns a dense, multi-scale language field inside NeRF
LERF can extract 3D relevancy maps for language prompts in real-time
LERF enables zero-shot queries on 3D CLIP embeddings without relying on region proposals or masks

Paper Content

Introduction

Neural Radiance Fields (NeRFs) can capture photorealistic digital representations of 3D scenes
Natural language is an intuitive interface for interacting with a 3D scene
Language Embedded Radiance Fields (LERF) grounds language within NeRF by optimizing embeddings from a vision-language model
LERF preserves the integrity of CLIP embeddings at multiple scales, allowing it to handle a broad range of language queries
LERF utilizes self-supervised DINO features to regularize the optimized language field
LERF can localize both fine-grained and abstract queries across in-the-wild scenes
LERF has potential use cases in robotics, analyzing vision-language models, and interacting with 3D scenes

Open-Vocabulary Object Detection approaches lie on a spectrum from zero-shot to fully trained on segmentation datasets
LSeg trains a 2D image encoder on labeled segmentation datasets
CRIS and CLIPSeg train a 2D image decoder to output a relevancy map
Common approach for 2D images is a two-stage framework with class-agnostic region or mask proposals
OpenSeg and ViLD use CLIP to classify 2D regions from class-agnostic mask proposal networks
Detic builds on existing two-stage object detector approaches
OWL-ViT attaches lightweight object classification and localization heads after a pre-trained 2D image encoder
LERF avoids region proposals by incorporating language embeddings in a dense, 3D, multiscale field
Grad-CAM and attention-based methods provide a relevancy mapping between 2D images and text
NeRF has an attractive property of averaging information across multiple views
Semantic NeRF and Panoptic Lifting embed semantic information from semantic segmentation networks into 3D
Distilled Feature Fields and Neural Feature Fusion Fields explore embedding pixel-aligned feature vectors into NeRF
LERF embeds feature vectors into NeRF without fine-tuning
3D Language Grounding has been explored in a wide range of contexts
VL-Maps and Open-Scene build a 3D volume of language features which can be queried
CLIP-Fields and NLMaps-SayCan fuse CLIP embeddings of crops into pointclouds
ConceptFusion fuses CLIP features more densely in RGBD pointclouds
LERF provides a new dense, volumetric interface for 3D text queries

Multi-scale supervision

Supervising language field outputs requires querying language embeddings over image patches, not pixels.
Pre-computing an image pyramid with multiple image crop scales and storing the CLIP embeddings of each crop.
Randomly sampling ray origins uniformly throughout input views and randomly selecting a size for each.
Performing trilinear interpolation between the embeddings from the 4 nearest crops for the scale above and below.
Minimizing a loss between rendered and ground truth embeddings to maximize cosine similarity between the two.

Dino regularization

Naïvely implementing LERF produces cohesive results, but can be patchy and contain outliers.
To mitigate this, a field F dino is trained which outputs a DINO feature at each point.
DINO has been shown to exhibit emergent object decomposition properties and distills well into 3D fields.
F dino is supervised for each ray with the DINO feature it corresponds to.
DINO is used explicitly during inference and serves as an extra regularizer during training.

Field architecture

Intuitively, optimizing a language embedding in 3D should not influence the distribution of density in the underlying scene representation.
Two separate networks are trained: one for feature vectors and the other for standard NeRF outputs.
Gradients from language and feature vector networks do not affect NeRF outputs.
Language and radiance fields are represented with a multi-resolution hashgrid.
Querying LERF involves obtaining a relevancy score for a rendered embedding and automatically selecting a scale.

Implementation details

Implemented LERF in Nerfstudio
Reduced number of LERF samples from 48 to 24
Used OpenClip ViT-B/16 model trained on LAION-2B dataset
Hashgrid used for language features has 32 layers from 16 to 512
CLIP MLP used for F lang has 3 hidden layers with width 256
DINO MLP for F DINO has 1 hidden layer of dimension 256
Adam optimizer used for proposal networks and fields with weight decay 10-9
Exponential learning rate scheduler from 10-2 to 10-3 over first 5000 training steps
Trained on NVIDIA A100, takes roughly 20GB of memory
λ used in weighting CLIP loss is 0.01

Experiments

LERF can process a variety of natural language queries.
Existing 3D scan datasets are limited in scope.
13 scenes were collected using the iPhone app Polycam.

Qualitative results

Relevancy score is visualized by normalizing the colormap from 50% to the maximum relevancy.
Visualizations of all scenes can be found in the Appendix and Fig. 3.
LERF captures language features of a scene at different levels of detail.
Objects can be relevant to multiple queries.

Existence determination

Evaluated if LERF can detect objects in a scene
Labeled ground truth existence for 5 scenes
Collected two sets of labels: COCO and own long-tail labels
LERF determines if object exists by rendering pointcloud and returning “True” if any point has relevancy score over threshold
Compared against distilling LSeg features into 3D
Removed scale as parameter to F lang for LSeg
Reported precision-recall curves over relevancy score thresholds
LSeg only performs well on common objects in its training set

Localization

Evaluated LERF to localize text prompts in a scene
Rendered novel views and labeled bounding boxes for 72 objects across 5 scenes
Compared against LSeg and OwL-ViT
Results suggest LERF outperforms LSeg in 3D for localizing relevant parts of a scene
OwL-ViT outperforms LSeg in 3D, but suffers compared to LERF on long-tail queries
LERF struggles with visually similar objects and global/spatial reasoning
Ablated multi-scale CLIP supervision and found it significantly impairs LERF’s ability to handle queries of all scales
Language queries from LERF often exhibit “bag-of-words” behavior
LERF requires known calibrated camera matrices and NeRF-quality multi-view captures
Relevancy maps across the scene can group similar regions to a query together, or provide too many relevant regions
Visualized relevancy maps and RGB renders of the kitchen scene and figurines scene after 1k, 2k, 6k, and 30k steps
Provided raw relevancy scores for the queries in Fig. 1 of the main text
Language and visual ambiguities from CLIP can cause incorrect relevancy renders
LERF can improve relevancy maps with more specific queries
Lack of geometric separation can cause relevancy maps to blur into other surrounding objects

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Multi-scale supervision#

Dino regularization#

Field architecture#

Implementation details#

Experiments#

Qualitative results#

Existence determination#

Localization#

Link to paper

Abstract

Paper Content

Introduction

Related work

Multi-scale supervision

Dino regularization

Field architecture

Implementation details

Experiments

Qualitative results

Existence determination

Localization