Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent works have shown promising results on novel view synthesis from single or few images.
Models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing.
Proposed framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models.
FeatureNeRF maps 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks.
Evaluated on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation.
Extensive experiments demonstrate effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor.

Paper Content

Introduction

Neural fields are used to represent visual signals
Neural Radiance Fields (NeRF) can synthesize views from dense input images
Followup works reduce dependency on dense inputs and generalize NeRF to unseen objects
Question: Can NeRF be adapted to learn 3D representations for general 3D applications?
Recent years have seen rise of vision foundation models that are pre-trained on web-scale image datasets
Goal: Leverage 2D foundation models to obtain generalizable 3D features
FeatureNeRF: Unified framework for learning generalizable NeRFs from distilling pretrained 2D vision foundation models
FeatureNeRF extracts deep features from NeRFs as generalizable 3D visual descriptors
FeatureNeRF can predict 3D semantic feature volume from single or few images
Used for tasks such as semantic keypoint transfer and object part co-segmentation
Evaluated with two foundation models: DINO and Latent Diffusion

Neural fields have been used to represent various visual signals
NeRF is used for photo-realistic novel view synthesis
Follow-up methods have been proposed to learn generalizable NeRFs from large-scale multi-view image datasets
Foundation models have been used to transfer to various vision tasks
Distillation has been used for model compression and knowledge transfer
Semantic correspondences learning aims to find corresponding points between visual observations
Our work is the first to address semantic correspondences in both 2D and 3D space with only 2D observations

Method

Presents FeatureNeRF, a framework for learning generalizable NeRF from vision foundation models
Explains feature distillation process
Introduces how to learn internal NeRF features for 3D semantic understanding and downstream applications

Preliminary: generalizable nerf

Neural Radiance Fields (NeRF) consists of two functions that map 3D points to density and color.
NeRF is optimized to a single scene with multi-view posed images by minimizing a reconstruction loss.
NeRF can be conditioned on an input image by passing it through an encoder and MLPs to predict appearance and geometry.

Feature distillation from foundation models

NeRFs can be extended to predict other quantities of interest
SemanticNeRF and PanopticNeRF predict segmentation labels
This paper aims to transfer knowledge from a pre-trained foundation model to NeRFs
A branch is added to output a high-dimensional feature vector
FeatureNeRF is trained jointly for image reconstruction and feature distillation

Learning internal nerf features for 3d semantic understanding

Features are distilled from foundation models for 3D semantic understanding.
A new coordinate loss is introduced to learn spatial-aware NeRF features.
Features from intermediate layers are used as continuous 3D visual descriptors.

Applications of featurenerf

Learned 3D semantic NeRF feature function is effective for various downstream applications
Simple, zero-shot methodologies are used to validate the proposed representations
Cosine similarity is used to measure distance in the feature space
FeatureNeRF model can be leveraged to 3D tasks

Experiments

Experimental setting

Experiments conducted on 6 categories from ShapeNet dataset
Evaluated using annotations from KeypointNet, ShapeNet part dataset and PartNet
Trained on CO3D dataset and evaluated keypoint transfer performance on Spair dataset
Shapes normalized so longest edges of bounding box are equal
50 random camera poses from upper hemisphere sampled for training set
50 fixed camera poses used for validation and testing sets
RGB images rendered with resolution of 128x128
Level-1 annotations used for PartNet
Compared to DINO and Latent Diffusion models
Patch size of DINO is 8, final feature map has dimension 32x32x384
Language prompts used to condition denoising modules are class names from ShapeNet

2d semantic understanding tasks

Evaluate FeatureNeRF on tasks of 2D keypoints transfer and part segmentation labels transfer
Report percentage of predicted keypoints whose distances from their corresponding ground truths are below thresholds of (2.5, 5.0, 7.5, 10.0) pixels in the target image
Calculate mean intersection over union (mIoU) over every part category for each object class
FeatureNeRF distilled with DINO features significantly outperforms other approaches for all 6 categories
FeatureNeRF learns a 3D representation from 2D observations
FeatureNeRF produces better boundaries and preserves details like small parts
Fine-tune model on CO3D datasets and evaluate keypoints transfer performance on two categories from the SPair dataset

3d semantic understanding tasks

Proposed method is validated on 3D semantic understanding tasks
Metrics for 3D tasks are 3D versions of their counterparts in 2D tasks
Method outperforms all baselines in 3D tasks
Method is able to transfer 3D keypoints and segmentation labels accurately

Ablation study

Ablated coordinate loss and internal NeRF features for semantic understanding
Conducted experiments of 2D parts co-segmentation on Chair and Plane classes
Results show design choices boost performance
Compared performance of novel-view synthesis with PixelNeRF, found comparable performance

Conclusion

Presents FeatureNeRF, a unified framework for learning generalizable NeRFs from distilling 2D vision foundation models
Explores use of internal NeRF features as 3D visual descriptors
Predicts 3D semantic feature volume from single image
Demonstrates effectiveness on 2D/3D keypoints transfer and part co-segmentation
Compares FeatureNeRF to PWarpC, outperforms under smaller pixel distances
Leverages 3D feature volume for editing applications, example of 3D part texture swapping
Ablation study shows coordinate loss and use of internal NeRF features boost performance
Comparable performance with PixelNeRF on novel-view synthesis

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Preliminary: generalizable nerf#

Feature distillation from foundation models#

Learning internal nerf features for 3d semantic understanding#

Applications of featurenerf#

Experiments#

Experimental setting#

2d semantic understanding tasks#

3d semantic understanding tasks#

Ablation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Preliminary: generalizable nerf

Feature distillation from foundation models

Learning internal nerf features for 3d semantic understanding

Applications of featurenerf

Experiments

Experimental setting

2d semantic understanding tasks

3d semantic understanding tasks

Ablation study

Conclusion