Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent works have shown promising results on novel view synthesis from single or few images.
  • These models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing.
  • The paper proposes FeatureNeRF, a framework that learns generalizable NeRFs by distilling pre-trained vision foundation models.
  • FeatureNeRF maps 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks.
  • Evaluated on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation.
  • Extensive experiments demonstrate effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor.

Paper Content

Introduction

  • Neural fields are used to represent visual signals
  • Neural Radiance Fields (NeRF) can synthesize views from dense input images
  • Follow-up works reduce the dependency on dense inputs and generalize NeRF to unseen objects
  • Question: Can NeRF be adapted to learn 3D representations for general 3D applications?
  • Recent years have seen the rise of vision foundation models pre-trained on web-scale image datasets
  • Goal: Leverage 2D foundation models to obtain generalizable 3D features
  • FeatureNeRF: Unified framework for learning generalizable NeRFs by distilling pre-trained 2D vision foundation models
  • FeatureNeRF extracts deep features from NeRFs as generalizable 3D visual descriptors
  • FeatureNeRF can predict 3D semantic feature volume from single or few images
  • Used for tasks such as semantic keypoint transfer and object part co-segmentation
  • Evaluated with two foundation models: DINO and Latent Diffusion
  • Neural fields have been used to represent various visual signals
  • NeRF is used for photo-realistic novel view synthesis
  • Follow-up methods have been proposed to learn generalizable NeRFs from large-scale multi-view image datasets
  • Foundation models have been transferred to a variety of vision tasks
  • Distillation has been used for model compression and knowledge transfer
  • Semantic correspondence learning aims to find corresponding points between visual observations
  • The authors present the work as the first to address semantic correspondences in both 2D and 3D space with only 2D observations

Method

  • Presents FeatureNeRF, a framework for learning generalizable NeRF from vision foundation models
  • Explains feature distillation process
  • Introduces how to learn internal NeRF features for 3D semantic understanding and downstream applications

Preliminary: generalizable NeRF

  • Neural Radiance Fields (NeRF) consist of two functions: one maps a 3D point to a density, the other maps a 3D point and a viewing direction to a color.
  • NeRF is optimized for a single scene from posed multi-view images by minimizing a reconstruction loss.
  • NeRF can be conditioned on an input image by passing it through an encoder and MLPs to predict appearance and geometry.
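
The density and color outputs described above are typically turned into a pixel value by alpha-compositing samples along a camera ray. The following is a minimal sketch of that standard NeRF quadrature, not the paper's implementation; the function name `composite_along_ray` and its argument layout are illustrative assumptions.

```python
import math


def composite_along_ray(densities, values, deltas):
    """Alpha-composite per-sample values along one ray.

    densities: density sigma_i at each sample (hypothetical input layout)
    values:    per-sample vectors to blend, e.g. RGB colors
    deltas:    distance between consecutive samples along the ray
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    out = [0.0] * len(values[0])
    for sigma, value, delta in zip(densities, values, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution of this sample
        out = [o + weight * v for o, v in zip(out, value)]
        transmittance *= 1.0 - alpha
    return out
```

A reconstruction loss then compares the composited color with the ground-truth pixel, usually as a squared error.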

Feature distillation from foundation models

  • NeRFs can be extended to predict other quantities of interest
  • SemanticNeRF and PanopticNeRF predict segmentation labels
  • This paper aims to transfer knowledge from a pre-trained foundation model to NeRFs
  • A branch is added to output a high-dimensional feature vector
  • FeatureNeRF is trained jointly for image reconstruction and feature distillation
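
Joint training for reconstruction and distillation can be pictured as a weighted sum of two error terms: one between rendered and ground-truth colors, one between rendered features and the teacher's features at the same pixel. This is only a sketch under assumptions: the use of mean squared error for both terms and the `feat_weight` balancing factor are illustrative choices, not values taken from the paper.

```python
def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(rendered_rgb, gt_rgb, rendered_feat, teacher_feat, feat_weight=1.0):
    """Reconstruction loss plus a weighted feature-distillation loss for one pixel.

    feat_weight is a hypothetical hyperparameter balancing the two terms.
    """
    reconstruction = mse(rendered_rgb, gt_rgb)
    distillation = mse(rendered_feat, teacher_feat)
    return reconstruction + feat_weight * distillation
```

Here `teacher_feat` stands for the feature that a frozen foundation model such as DINO produces at the corresponding image location.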

Learning internal NeRF features for 3D semantic understanding

  • Features are distilled from foundation models for 3D semantic understanding.
  • A new coordinate loss is introduced to learn spatial-aware NeRF features.
  • Features from intermediate layers are used as continuous 3D visual descriptors.
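
This summary does not spell out the form of the coordinate loss. One plausible reading, offered purely as an assumption, is that a small head predicts the 3D coordinate of a query point from its intermediate feature and is penalized by the squared error to the true coordinate. The sketch below uses a linear head for brevity; `predict_coordinate`, `weights`, and `bias` are hypothetical names.

```python
def predict_coordinate(feature, weights, bias):
    """Hypothetical linear head mapping a feature vector to a 3D coordinate."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def coordinate_loss(features, points, weights, bias):
    """Mean squared distance between predicted and true 3D coordinates.

    Assumed form only; the actual loss in the paper may differ.
    """
    total = 0.0
    for feature, point in zip(features, points):
        pred = predict_coordinate(feature, weights, bias)
        total += sum((p - x) ** 2 for p, x in zip(pred, point))
    return total / len(points)
```

Under this reading, a feature that cannot recover its own position is penalized, which is one way to encourage spatially aware descriptors.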

Applications of FeatureNeRF

  • Learned 3D semantic NeRF feature function is effective for various downstream applications
  • Simple, zero-shot methodologies are used to validate the proposed representations
  • Cosine similarity is used to measure distance in the feature space
  • The FeatureNeRF model can also be applied to 3D tasks
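
A zero-shot transfer based on cosine similarity can be sketched as a nearest-neighbor search in feature space: each query feature (for example, at a source keypoint) is matched to the candidate feature it is most similar to. The helper names below are illustrative, and the features are plain Python lists rather than outputs of a real model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_by_cosine(query_feats, candidate_feats):
    """For each query feature, return the index of the most similar candidate."""
    matches = []
    for query in query_feats:
        sims = [cosine_similarity(query, cand) for cand in candidate_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches
```

For keypoint transfer, the queries would be features at source keypoints and the candidates features at target pixels or 3D points; for label transfer, each target point can instead take the label of its most similar source point.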

Experiments

Experimental setting

  • Experiments conducted on 6 categories from ShapeNet dataset
  • Evaluated using annotations from KeypointNet, ShapeNet part dataset and PartNet
  • Also trained on the CO3D dataset, with keypoint transfer performance evaluated on the SPair dataset
  • Shapes normalized so that the longest edge of each bounding box has the same length
  • 50 random camera poses from upper hemisphere sampled for training set
  • 50 fixed camera poses used for validation and testing sets
  • RGB images rendered with resolution of 128x128
  • Level-1 annotations used for PartNet
  • Compared to DINO and Latent Diffusion models
  • Patch size of DINO is 8, final feature map has dimension 32x32x384
  • Language prompts used to condition denoising modules are class names from ShapeNet

2D semantic understanding tasks

  • Evaluate FeatureNeRF on 2D keypoint transfer and part segmentation label transfer
  • Report percentage of predicted keypoints whose distances from their corresponding ground truths are below thresholds of (2.5, 5.0, 7.5, 10.0) pixels in the target image
  • Calculate mean intersection over union (mIoU) over every part category for each object class
  • FeatureNeRF distilled with DINO features significantly outperforms other approaches for all 6 categories
  • FeatureNeRF learns a 3D representation from 2D observations
  • FeatureNeRF produces better boundaries and preserves details like small parts
  • Fine-tune the model on the CO3D dataset and evaluate keypoint transfer performance on two categories from the SPair dataset
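
The two metrics described above can be sketched in a few lines. The first reports, per threshold, the percentage of predicted keypoints lying closer than that many pixels to their ground truth; the second averages the intersection over union across part categories. Function names are illustrative, and skipping parts that appear in neither prediction nor ground truth is an assumption about how empty categories are handled.

```python
import math


def keypoint_accuracy(pred_kps, gt_kps, thresholds=(2.5, 5.0, 7.5, 10.0)):
    """Percentage of keypoints whose pixel error is below each threshold."""
    dists = [math.dist(p, g) for p, g in zip(pred_kps, gt_kps)]
    return {t: 100.0 * sum(d < t for d in dists) / len(dists) for t in thresholds}


def mean_iou(pred_labels, gt_labels, part_ids):
    """Mean intersection over union across part categories."""
    ious = []
    for part in part_ids:
        pairs = list(zip(pred_labels, gt_labels))
        inter = sum(p == part and g == part for p, g in pairs)
        union = sum(p == part or g == part for p, g in pairs)
        if union > 0:  # assumption: ignore parts absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Both functions operate on a single image pair; per-class numbers would come from averaging over all evaluated pairs of that class.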

3D semantic understanding tasks

  • Proposed method is validated on 3D semantic understanding tasks
  • Metrics for 3D tasks are 3D versions of their counterparts in 2D tasks
  • Method outperforms all baselines in 3D tasks
  • Method is able to transfer 3D keypoints and segmentation labels accurately

Ablation study

  • Ablated coordinate loss and internal NeRF features for semantic understanding
  • Conducted experiments of 2D parts co-segmentation on Chair and Plane classes
  • Results show design choices boost performance
  • Novel-view synthesis quality was compared with PixelNeRF and found to be comparable

Conclusion

  • Presents FeatureNeRF, a unified framework for learning generalizable NeRFs by distilling 2D vision foundation models
  • Explores use of internal NeRF features as 3D visual descriptors
  • Predicts 3D semantic feature volume from single image
  • Demonstrates effectiveness on 2D/3D keypoints transfer and part co-segmentation
  • Compared with PWarpC, FeatureNeRF performs better at the smaller pixel-distance thresholds
  • Leverages 3D feature volume for editing applications, example of 3D part texture swapping
  • Ablation study shows coordinate loss and use of internal NeRF features boost performance
  • Comparable performance with PixelNeRF on novel-view synthesis