Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent works have shown promising results on novel view synthesis from single or few images.
- Models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing.
- Proposed framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models.
- FeatureNeRF maps 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks.
- Evaluated on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation.
- Extensive experiments demonstrate effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor.
Paper Content
Introduction
- Neural fields are used to represent visual signals
- Neural Radiance Fields (NeRF) can synthesize views from dense input images
- Followup works reduce dependency on dense inputs and generalize NeRF to unseen objects
- Question: Can NeRF be adapted to learn 3D representations for general 3D applications?
- Recent years have seen rise of vision foundation models that are pre-trained on web-scale image datasets
- Goal: Leverage 2D foundation models to obtain generalizable 3D features
- FeatureNeRF: Unified framework for learning generalizable NeRFs from distilling pretrained 2D vision foundation models
- FeatureNeRF extracts deep features from NeRFs as generalizable 3D visual descriptors
- FeatureNeRF can predict 3D semantic feature volume from single or few images
- Used for tasks such as semantic keypoint transfer and object part co-segmentation
- Evaluated with two foundation models: DINO and Latent Diffusion
Related work
- Neural fields have been used to represent various visual signals
- NeRF is used for photo-realistic novel view synthesis
- Follow-up methods have been proposed to learn generalizable NeRFs from large-scale multi-view image datasets
- Foundation models have been used to transfer to various vision tasks
- Distillation has been used for model compression and knowledge transfer
- Semantic correspondences learning aims to find corresponding points between visual observations
- Our work is the first to address semantic correspondences in both 2D and 3D space with only 2D observations
Method
- Presents FeatureNeRF, a framework for learning generalizable NeRF from vision foundation models
- Explains feature distillation process
- Introduces how to learn internal NeRF features for 3D semantic understanding and downstream applications
Preliminary: generalizable nerf
- Neural Radiance Fields (NeRF) consists of two functions that map 3D points to density and color.
- NeRF is optimized to a single scene with multi-view posed images by minimizing a reconstruction loss.
- NeRF can be conditioned on an input image by passing it through an encoder and MLPs to predict appearance and geometry.
Feature distillation from foundation models
- NeRFs can be extended to predict other quantities of interest
- SemanticNeRF and PanopticNeRF predict segmentation labels
- This paper aims to transfer knowledge from a pre-trained foundation model to NeRFs
- A branch is added to output a high-dimensional feature vector
- FeatureNeRF is trained jointly for image reconstruction and feature distillation
Learning internal nerf features for 3d semantic understanding
- Features are distilled from foundation models for 3D semantic understanding.
- A new coordinate loss is introduced to learn spatial-aware NeRF features.
- Features from intermediate layers are used as continuous 3D visual descriptors.
Applications of featurenerf
- Learned 3D semantic NeRF feature function is effective for various downstream applications
- Simple, zero-shot methodologies are used to validate the proposed representations
- Cosine similarity is used to measure distance in the feature space
- FeatureNeRF model can be leveraged to 3D tasks
Experiments
Experimental setting
- Experiments conducted on 6 categories from ShapeNet dataset
- Evaluated using annotations from KeypointNet, ShapeNet part dataset and PartNet
- Trained on CO3D dataset and evaluated keypoint transfer performance on Spair dataset
- Shapes normalized so longest edges of bounding box are equal
- 50 random camera poses from upper hemisphere sampled for training set
- 50 fixed camera poses used for validation and testing sets
- RGB images rendered with resolution of 128x128
- Level-1 annotations used for PartNet
- Compared to DINO and Latent Diffusion models
- Patch size of DINO is 8, final feature map has dimension 32x32x384
- Language prompts used to condition denoising modules are class names from ShapeNet
2d semantic understanding tasks
- Evaluate FeatureNeRF on tasks of 2D keypoints transfer and part segmentation labels transfer
- Report percentage of predicted keypoints whose distances from their corresponding ground truths are below thresholds of (2.5, 5.0, 7.5, 10.0) pixels in the target image
- Calculate mean intersection over union (mIoU) over every part category for each object class
- FeatureNeRF distilled with DINO features significantly outperforms other approaches for all 6 categories
- FeatureNeRF learns a 3D representation from 2D observations
- FeatureNeRF produces better boundaries and preserves details like small parts
- Fine-tune model on CO3D datasets and evaluate keypoints transfer performance on two categories from the SPair dataset
3d semantic understanding tasks
- Proposed method is validated on 3D semantic understanding tasks
- Metrics for 3D tasks are 3D versions of their counterparts in 2D tasks
- Method outperforms all baselines in 3D tasks
- Method is able to transfer 3D keypoints and segmentation labels accurately
Ablation study
- Ablated coordinate loss and internal NeRF features for semantic understanding
- Conducted experiments of 2D parts co-segmentation on Chair and Plane classes
- Results show design choices boost performance
- Compared performance of novel-view synthesis with PixelNeRF, found comparable performance
Conclusion
- Presents FeatureNeRF, a unified framework for learning generalizable NeRFs from distilling 2D vision foundation models
- Explores use of internal NeRF features as 3D visual descriptors
- Predicts 3D semantic feature volume from single image
- Demonstrates effectiveness on 2D/3D keypoints transfer and part co-segmentation
- Compares FeatureNeRF to PWarpC, outperforms under smaller pixel distances
- Leverages 3D feature volume for editing applications, example of 3D part texture swapping
- Ablation study shows coordinate loss and use of internal NeRF features boost performance
- Comparable performance with PixelNeRF on novel-view synthesis