Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Finding correspondences between images of the same object is important for understanding its geometry.
- Recent years have seen progress in this area due to deep learning based local image features and learnable matchers.
- Learnable matchers often underperform when there is only small regions of co-visibility between image pairs.
- We propose a Learnable Feature Matching framework that uses models based on graph neural networks and integrates 3D signals to boost correspondence estimation.
- We experiment with two different 3D signals and evaluate our method on large-scale datasets.
- We observe strong feature matching improvements compared to 2D-only methods.
- The improved correspondences lead to higher relative posing accuracy for in-the-wild image pairs.
Paper Content
Introduction
- Matching images of the same object or scene is a core task in computer vision
- Many applications depend on this task, from augmented/virtual reality to novel view synthesis
- Traditionally, hand-crafted techniques such as SIFT or SURF were used to detect keypoints and describe them
- Recently, deep learning based techniques have been proposed to replace hand-crafted methods
- Learnable feature matcher models were proposed to replace hand-crafted matching heuristics
- These methods still encounter difficulties in many cases
- We propose to address these difficulties by integrating estimated 3D signals to help guide local feature matching
Related work
- Traditional local feature extraction and matching relies on finding low-level patterns and uses nearest neighbor matching
- Deep learning based local features are end-to-end learned and can be trained with different levels of supervision
- Learnable feature matching uses graph neural networks and optimal transport optimization to associate sparse local features
- Feature matching leveraging 3D information can improve the quality of extracted local features and disambiguate occlusions
Lfm-3d
Background
- Learnable sparse feature matching uses additional information during the matching process, such as keypoint locations and global image context
- SuperGlue is a graph neural network model architecture that predicts correspondences between a pair of sparse keypoint sets
- Normalized Object Coordinate Space (NOCS) is a 3D space defined over the unit cube that can be used to represent the geometry of all instances of an object class in a canonical orientation
- Monocular Depth Estimation (MDE) is the task of predicting the depth value of each pixel given a single RGB image as the input
- MDE provides sufficient 3D signals to boost correspondence estimation
Learning to match using 3d signals
- Method to integrate 3D signals into feature matching
- 3D signals are extracted as a dense feature map
- Local features have information about 2D pixel coordinates, local image content, and predicted 3D signal
- 3D signals are used to improve each keypoint’s matching representation
- Positional encoding is used to encode 3D signals before passing them onto the MLP
- Model is trained on synthetic homography and outdoor scene multi-view datasets
- Model is finetuned on object-centric data augmented with 3D estimates
Experiments
Datasets
- Experiments focus on object-centric image pairs taken from varying angles
- Use synthetic renderings of 3D models for training, evaluate on synthetic and real-world image pairs
- Use Google Scanned Objects dataset to train learnable matcher components
- Use Objectron dataset for evaluation purposes
- Sample image pairs at wide baselines to evaluate matcher performance in challenging scenarios
Acquiring estimated 3d signals
- We provide experiments on two types of 3D estimates: NOCS maps and MDE inverse depth maps.
- We train a new NOCS map regression network for the shoe object class.
- We use an off-the-shelf, class-agnostic monocular depth estimation model to predict depth maps for the camera object class.
- We use the publicly available DPT model to predict depth maps.
Experimental setup
- Focus experiments on two object classes: shoes and cameras
- Report results on match-level statistics for Google Scanned Objects dataset
- Calculate precision/recall at match level
- Evaluate correspondences on downstream task of relative pose estimation
- Compare LFM-3D to popular 2D-based sparse matching techniques
- Architecture and training details: SuperPoint local features, MLP 3D architecture, initial learning rate of 8x10^-5, trained for 750000 iterations
Results
- Feature-level matching & model ablations used to compare LFM-3D against model variants
- Finetuning, usage of 3D signals, and improved positional encoding used
- LFM-3D outperforms all baseline methods
- Relative pose estimation evaluated on Objectron dataset
- LFM-3D outperforms conventional and state-of-the-art sparse feature matching techniques
- Qualitative results show improved correspondences between images with small overlapping visible regions
- Limitations and failure scenarios observed with feature-less objects, irregular geometries, and out-of-distribution geometries
Conclusions
- Novel method for local feature matching integrates estimated 3D signals
- 3D signal incorporation matters, improvements observed with positional encodings
- 3D information improves estimation of image correspondences
- Model weights from github used
- Compared to other baseline methods for relative pose estimation
- Official SP+SG baseline, SP+SG w/ NOCS filtering, LoFTR, LFM-3D
- LFM-3D outperforms all baselines for 3 challenging Acc@N error thresholds
- LFM-3D benefits from additional keypoints more than 2D-only baselines
- Qualitative results show LFM-3D performs well across shapes, textures, lighting
- Ablation study over number of extracted keypoints