Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Finding correspondences between images of the same object is important for understanding its geometry.
  • Recent years have seen progress in this area due to deep learning based local image features and learnable matchers.
  • Learnable matchers often underperform when there is only small regions of co-visibility between image pairs.
  • We propose a Learnable Feature Matching framework that uses models based on graph neural networks and integrates 3D signals to boost correspondence estimation.
  • We experiment with two different 3D signals and evaluate our method on large-scale datasets.
  • We observe strong feature matching improvements compared to 2D-only methods.
  • The improved correspondences lead to higher relative posing accuracy for in-the-wild image pairs.

Paper Content

Introduction

  • Matching images of the same object or scene is a core task in computer vision
  • Many applications depend on this task, from augmented/virtual reality to novel view synthesis
  • Traditionally, hand-crafted techniques such as SIFT or SURF were used to detect keypoints and describe them
  • Recently, deep learning based techniques have been proposed to replace hand-crafted methods
  • Learnable feature matcher models were proposed to replace hand-crafted matching heuristics
  • These methods still encounter difficulties in many cases
  • We propose to address these difficulties by integrating estimated 3D signals to help guide local feature matching
  • Traditional local feature extraction and matching relies on finding low-level patterns and uses nearest neighbor matching
  • Deep learning based local features are end-to-end learned and can be trained with different levels of supervision
  • Learnable feature matching uses graph neural networks and optimal transport optimization to associate sparse local features
  • Feature matching leveraging 3D information can improve the quality of extracted local features and disambiguate occlusions

Lfm-3d

Background

  • Learnable sparse feature matching uses additional information during the matching process, such as keypoint locations and global image context
  • SuperGlue is a graph neural network model architecture that predicts correspondences between a pair of sparse keypoint sets
  • Normalized Object Coordinate Space (NOCS) is a 3D space defined over the unit cube that can be used to represent the geometry of all instances of an object class in a canonical orientation
  • Monocular Depth Estimation (MDE) is the task of predicting the depth value of each pixel given a single RGB image as the input
  • MDE provides sufficient 3D signals to boost correspondence estimation

Learning to match using 3d signals

  • Method to integrate 3D signals into feature matching
  • 3D signals are extracted as a dense feature map
  • Local features have information about 2D pixel coordinates, local image content, and predicted 3D signal
  • 3D signals are used to improve each keypoint’s matching representation
  • Positional encoding is used to encode 3D signals before passing them onto the MLP
  • Model is trained on synthetic homography and outdoor scene multi-view datasets
  • Model is finetuned on object-centric data augmented with 3D estimates

Experiments

Datasets

  • Experiments focus on object-centric image pairs taken from varying angles
  • Use synthetic renderings of 3D models for training, evaluate on synthetic and real-world image pairs
  • Use Google Scanned Objects dataset to train learnable matcher components
  • Use Objectron dataset for evaluation purposes
  • Sample image pairs at wide baselines to evaluate matcher performance in challenging scenarios

Acquiring estimated 3d signals

  • We provide experiments on two types of 3D estimates: NOCS maps and MDE inverse depth maps.
  • We train a new NOCS map regression network for the shoe object class.
  • We use an off-the-shelf, class-agnostic monocular depth estimation model to predict depth maps for the camera object class.
  • We use the publicly available DPT model to predict depth maps.

Experimental setup

  • Focus experiments on two object classes: shoes and cameras
  • Report results on match-level statistics for Google Scanned Objects dataset
  • Calculate precision/recall at match level
  • Evaluate correspondences on downstream task of relative pose estimation
  • Compare LFM-3D to popular 2D-based sparse matching techniques
  • Architecture and training details: SuperPoint local features, MLP 3D architecture, initial learning rate of 8x10^-5, trained for 750000 iterations

Results

  • Feature-level matching & model ablations used to compare LFM-3D against model variants
  • Finetuning, usage of 3D signals, and improved positional encoding used
  • LFM-3D outperforms all baseline methods
  • Relative pose estimation evaluated on Objectron dataset
  • LFM-3D outperforms conventional and state-of-the-art sparse feature matching techniques
  • Qualitative results show improved correspondences between images with small overlapping visible regions
  • Limitations and failure scenarios observed with feature-less objects, irregular geometries, and out-of-distribution geometries

Conclusions

  • Novel method for local feature matching integrates estimated 3D signals
  • 3D signal incorporation matters, improvements observed with positional encodings
  • 3D information improves estimation of image correspondences
  • Model weights from github used
  • Compared to other baseline methods for relative pose estimation
  • Official SP+SG baseline, SP+SG w/ NOCS filtering, LoFTR, LFM-3D
  • LFM-3D outperforms all baselines for 3 challenging Acc@N error thresholds
  • LFM-3D benefits from additional keypoints more than 2D-only baselines
  • Qualitative results show LFM-3D performs well across shapes, textures, lighting
  • Ablation study over number of extracted keypoints