Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Local feature matching between images is difficult, especially when there are significant appearance variations.
  • DeepMatcher is a deep Transformer-based network that captures more human-intuitive and simpler-to-match features.
  • SlimFormer leverages vector-based attention to model relevance among all keypoints and relative position encoding is applied to each SlimFormer.
  • Feature Transition Module (FTM) and Fine Matches Module are used to generate robust and accurate matches.
  • DeepMatcher outperforms the state-of-the-art methods on several benchmarks.

Paper Content

I. introduction

  • Local feature matching is a prerequisite for a variety of computer vision applications
  • Detector-based matching is typically accomplished by detecting and describing a set of sparse keypoints
  • Detector-free matching seeks to establish correspondences directly from original images by extracting visual descriptors on dense grids
  • Transformer has recently attracted considerable interest in computer vision
  • LoFTR updates features by repeatedly interleaving self- and cross-attention layers
  • Intriguing issue arises: could we build a deeper yet compact local feature matcher?
  • Obstacles hinder us from developing a deep local feature matcher for detector-free methods
  • Proposed DeepMatcher to produce more human-intuitive and simpler-to-match features
  • DeepMatcher utilizes a CNN network, Feature Transition Module, Slimming Transformer, and Coarse/Fine Matches Module

A. detector-based methods

  • Conventional pipeline of detector-based matching systems detects two sets of keypoints and describes them with high-dimensional vectors
  • Handcrafted descriptors are fragile when dealing with image pairs with extreme appearance variations
  • Deep learning used to extract robust feature representations
  • SuperPoint builds a dataset of pseudo-ground-truth interest point locations
  • D2-Net makes collected keypoints more stable
  • Nearest neighbor search and robust estimator used to find matches between retrieved keypoints
  • Local feature matching interpreted as graph matching problem involving two sets of features
  • Self- and cross-attention layers in Transformer used to exchange global visual and geometric messages across nodes
  • Sinkhorn algorithm used to generate matches according to soft assignment matrices
  • Matrix multiplication in vanilla Transformer results in quadratic complexity
  • SGMNet and ClusterGNN attempt to ameliorate structure of SuperGlue
  • Detector-based approaches unable to extract repeated keypoints when dealing with large appearance variations
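
The Sinkhorn step mentioned above can be sketched as alternating row and column normalisation of a score matrix. This is a minimal pure-Python illustration for a square matrix only; it leaves out the extra "dustbin" row and column that SuperGlue adds for unmatched keypoints, and the function name and iteration count are illustrative assumptions rather than values from the paper.

```python
import math


def sinkhorn(scores, iterations=100):
    """Turn a square score matrix into an approximately doubly stochastic
    soft assignment by alternating row and column normalisation."""
    peak = max(max(row) for row in scores)
    # Exponentiate (shifted by the max for stability) so every entry is positive.
    P = [[math.exp(s - peak) for s in row] for row in scores]
    for _ in range(iterations):
        # Row step: each row sums to 1.
        for row in P:
            total = sum(row)
            row[:] = [v / total for v in row]
        # Column step: each column sums to 1.
        col_sums = [sum(col) for col in zip(*P)]
        P = [[v / col_sums[j] for j, v in enumerate(row)] for row in P]
    return P


if __name__ == "__main__":
    assignment = sinkhorn([[2.0, 0.0, 1.0], [0.0, 3.0, 0.0], [1.0, 0.0, 2.0]])
    for row in assignment:
        print([round(v, 3) for v in row])
```

Matches would then be read off the largest entries of the resulting matrix.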

B. detector-free methods

  • Detector-free methods generate dense matches directly from original images
  • Earlier detector-free matching research uses CNNs based on correlation or cost volumes
  • Matchformer proposes a human-intuitive extract-and-match scheme
  • QuadTree proposes a novel Transformer structure
  • TopicFM applies a topic-modeling strategy to encode high-level contexts
  • ASpanFormer proposes a Transformer-based detector-free architecture
  • Detector-free methods generate repeatable keypoints in indistinct regions

C. efficient transformer

  • Vanilla Transformer has quadratic memory cost for long sequences.
  • Several approaches have been proposed to improve Transformer efficiency.
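
To make the efficiency argument concrete, the sketch below shows one well-known approach: the kernel-based linear attention with an elu(x) + 1 feature map, which is the Linear Transformer variant used in LoFTR. It is not SlimFormer's vector-based attention. Reordering the matrix products avoids building the N x M weight matrix; the function names are hypothetical and the code uses plain lists to stay dependency-free.

```python
import math


def feature_map(vec):
    # elu(x) + 1, which keeps every feature strictly positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in vec]


def linear_attention(Q, K, V):
    """Compute phi(Q) (phi(K)^T V) with cost linear in the sequence lengths."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    d, dv = len(Kf[0]), len(V[0])
    # Summarise keys and values once: KV is only d x dv.
    KV = [[sum(Kf[m][i] * V[m][j] for m in range(len(Kf))) for j in range(dv)]
          for i in range(d)]
    k_sum = [sum(k[i] for k in Kf) for i in range(d)]
    out = []
    for q in Qf:
        norm = sum(q[i] * k_sum[i] for i in range(d))
        out.append([sum(q[i] * KV[i][j] for i in range(d)) / norm for j in range(dv)])
    return out


def quadratic_attention(Q, K, V):
    """Same kernel attention, but via the explicit N x M weight matrix."""
    Qf = [feature_map(q) for q in Q]
    Kf = [feature_map(k) for k in K]
    out = []
    for q in Qf:
        weights = [sum(a * b for a, b in zip(q, k)) for k in Kf]
        norm = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / norm
                    for j in range(len(V[0]))])
    return out
```

Both functions return the same values; only the order of operations, and therefore the memory and compute cost for long sequences, differs.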

III. methodology

A. overall

  • Humans match images by scanning them back and forth.
  • DeepMatcher is a deep Transformer-based network.
  • Features are extracted from images and fed into SlimFormer.
  • Relative position encoding is used to enhance DeepMatcher’s ability.
  • Layer-scale strategy is used to simulate human behavior.
  • Coarse matches are established and optimized to fine matches.

B. local feature extractor

  • Standard convolutional neural network (CNN) with FPN used to extract coarse-level and fine-level features from image pair
  • Pixel coordinates of keypoints viewed as central position of 8x8 grids in original images
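
As a small illustration of the last point, the sketch below maps each cell of a 1/8-resolution coarse feature map to the centre of its 8x8 patch in the original image. The (x, y) ordering, the half-cell offset, and the function name are assumptions made for this example.

```python
def grid_centers(height, width, stride=8):
    """Return (x, y) pixel centres of the stride x stride cells covering an image,
    in row-major order (the same order as a flattened coarse feature map)."""
    rows, cols = height // stride, width // stride
    return [((c + 0.5) * stride, (r + 0.5) * stride)
            for r in range(rows) for c in range(cols)]


if __name__ == "__main__":
    # A 16 x 24 image yields a 2 x 3 coarse grid, i.e. six keypoints.
    print(grid_centers(16, 24))
```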

C. feature transition module (ftm)

  • Construct graph neural network (GNN) and propose SlimFormer
  • Gap between feature extractor and SlimFormer in terms of context ranges
  • Representing features at multiple scales is critical for discriminating objects
  • Feature Transition Module (FTM) inserted between local feature extractor and SlimFormer to adjust receptive fields of extracted features

D. slimming transformer (slimformer)

  • Flattened enhanced features are used as input sequence for deep feature aggregation
  • Keypoints with features in image pairs are viewed as nodes to construct GNN
  • More observations between images can result in more precise matches
  • Deep feature interaction is essential for local features matching task
  • Position information disappears when Transformer layers grow deeper
  • Humans associate objects by referring to their relative positions
  • Linear Transformer used in LoFTR uses context-agnostic manner to approximate self-attention
  • SlimFormer leverages relative position information and global context information to boost DeepMatcher
  • Vector-based Attention Layer models long-range interactions among pixel tokens
  • Feed-forward Network extracts discriminative features for deep features aggregation
  • Layer Scale Strategy balances original features and enhanced message
  • Relative Position Encoding (RPE) is used to distinguish identical features
  • Self-/Cross-SlimFormer integrates intra-/inter-image information
  • Soft assignment score and Mutual Nearest Neighbor (MNN) criteria are used to derive coarse matches
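
The coarse matching step in the last bullet can be sketched as follows. The summary does not say how the soft assignment score is computed, so this example assumes a dual-softmax over a similarity matrix (as in LoFTR) followed by the Mutual Nearest Neighbor check; the confidence threshold and function names are illustrative.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(scores):
    """Multiply the row-wise and column-wise softmax of a similarity matrix."""
    by_row = [softmax(row) for row in scores]
    by_col = [softmax(list(col)) for col in zip(*scores)]  # by_col[j][i]
    return [[by_row[i][j] * by_col[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]


def mutual_nearest_matches(conf, threshold=0.2):
    """Keep (i, j) only if j is the best column for row i, i is the best row
    for column j, and the confidence exceeds the threshold."""
    best_row_for_col = [max(range(len(conf)), key=lambda i: conf[i][j])
                        for j in range(len(conf[0]))]
    matches = []
    for i, row in enumerate(conf):
        j = max(range(len(row)), key=row.__getitem__)
        if best_row_for_col[j] == i and row[j] > threshold:
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    similarity = [[5.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
    print(mutual_nearest_matches(dual_softmax(similarity)))
```

In the toy input, the third keypoint has no distinctive partner, so its best score stays below the threshold and it is left unmatched.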

F. fine matches module (fmm)

  • Established coarse matches are refined to original picture resolution using a coarse-to-fine module.
  • Match refinement is a combination of classification and regression problems.
  • A network is used to predict the offset and confidence of the predicted coarse matches.

G. loss

  • DeepMatcher generates final dense matches according to soft assignment matrix G and offset ∆
  • Total loss L comprises matching loss Lm, regression loss Lr, and classification loss Lc
  • Matching loss is calculated using focal loss and ground truth matches
  • Regression loss is calculated using predicted matches and ground truth offset
  • Classification loss is calculated using predicted matches with ground truth offset less than predefined threshold
  • Training scheme for ScanNet uses AdamW solver, gradient clipping, and learning rate warm-up
  • Training scheme for MegaDepth uses AdamW solver, linear learning rate warm-up, and learning rate decay
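
For the matching loss, a focal loss over the ground-truth entries of the soft assignment matrix might look like the sketch below. The summary gives neither the focal parameters nor the weights that combine Lm, Lr, and Lc, so `alpha` and `gamma` here are the common focal-loss defaults and should be read as assumptions, as should the function name.

```python
import math


def focal_matching_loss(conf, gt_matches, alpha=0.25, gamma=2.0, eps=1e-6):
    """Average focal loss over ground-truth matches (i, j) of a confidence matrix.

    Confident correct entries are down-weighted by (1 - p) ** gamma, so the
    loss concentrates on ground-truth matches the model is still unsure about.
    """
    if not gt_matches:
        return 0.0
    total = 0.0
    for i, j in gt_matches:
        p = min(max(conf[i][j], eps), 1.0 - eps)  # clamp to keep log finite
        total += -alpha * (1.0 - p) ** gamma * math.log(p)
    return total / len(gt_matches)
```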

B. indoor pose estimation

  • Indoor pose estimation task is hampered by motion blur and significant viewpoint shifts
  • ScanNet dataset used to validate effectiveness of DeepMatcher on indoor pose estimation task
  • Evaluation protocol used to report area under cumulative curve (AUC) of pose errors at thresholds (5°, 10°, 20°)
  • Detector-free methods outperform detector-based methods
  • DeepMatcher and DeepMatcher-L outperform all cutting-edge detector-based and detector-free methods
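
The AUC metric above can be computed from per-pair pose errors roughly as follows. This sketch follows the widely used convention of integrating the recall-versus-error curve with the trapezoidal rule and normalising by the threshold; whether the paper's evaluation script matches it exactly is an assumption, and the function name is illustrative.

```python
def pose_auc(errors, thresholds=(5.0, 10.0, 20.0)):
    """Area under the cumulative pose-error curve, one value per threshold (degrees)."""
    errs = sorted(errors)
    n = len(errs)
    xs = [0.0] + errs                                  # error values
    ys = [0.0] + [(i + 1) / n for i in range(n)]       # recall at each error
    aucs = []
    for t in thresholds:
        k = sum(1 for e in xs if e < t)                # points strictly below t
        px = xs[:k] + [t]                              # close the curve at t
        py = ys[:k] + [ys[k - 1]]
        area = sum((px[i + 1] - px[i]) * (py[i + 1] + py[i]) / 2.0
                   for i in range(len(px) - 1))
        aucs.append(area / t)                          # normalise to [0, 1]
    return aucs


if __name__ == "__main__":
    print(pose_auc([1.0, 8.0, 15.0, 30.0]))
```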

Local features

  • DeepMatcher-L outperforms SuperPoint and DenseGAP in terms of AUC@5, 10, 20
  • DeepMatcher-L outperforms LoFTR by 5.26%, 5.45%, 4.87%
  • DeepMatcher-L outperforms ASpanFormer by 1.72%, 0.25%
  • DeepMatcher-L uses 77.65% of the GFLOPs of ASpanFormer and runs 26.95% faster

C. outdoor pose estimation

  • Outdoor pose estimation is a challenging task due to 3D geometry, illumination and viewpoint changes.
  • An outdoor pose estimation experiment was conducted using the MegaDepth dataset which contains 1M images from 196 scenes.
  • DeepMatcher families outperformed other methods in all evaluation metrics.

D. image matching

  • Image matching plays an important role in computer science applications
  • Experiment conducted on HPatches dataset
  • Mean matching accuracy (MMA) used as metric
  • DeepMatcher families outperform detector-free methods
  • DeepMatcher yields inferior performance at low thresholds, better performance at higher thresholds
  • DeepMatcher exhibits superior robustness to viewpoint variations
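
A minimal sketch of the MMA metric, assuming the usual HPatches setup in which each image pair comes with a ground-truth homography: a match counts as correct when the first point, warped by that homography, lands within a pixel threshold of the second point. The data layout and function names are assumptions made for this example.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def matching_accuracy(H_gt, matches, threshold):
    """Fraction of matches ((x0, y0), (x1, y1)) with reprojection error <= threshold."""
    if not matches:
        return 0.0
    correct = 0
    for p0, p1 in matches:
        wx, wy = warp_point(H_gt, p0)
        if math.hypot(wx - p1[0], wy - p1[1]) <= threshold:
            correct += 1
    return correct / len(matches)


def mean_matching_accuracy(pairs, threshold):
    """Average the per-pair accuracy over a list of (H_gt, matches) entries."""
    return sum(matching_accuracy(H, m, threshold) for H, m in pairs) / len(pairs)
```

Evaluating `mean_matching_accuracy` over a range of thresholds gives the accuracy-versus-threshold curve that the low-threshold and high-threshold comparison above refers to.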

E. homography estimation

  • Experiment conducted to evaluate performance of DeepMatcher
  • Dataset used: HPatches
  • Evaluation Protocol: Corner Correctness Metric (CCM)
  • Results: DeepMatcher outperforms other methods when handling extreme viewpoint changes and illumination changes
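
The sketch below reads the Corner Correctness Metric as the share of image pairs whose mean corner error, measured between the four image corners warped by the estimated and by the ground-truth homography, stays within a pixel threshold. The corner convention (0 to width - 1), the threshold handling, and the function names are assumptions made for this example.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def mean_corner_error(H_est, H_gt, width, height):
    """Mean distance between the image corners under the two homographies."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    errors = []
    for corner in corners:
        ex, ey = warp_point(H_est, corner)
        gx, gy = warp_point(H_gt, corner)
        errors.append(math.hypot(ex - gx, ey - gy))
    return sum(errors) / len(errors)


def corner_correctness(pairs, width, height, threshold):
    """Fraction of (H_est, H_gt) pairs whose mean corner error is <= threshold."""
    hits = sum(1 for H_est, H_gt in pairs
               if mean_corner_error(H_est, H_gt, width, height) <= threshold)
    return hits / len(pairs)
```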

F. understanding deepmatcher

  • DeepMatcher-L achieves dense and accurate matching performance
  • Interleaving SlimFormer can effectively integrate intra-/inter-image information
  • DeepMatcher families have competitive inference speed and minimum computational complexity of the attention layer
  • SlimFormer pays attention to prominent keypoints at object boundaries

G. ablation study

  • DeepMatcher performance is improved by all components
  • Layer-scale strategy simulates human behaviour
  • Relative position encoding is more conducive to scene parsing
  • FMM achieves superior performance compared to coarse-to-fine module