Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Local feature matching between images is difficult, especially when there are significant appearance variations.
DeepMatcher is a deep Transformer-based network that captures more human-intuitive and simpler-to-match features.
SlimFormer leverages vector-based attention to model relevance among all keypoints, and relative position encoding is applied to each SlimFormer.
A Feature Transition Module (FTM) and a Fine Matches Module (FMM) are used to generate robust and accurate matches.
DeepMatcher outperforms the state-of-the-art methods on several benchmarks.

Paper Content

I. introduction

Local feature matching is a prerequisite for a variety of computer vision applications
Detector-based matching is typically accomplished by detecting and describing a set of sparse keypoints
Detector-free matching seeks to establish correspondences directly from original images by extracting visual descriptors on dense grids
Transformer has recently attracted considerable interest in computer vision
LoFTR updates features by repeatedly interleaving self- and cross-attention layers
An intriguing question arises: can we build a deeper yet compact local feature matcher?
Several obstacles hinder the development of a deep local feature matcher for detector-free methods
DeepMatcher is proposed to produce more human-intuitive and simpler-to-match features
DeepMatcher utilizes a CNN network, Feature Transition Module, Slimming Transformer, and Coarse/Fine Matches Module

II. related work

A. detector-based methods

Conventional pipeline of detector-based matching systems detects two sets of keypoints and describes them with high-dimensional vectors
Handcrafted descriptors are fragile when dealing with image pairs with extreme appearance variations
Deep learning used to extract robust feature representations
SuperPoint builds a dataset of pseudo-ground-truth interest point locations
D2-Net makes collected keypoints more stable
Nearest neighbor search and robust estimator used to find matches between retrieved keypoints
Local feature matching interpreted as graph matching problem involving two sets of features
Self- and cross-attention layers in Transformer used to exchange global visual and geometric messages across nodes
Sinkhorn algorithm used to generate matches according to soft assignment matrices
Matrix multiplication in vanilla Transformer results in quadratic complexity
SGMNet and ClusterGNN attempt to improve the structure of SuperGlue
Detector-based approaches are unable to extract repeatable keypoints when dealing with large appearance variations
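
The Sinkhorn step mentioned above can be illustrated with a minimal pure-Python sketch. It assumes a small square score matrix and leaves out the "dustbin" row and column that SuperGlue-style matchers add for unmatched keypoints. The function name `sinkhorn`, the iteration count, and the toy scores are hypothetical and not taken from the paper.

```python
import math

def sinkhorn(scores, n_iters=50):
    """Alternately normalise the rows and columns of exp(scores)."""
    # Assumption: square score matrix, no dustbin row/column.
    p = [[math.exp(s) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row normalisation: every row sums to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Column normalisation: every column sums to 1.
        col_sums = [sum(col) for col in zip(*p)]
        p = [[v / c for v, c in zip(row, col_sums)] for row in p]
    return p

# Toy similarity scores between three keypoints in each image.
scores = [[4.0, 0.5, 0.1],
          [0.3, 3.5, 0.2],
          [0.2, 0.4, 5.0]]
assignment = sinkhorn(scores)

# Read off a match per row as the column with the largest soft assignment.
matches = [max(range(len(row)), key=row.__getitem__) for row in assignment]
print(matches)
```

The result is an approximately doubly stochastic matrix, so each keypoint distributes roughly one unit of assignment mass across the other image.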

B. detector-free methods

Detector-free methods generate dense matches directly from original images
Earlier detector-free matching research uses CNNs based on correlation or cost volumes
Matchformer proposes a human-intuitive extract-and-match scheme
QuadTree proposes a novel Transformer structure
TopicFM applies a topic-modeling strategy to encode high-level contexts
ASpanFormer proposes a Transformer-based detector-free architecture
Detector-free methods generate repeatable keypoints in indistinct regions

C. efficient transformer

Vanilla Transformer has quadratic memory cost for long sequences.
Several approaches have been proposed to improve Transformer efficiency.
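
To make the quadratic cost concrete, the sketch below compares a kernel attention that builds the full N x N weight matrix with a reordered "linear" version that only stores a D x D summary of keys and values. The feature map elu(x) + 1 is an assumption borrowed from the common linear-attention formulation. This is not softmax attention and it is not SlimFormer. All function names and toy values are hypothetical.

```python
import math

def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return x + 1.0 if x > 0 else math.exp(x)

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quadratic_attention(q, k, v):
    """Builds the full N x N weight matrix: memory grows as N^2."""
    fq = [[phi(x) for x in row] for row in q]
    fk = [[phi(x) for x in row] for row in k]
    weights = matmul(fq, [list(col) for col in zip(*fk)])  # N x N
    weights = [[w / sum(row) for w in row] for row in weights]
    return matmul(weights, v)

def linear_attention(q, k, v):
    """Same output, but keys and values are summarised first (D x D_v)."""
    fq = [[phi(x) for x in row] for row in q]
    fk = [[phi(x) for x in row] for row in k]
    kv = matmul([list(col) for col in zip(*fk)], v)  # D x D_v summary
    k_sum = [sum(col) for col in zip(*fk)]           # length-D normaliser
    out = []
    for row in fq:
        norm = sum(a * b for a, b in zip(row, k_sum))
        out.append([val / norm for val in matmul([row], kv)[0]])
    return out

# Three tokens with two feature channels each (toy values).
q = [[0.2, -1.0], [1.5, 0.3], [-0.4, 0.8]]
k = [[1.0, 0.0], [-0.5, 2.0], [0.3, 0.3]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full = quadratic_attention(q, k, v)
fast = linear_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs, which is what removes the N x N intermediate.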

III. methodology

A. overall

Humans match images by scanning them back and forth.
DeepMatcher is a deep Transformer-based network.
Features are extracted from images and fed into SlimFormer.
Relative position encoding is used so that DeepMatcher can distinguish otherwise identical features.
A layer-scale strategy, which balances original features against the enhanced message, is used to simulate human behavior.
Coarse matches are established and optimized to fine matches.
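
The layer-scale strategy mentioned above is not spelled out in this summary. A minimal sketch, assuming the common residual form in which the aggregated message is scaled per channel by a small learnable factor before being added to the original feature, could look like this. The name `layer_scale_residual` and the value 0.1 are hypothetical.

```python
def layer_scale_residual(feature, message, gamma):
    """Return feature + gamma * message, channel by channel."""
    # Assumption: gamma holds one learnable scale per channel.
    return [f + g * m for f, m, g in zip(feature, message, gamma)]

feature = [1.0, 2.0]     # original keypoint feature
message = [10.0, -10.0]  # message aggregated by an attention layer
gamma = [0.1, 0.1]       # hypothetical small initial scale
updated = layer_scale_residual(feature, message, gamma)
print(updated)
```

A small scale lets each layer adjust the original feature gradually instead of overwriting it, which is one way to read "balancing" the two terms.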

B. local feature extractor

Standard convolutional neural network (CNN) with FPN used to extract coarse-level and fine-level features from image pair
Pixel coordinates of keypoints viewed as central position of 8x8 grids in original images
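
The mapping from a coarse-level keypoint to pixel coordinates can be sketched as follows. The sketch assumes a row-major flattened coarse feature map and a convention in which the centre of a cell lies half a stride from its top-left corner. The function name and the 640-pixel image width are hypothetical.

```python
def coarse_index_to_pixel(index, coarse_width, stride=8):
    """Map a flattened coarse-grid index to the centre of its 8x8 cell."""
    row, col = divmod(index, coarse_width)
    # Assumption: the cell centre is offset by stride / 2 from the cell corner.
    x = col * stride + stride / 2
    y = row * stride + stride / 2
    return x, y

# A 640-pixel-wide image yields 80 coarse cells per row at stride 8.
first = coarse_index_to_pixel(0, coarse_width=80)
second_row = coarse_index_to_pixel(81, coarse_width=80)
print(first, second_row)
```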

C. feature transition module (ftm)

Construct graph neural network (GNN) and propose SlimFormer
Gap between feature extractor and SlimFormer in terms of context ranges
Representing features at multiple scales is critical for discriminating objects
Feature Transition Module (FTM) inserted between local feature extractor and SlimFormer to adjust receptive fields of extracted features

D. slimming transformer (slimformer)

Flattened enhanced features are used as input sequence for deep feature aggregation
Keypoints with features in image pairs are viewed as nodes to construct GNN
More observations between images can result in more precise matches
Deep feature interaction is essential for the local feature matching task
Position information disappears when Transformer layers grow deeper
Humans associate objects by referring to their relative positions
Linear Transformer used in LoFTR approximates self-attention in a context-agnostic manner
SlimFormer leverages relative position information and global context information to boost DeepMatcher
Vector-based Attention Layer models long-range interactions among pixel tokens
Feed-forward Network extracts discriminative features for deep feature aggregation
Layer Scale Strategy balances original features and enhanced message
Relative Position Encoding (RPE) is used to distinguish identical features
Self-/Cross-SlimFormer integrates intra-/inter-image information
Soft assignment score and Mutual Nearest Neighbor (MNN) criteria are used to derive coarse matches
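
The coarse matching step above can be sketched in a few lines. The summary does not give the exact score function, so this sketch assumes a dual-softmax score (row softmax times column softmax, as in LoFTR-style matchers) followed by the mutual nearest neighbor check. The function names, the threshold of 0.2, and the toy similarities are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(sim):
    """Soft assignment score: row-wise softmax times column-wise softmax."""
    rows = [softmax(row) for row in sim]
    cols = [softmax(list(col)) for col in zip(*sim)]  # cols[j][i]
    return [[rows[i][j] * cols[j][i] for j in range(len(cols))]
            for i in range(len(rows))]

def mutual_nearest_neighbors(score, threshold=0.2):
    """Keep (i, j) only if each is the other's best match and score is high."""
    matches = []
    for i, row in enumerate(score):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(score)), key=lambda r: score[r][j])
        if best_i == i and row[j] >= threshold:
            matches.append((i, j, row[j]))
    return matches

# Two confident correspondences and one ambiguous keypoint (last row).
sim = [[9.0, 1.0, 0.5],
       [1.0, 8.0, 0.7],
       [0.9, 1.1, 1.0]]
coarse_matches = mutual_nearest_neighbors(dual_softmax(sim))
pairs = [(i, j) for i, j, _ in coarse_matches]
print(pairs)
```

The ambiguous third keypoint is mutually nearest to the third column, but its score falls below the threshold, so it is dropped.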

F. fine matches module (fmm)

Established coarse matches are refined to original picture resolution using a coarse-to-fine module.
Match refinement is a combination of classification and regression problems.
A network is used to predict the offset and confidence of the predicted coarse matches.

G. loss

DeepMatcher generates final dense matches according to soft assignment matrix G and offset ∆
Total loss L comprises matching loss Lm, regression loss Lr, and classification loss Lc
Matching loss is calculated using focal loss and ground truth matches
Regression loss is calculated using predicted matches and ground truth offset
Classification loss is calculated using predicted matches whose ground truth offset is below a predefined threshold
Training scheme for ScanNet uses AdamW solver, gradient clipping, and learning rate warm-up
Training scheme for MegaDepth uses AdamW solver, linear learning rate warm-up, and learning rate decay
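
As an illustration of the matching loss in this section, here is a minimal sketch of a focal loss evaluated at the ground truth match positions of a soft assignment matrix. The weighting `alpha`, the focusing parameter `gamma`, and the function name are assumptions. The summary does not state the values used in the paper.

```python
import math

def focal_matching_loss(assignment, gt_matches, alpha=0.25, gamma=2.0, eps=1e-6):
    """Mean focal loss over ground truth (row, column) match positions."""
    losses = []
    for i, j in gt_matches:
        # Clamp the probability so the logarithm stays finite.
        p = min(max(assignment[i][j], eps), 1.0 - eps)
        losses.append(-alpha * (1.0 - p) ** gamma * math.log(p))
    return sum(losses) / len(losses)

gt = [(0, 0), (1, 1)]
confident = [[0.9, 0.1], [0.1, 0.9]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
loss_confident = focal_matching_loss(confident, gt)
loss_uncertain = focal_matching_loss(uncertain, gt)
print(loss_confident, loss_uncertain)
```

The factor (1 - p)^gamma shrinks the contribution of matches that are already predicted with high confidence, so training focuses on the hard ones.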

B. indoor pose estimation

Indoor pose estimation task is hampered by motion blur and significant viewpoint shifts
ScanNet dataset used to validate effectiveness of DeepMatcher on indoor pose estimation task
Evaluation protocol reports the area under the cumulative curve (AUC) of pose errors at thresholds (5°, 10°, 20°)
Detector-free methods outperform detector-based methods
DeepMatcher and DeepMatcher-L outperform all cutting-edge detector-based and detector-free methods
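
The AUC metric above can be sketched as follows. The sketch follows a common trapezoidal convention: sort the per-pair pose errors, build the cumulative recall curve, integrate it up to each threshold, and normalise by the threshold. Whether the paper's evaluation script matches this exactly is an assumption, and the error values are made up for illustration.

```python
import bisect

def pose_auc(errors, thresholds):
    """Area under the cumulative error curve, normalised per threshold."""
    errors = sorted(errors)
    n = len(errors)
    xs = [0.0] + errors
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    aucs = []
    for t in thresholds:
        last = bisect.bisect_left(xs, t)
        # Truncate the curve at the threshold and hold the last recall value.
        x = xs[:last] + [t]
        y = ys[:last] + [ys[last - 1]]
        area = sum((x[k + 1] - x[k]) * (y[k + 1] + y[k]) / 2
                   for k in range(len(x) - 1))
        aucs.append(area / t)
    return aucs

# Hypothetical pose errors in degrees for four image pairs.
pose_errors = [1.0, 2.0, 3.0, 50.0]
auc_5, auc_10, auc_20 = pose_auc(pose_errors, [5.0, 10.0, 20.0])
print(auc_5, auc_10, auc_20)
```

A method whose errors are all zero scores 1.0 at every threshold, and a method whose errors all exceed the threshold scores 0.0.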

Local features

DeepMatcher-L outperforms SuperPoint and DenseGAP in terms of AUC@5, 10, 20
DeepMatcher-L outperforms LoFTR by 5.26%, 5.45%, 4.87%
DeepMatcher-L outperforms ASpanFormer by 1.72%, 0.25%
DeepMatcher-L consumes 77.65% of ASpanFormer's GFLOPs and is 26.95% faster

C. outdoor pose estimation

Outdoor pose estimation is a challenging task due to 3D geometry, illumination and viewpoint changes.
An outdoor pose estimation experiment was conducted using the MegaDepth dataset, which contains 1M images from 196 scenes.
DeepMatcher families outperformed other methods in all evaluation metrics.

D. image matching

Image matching plays an important role in computer science applications
Experiment conducted on HPatches dataset
Mean matching accuracy (MMA) used as metric
DeepMatcher families outperform detector-free methods
DeepMatcher yields inferior performance at low thresholds, better performance at higher thresholds
DeepMatcher exhibits superior robustness to viewpoint variations
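
The MMA metric used above can be sketched like this: for each image pair, project the matched points with the ground truth homography, count the matches whose reprojection error is within a pixel threshold, and average that fraction over all pairs. The function names, the toy homography, and the toy matches are hypothetical.

```python
import math

def project(h, x, y):
    """Apply a 3x3 homography (list of lists) to a 2D point."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def matching_accuracy(matches, h_gt, threshold):
    """Fraction of matches whose reprojection error is within the threshold."""
    correct = 0
    for (x0, y0), (x1, y1) in matches:
        px, py = project(h_gt, x0, y0)
        if math.hypot(px - x1, py - y1) <= threshold:
            correct += 1
    return correct / len(matches)

def mean_matching_accuracy(pairs, thresholds):
    """Average the per-pair accuracy over all image pairs, per threshold."""
    return {t: sum(matching_accuracy(m, h, t) for m, h in pairs) / len(pairs)
            for t in thresholds}

# One image pair related by a pure 10-pixel shift in x (toy example).
h_shift = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_matches = [((0.0, 0.0), (10.0, 0.0)),   # exact
               ((5.0, 5.0), (17.0, 5.0)),   # 2 px off
               ((1.0, 1.0), (20.0, 1.0))]   # 9 px off
mma = mean_matching_accuracy([(toy_matches, h_shift)], [1, 3, 10])
print(mma)
```

Sweeping the threshold from small to large values produces the accuracy curve on which low-threshold and high-threshold behavior is compared.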

E. homography estimation

Experiment conducted to evaluate performance of DeepMatcher on homography estimation
Dataset used: HPatches
Evaluation Protocol: Corner Correctness Metric (CCM)
Results: DeepMatcher outperforms other methods when handling extreme viewpoint changes and illumination changes
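
A corner-based correctness check of the kind named above can be sketched as follows. The sketch assumes the usual reading of the metric: warp the four image corners with the estimated and the ground truth homography, average the four distances, and call the estimate correct if the mean is within a pixel threshold. The function names, image size, and thresholds are hypothetical.

```python
import math

def warp(h, x, y):
    """Apply a 3x3 homography (list of lists) to a 2D point."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def mean_corner_error(h_est, h_gt, width, height):
    """Mean distance between corners warped by the two homographies."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    errors = []
    for x, y in corners:
        ex, ey = warp(h_est, x, y)
        gx, gy = warp(h_gt, x, y)
        errors.append(math.hypot(ex - gx, ey - gy))
    return sum(errors) / len(errors)

def corner_correct(h_est, h_gt, width, height, threshold):
    """True if the mean corner error is within the pixel threshold."""
    return mean_corner_error(h_est, h_gt, width, height) <= threshold

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]  # off by (3, 4)
error = mean_corner_error(shifted, identity, width=640, height=480)
print(error, corner_correct(shifted, identity, 640, 480, threshold=10))
```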

F. understanding deepmatcher

DeepMatcher-L achieves dense and accurate matching performance
Interleaving SlimFormer can effectively integrate intra-/inter-image information
DeepMatcher families have competitive inference speed and minimum computational complexity of the attention layer
SlimFormer pays attention to prominent keypoints at object boundaries

G. ablation study

DeepMatcher performance is improved by all components
Layer-scale strategy simulates human behavior
Relative position encoding is more conducive to scene parsing
FMM achieves superior performance compared to coarse-to-fine module
