Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Local feature matching between images is difficult, especially when there are significant appearance variations.
- DeepMatcher is a deep Transformer-based network that captures more human-intuitive and simpler-to-match features.
- SlimFormer leverages vector-based attention to model relevance among all keypoints and relative position encoding is applied to each SlimFormer.
- Feature Transition Module (FTM) and Fine Matches Module are used to generate robust and accurate matches.
- DeepMatcher outperforms the state-of-the-art methods on several benchmarks.
Paper Content
I. introduction
- Local feature matching is a prerequisite for a variety of computer vision applications
- Detector-based matching is typically accomplished by detecting and describing a set of sparse keypoints
- Detector-free matching seeks to establish correspondences directly from original images by extracting visual descriptors on dense grids
- Transformer has recently attracted considerable interest in computer vision
- LoFTR updates features by repeatedly interleaving the self-and cross-attention layers
- Intriguing issue arises: could we build a deeper yet compact local feature matcher?
- Obstacles hinder us from developing a deep local feature matcher for detector-free methods
- Proposed DeepMatcher to produce more human-intuitive and simpler-to-match features
- DeepMatcher utilizes a CNN network, Feature Transition Module, Slimming Transformer, and Coarse/Fine Matches Module
Ii. related work
A. detector-based methods
- Conventional pipeline of detector-based matching systems detects two sets of keypoints and describes them with high-dimensional vectors
- Handcrafted descriptors are fragile when dealing with image pairs with extreme appearance variations
- Deep learning used to extract robust feature representations
- SuperPoint builds a dataset of pseudoground truth interest point locations
- D2-Net makes collected keypoints more stable
- Nearest neighbor search and robust estimator used to find matches between retrieved keypoints
- Local feature matching interpreted as graph matching problem involving two sets of features
- Self-and cross-attention layers in Transformer used to exchange global visual and geometric messages across nodes
- Sinkhorn algorithm used to generate matches according to soft assignment matrixes
- Matrix multiplication in vanilla Transformer results in quadratic complexity
- SGMNet and ClusterGNN attempt to ameliorate structure of SuperGlue
- Detector-based approaches unable to extract repeated keypoints when dealing with large appearance variations
B. detector-free methods
- Detector-free methods generate dense matches directly from original images
- Earlier detector-free matching researches use CNN based on correlation or cost volume
- Matchformer proposes a human-intuitive extract-and-match scheme
- QuadTree proposes a novel Transformer structure
- TopicFM applies a topic-modeling strategy to encode high-level contexts
- ASpanFormer proposes a Transformer-based detector-free architecture
- Detector-free methods generate repeatable keypoints in indistinct regions
C. efficient transformer
- Vanilla Transformer has quadratic memory cost for long sequences.
- Several approaches have been proposed to improve Transformer efficiency.
Iii. methodology
A. overall
- Humans match images by scanning them back and forth.
- DeepMatcher is a deep Transformer-based network.
- Features are extracted from images and fed into SlimFormer.
- Relative position encoding is used to enhance DeepMatcher’s ability.
- Layer-scale strategy is used to simulate human behavior.
- Coarse matches are established and optimized to fine matches.
B. local feature extractor
- Standard convolutional neural network (CNN) with FPN used to extract coarse-level and fine-level features from image pair
- Pixel coordinates of keypoints viewed as central position of 8x8 grids in original images
C. feature transition module (ftm)
- Construct graph neural network (GNN) and propose SlimFormer
- Gap between feature extractor and SlimFormer in terms of context ranges
- Representing features at multiple scales is critical for discriminating objects
- Feature Transition Module (FTM) inserted between local feature extractor and SlimFormer to adjust receptive fields of extracted features
D. slimming transformer (slimformer)
- Flattened enhanced features are used as input sequence for deep feature aggregation
- Keypoints with features in image pairs are viewed as nodes to construct GNN
- More observations between images can result in more precise matches
- Deep feature interaction is essential for local features matching task
- Position information disappears when Transformer layers grow deeper
- Humans associate objects by referring to their relative positions
- Linear Transformer used in LoFTR uses context-agnostic manner to approximate self-attention
- SlimFormer leverages relative position information and global context information to boost DeepMatcher
- Vector-based Attention Layer models long-range interactions among pixel tokens
- Feed-forward Network extracts discriminative features for deep features aggregation
- Layer Scale Strategy balances original features and enhanced message
- Relative Position Encoding (RPE) is used to distinguish identical features
- Self-/Cross-SlimFormer integrates intra-/inter-image information
- Soft assignment score and Mutual Nearest Neighbor (MNN) criteria are used to derive coarse matches
F. fine matches module (fmm)
- Established coarse matches are refined to original picture resolution using a coarse-to-fine module.
- Match refinement is a combination of classification and regression problems.
- A network is used to predict the offset and confidence of the predicted coarse matches.
G. loss
- DeepMatcher generates final dense matches according to soft assignment matrix G and offset ∆
- Total loss L comprises of matching loss Lm, regression loss Lr, and classification loss Lc
- Matching loss is calculated using focal loss and ground truth matches
- Regression loss is calculated using predicted matches and ground truth offset
- Classification loss is calculated using predicted matches with ground truth offset less than predefined threshold
- Training scheme for Scannet uses AdamW solver, gradient clipping, and learning rate warm-up
- Training scheme for MegaDepth uses AdamW solver, linear learning rate warm-up, and learning rate decay
B. indoor pose estimation
- Indoor pose estimation task is hampered by motion blur and significant viewpoint shifts
- ScanNet dataset used to validate effectiveness of DeepMatcher on indoor pose estimation task
- Evaluation protocol used to report area under cumulative curve (AUC) of pose errors at thresholds (5•, 10•, 20•)
- Detector-free methods achieve superior performance than detector-based methods
- DeepMatcher and DeepMatcher-L outperform all cutting-edge detector-based and detector-free methods
Local features
- DeepMatcher-L outperforms SuperPoint and DenseGAP in terms of AUC@5, 10, 20
- DeepMatcher-L outperforms LoFTR by 5.26%, 5.45%, 4.87%
- DeepMatcher-L outperforms ASpanFormer by 1.72%, 0.25%
- DeepMatcher-L consumes 77.65% GFLOPs and is 26.95% faster than ASpanFormer
C. outdoor pose estimation
- Outdoor pose estimation is a challenging task due to 3D geometry, illumination and viewpoint changes.
- An outdoor pose estimation experiment was conducted using the MegaDepth dataset which contains 1M images from 196 scenes.
- DeepMatcher families outperformed other methods in all evaluation metrics.
D. image matching
- Image matching plays an important role in computer science applications
- Experiment conducted on HPatches dataset
- Mean matching accuracy (MMA) used as metric
- DeepMatcher families outperform detector-free methods
- DeepMatcher yields inferior performance at low thresholds, better performance at higher thresholds
- DeepMatcher exhibits superior robustness to viewpoint variations
E. homography estimation
- Experiment conducted to evaluate performance of DeepMatcher
- Dataset used: HPatches
- Evaluation Protocol: Corner Correctness Metric (CCM)
- Results: DeepMatcher outperforms other methods when handling extreme viewpoint changes and illumination changes
F. understanding deepmatcher
- DeepMatcher-L achieves dense and accurate matching performance
- Interleaving SlimFormer can effectively integrate intra-/inter-image information
- DeepMatcher families have competitive inference speed and minimum computational complexity of the attention layer
- SlimFormer pays attention to prominent keypoints at object boundaries
G. ablation study
- DeepMatcher performance is improved by all components
- Layer-scale strategy simulates human behaviour
- Relative position encoding is more conducive to scene parsing
- FMM achieves superior performance compared to coarse-to-fine module