Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Heavy computation is a bottleneck for deep-learning based feature matching algorithms.
- Existing lightweight networks cannot address classical feature matching tasks.
- This paper proposes two concepts: ParaFormer and a graph based U-Net architecture with attentional pooling.
- ParaFormer fuses features and keypoint positions and integrates self- and cross-attention.
- U-Net architecture and proposed attentional pooling reduce computational complexity.
- Experiments demonstrate ParaFormer achieves state-of-the-art performance while maintaining high efficiency.
- ParaFormer-U variant achieves comparable performance with less than 50% FLOPs of existing attention-based models.
Paper Content
Introduction
- Feature matching is a fundamental problem for many computer vision tasks
- It is challenging to find invariance and get robust matches from two images
- Feature matching pipelines can be categorized into detector-based and detector-free methods
- Attention-based networks are dominant methods in both pipelines
- We propose parallel attention to compute self-and cross-attention synchronously
- We explore weight sharing and attention weight sharing strategies to reduce parameters and computations
- We propose attentional pooling to reduce FLOPs with minimal performance loss
Related works
- Classical feature matching is a detector-based pipeline
- Handcrafted methods and learning-based detectors have been proposed to improve robustness
- SuperGlue was the first to propose an attention-based feature matching network
- OETR further constrains attention-based feature matching
- LoFTR applies self-and cross-attention directly on feature maps
- MatchFormer abandons the CNN backbone and uses a hierarchical framework
- MatchFormer proposes an interleaving strategy for self-and cross-attention
- Position encoder allows the network to sense the relative or absolute position of each vector
- Position encoding methods use fixed sine and cosine functions or learnable parameters
- Relative position encoding adjusts attention weights with relative position
- Convolution-based position encoders use convolution to augment local features
- SuperGlue uses MLP to extend the coordinate vector to align with the descriptor
- U-Net architecture consists of an encoder-decoder structure
- Graph U-Nets propose the graph pooling layer to enable downsampling on graph data
Methodology
- Method dynamically fuses positions and descriptors in amplitude and phase manner
- Parallel attention module computes self and cross-attention synchronously
- Utilizes global information to enhance representation capability of features
- Enhanced descriptors are matched by optimal matching layer using Sinkhorn algorithm
Wave position encoder
- MLP-PE has limited encoding capacity
- Wave-PE is designed to dynamically adjust the relationship between descriptor and position
- Wave-PE is represented as a wave with amplitude and phase information
- Three learnable networks are used to estimate amplitude and phase and fuse real and imaginary parts into position encoding
Parallel attention
- Linear projection of two sets of descriptors to form Q, K, V
- Self-attention module uses standard attention computation
- Cross-attention module uses weight sharing strategy to improve efficiency
U-net architecture
- ParaFormer-U is designed for efficiency
- Spatial downsampling and upsampling are used to extract and recover information
- Attentional pooling is proposed for downsampling
- Attention map shows important context points in the image
- Unpooling operation inserts current feature matrix into empty feature matrix
Implementation details
- Pretrained on R1M dataset
- Finetuned on MegaDepth dataset
- AdamW optimizer used
- Batch size of 8 and initial learning rate of 0.0001
- ParaFormer outperforms outlier rejection methods and attention-based matcher
- ParaFormer boosts performance by integrating self-and cross-attention
- ParaFormer-U alleviates computational complexity while maintaining performance
Image matching
- Evaluated methods on 108 HPatches sequences
- Baseline methods include learning-based descriptors and advanced matchers
- Match considered correct if reprojection error is below matching threshold
- Mean matching accuracy (MMA) is average percentage of correct matches for each image
- ParaFormer achieves best overall performance at matching thresholds of 5 or more pixels
- Detector-based methods better at handling scenarios with large viewpoint changes, detector-free methods better suited to address illumination changes
- ParaFormer outperforms LoFTR in illumination change experiments
Paraformer structural study
- Conducted ablation study on R1M dataset
- Compared FLOPs and runtimes between methods
- Parallel attention layer and Wave-PE used
- Improved performance with weight sharing
- Attentional pooling improved performance and saved parameters
Conclusion
- Propose a novel attention-based network called ParaFormer
- Wave-PE fuses features and positions in amplitude and phase manner
- ParaFormer uses parallel attention architecture
- Weight sharing and attention weight sharing strategies save parameters and computations
- ParaFormer-U uses U-Net architecture to reduce FLOPs
- Delivers state-of-the-art performance with remarkable efficiency
- Ablation studies conducted to analyze main designs, weight sharing, attention weight sharing, and pooling