Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Heavy computation is a bottleneck for deep-learning-based feature matching algorithms.
Existing lightweight networks cannot address classical feature matching tasks.
This paper proposes two concepts: ParaFormer and a graph-based U-Net architecture with attentional pooling.
ParaFormer fuses features and keypoint positions and integrates self- and cross-attention.
The U-Net architecture and the proposed attentional pooling reduce computational complexity.
Experiments demonstrate that ParaFormer achieves state-of-the-art performance while maintaining high efficiency.
The ParaFormer-U variant achieves comparable performance with less than 50% of the FLOPs of existing attention-based models.

Paper Content

Introduction

Feature matching is a fundamental problem for many computer vision tasks
It is challenging to find invariance and get robust matches from two images
Feature matching pipelines can be categorized into detector-based and detector-free methods
Attention-based networks are dominant methods in both pipelines
We propose parallel attention to compute self- and cross-attention synchronously
We explore weight sharing and attention weight sharing strategies to reduce parameters and computations
We propose attentional pooling to reduce FLOPs with minimal performance loss

Related works

Classical feature matching is a detector-based pipeline
Handcrafted methods and learning-based detectors have been proposed to improve robustness
SuperGlue was the first to propose an attention-based feature matching network
OETR further constrains attention-based feature matching
LoFTR applies self- and cross-attention directly on feature maps
MatchFormer abandons the CNN backbone and uses a hierarchical framework
MatchFormer proposes an interleaving strategy for self- and cross-attention
A position encoder allows the network to sense the relative or absolute position of each vector
Position encoding methods use fixed sine and cosine functions or learnable parameters
Relative position encoding adjusts attention weights with relative position
Convolution-based position encoders use convolution to augment local features
SuperGlue uses an MLP to extend the coordinate vector so that it aligns with the descriptor
The U-Net architecture consists of an encoder-decoder structure
Graph U-Nets introduce a graph pooling layer to enable downsampling on graph data
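
The fixed sine and cosine position encoding mentioned above can be sketched as follows. This is a minimal illustration for a single scalar position, assuming the base of 10000 used in the original Transformer formulation; the function name `sinusoidal_encoding` is hypothetical and is not taken from the paper.

```python
import math
from typing import List


def sinusoidal_encoding(position: float, dim: int) -> List[float]:
    """Fixed sine/cosine encoding of one scalar position (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(dim // 2):
        # Assumption: geometrically spaced frequencies with base 10000.
        freq = 1.0 / (10000.0 ** (2 * i / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding
```

Because the encoding is a fixed function of the position, it adds no learnable parameters, which is the contrast with the learnable and MLP-based encoders listed above.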

Methodology

The method dynamically fuses positions and descriptors as amplitude and phase
The parallel attention module computes self- and cross-attention synchronously
It uses global information to enhance the representation capability of the features
The enhanced descriptors are matched by an optimal matching layer that uses the Sinkhorn algorithm
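
The Sinkhorn step in the optimal matching layer can be sketched as alternating row and column normalisation of a score matrix. This is a simplified illustration, assuming a square score matrix and no unmatched ("dustbin") entries; the function name `sinkhorn` and the iteration count are assumptions, not details from the paper.

```python
import math
from typing import List


def sinkhorn(scores: List[List[float]], iterations: int = 50) -> List[List[float]]:
    """Turn a square score matrix into an approximately doubly stochastic one."""
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    p = [[math.exp(s - top) for s in row] for row in scores]
    for _ in range(iterations):
        # Normalise every row to sum to 1.
        p = [[v / sum(row) for v in row] for row in p]
        # Normalise every column to sum to 1.
        col_sums = [sum(col) for col in zip(*p)]
        p = [[v / c for v, c in zip(row, col_sums)] for row in p]
    return p
```

Each entry of the result can be read as a soft assignment between a keypoint in one image and a keypoint in the other; a matcher would typically keep only confident, mutually best entries.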

Wave position encoder

MLP-PE has limited encoding capacity
Wave-PE is designed to dynamically adjust the relationship between descriptor and position
Wave-PE represents the position encoding as a wave with amplitude and phase information
Three learnable networks are used: two estimate the amplitude and the phase, and a third fuses the real and imaginary parts into the position encoding
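
The amplitude and phase idea can be sketched with Euler's formula, A * e^(i*theta) = A*cos(theta) + i*A*sin(theta). The sketch below is an assumption-laden illustration: toy linear layers stand in for the three learnable networks, the amplitude is assumed to be estimated from the descriptor and the phase from the keypoint position, and all names (`linear`, `wave_position_encoding`) are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector) -> Callable[[Vector], Vector]:
    """Toy linear layer y = W x + b, standing in for a learnable network."""
    def apply(x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(weights, bias)]
    return apply


def wave_position_encoding(descriptor: Vector, position: Vector,
                           amplitude_net: Callable[[Vector], Vector],
                           phase_net: Callable[[Vector], Vector],
                           fuse_net: Callable[[Vector], Vector]) -> Vector:
    """Fuse descriptor and position as a wave with amplitude and phase."""
    amplitude = amplitude_net(descriptor)  # assumption: A comes from the descriptor
    phase = phase_net(position)            # assumption: theta comes from the position
    # Euler's formula splits the wave into real and imaginary parts.
    real = [a * math.cos(t) for a, t in zip(amplitude, phase)]
    imag = [a * math.sin(t) for a, t in zip(amplitude, phase)]
    # The third network fuses both parts into the final position encoding.
    return fuse_net(real + imag)


# Usage with hand-picked toy weights (not trained values).
identity = linear([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
fuse = linear([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])
encoding = wave_position_encoding([1.0, 2.0], [0.0, 0.0], identity, identity, fuse)
```

With a zero phase the imaginary part vanishes and the encoding reduces to the amplitude, which shows how the position modulates the descriptor-derived signal rather than being added to it as a separate vector.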

Parallel attention

The two sets of descriptors are linearly projected to form Q, K, and V
The self-attention module uses the standard attention computation
The cross-attention module uses a weight-sharing strategy to improve efficiency
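
The sketch below illustrates computing self- and cross-attention in the same layer from one shared set of projections. It is a single-head illustration under stated assumptions: the projection matrices `w_q`, `w_k`, `w_v` are shared between both attention types and both images, and the two outputs are simply concatenated. How the paper fuses them is not described in the summary above, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Standard scaled dot-product attention."""
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def parallel_attention(desc_a: Matrix, desc_b: Matrix,
                       w_q: Matrix, w_k: Matrix, w_v: Matrix) -> Tuple[Matrix, Matrix]:
    # Assumption: one shared set of projections serves both attention types.
    q_a, k_a, v_a = matmul(desc_a, w_q), matmul(desc_a, w_k), matmul(desc_a, w_v)
    q_b, k_b, v_b = matmul(desc_b, w_q), matmul(desc_b, w_k), matmul(desc_b, w_v)
    # Self- and cross-attention are computed side by side, not in sequence.
    self_a, cross_a = attention(q_a, k_a, v_a), attention(q_a, k_b, v_b)
    self_b, cross_b = attention(q_b, k_b, v_b), attention(q_b, k_a, v_a)
    # Assumption: the two outputs are concatenated per keypoint.
    out_a = [s + c for s, c in zip(self_a, cross_a)]
    out_b = [s + c for s, c in zip(self_b, cross_b)]
    return out_a, out_b
```

Because the queries, keys, and values are computed once and reused, the cross-attention branch adds no projection parameters of its own in this sketch.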

U-Net architecture

ParaFormer-U is designed for efficiency
Spatial downsampling and upsampling are used to extract and recover information
Attentional pooling is proposed for downsampling
Attention map shows important context points in the image
The unpooling operation inserts the current feature matrix into an empty feature matrix
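
Pooling and unpooling can be sketched as a top-k selection followed by a scatter back into a zero matrix. The sketch assumes that a keypoint's importance is the sum of the attention it receives (the column sum of the attention map) and that unpooling restores rows at their original indices; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def attentional_pool(features: Matrix, attn: Matrix, k: int) -> Tuple[Matrix, List[int]]:
    """Keep the k keypoints that receive the most attention."""
    # Assumption: importance = total attention received (column sum).
    scores = [sum(col) for col in zip(*attn)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original ordering
    return [features[i] for i in keep], keep


def unpool(pooled: Matrix, keep: List[int], n: int) -> Matrix:
    """Insert pooled rows into an empty (all-zero) feature matrix of n rows."""
    dim = len(pooled[0])
    full = [[0.0] * dim for _ in range(n)]
    for row, idx in zip(pooled, keep):
        full[idx] = list(row)
    return full


# Usage: three keypoints, keep the two most attended ones.
features = [[1.0], [2.0], [3.0]]
attn = [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.1, 0.7, 0.2]]
pooled, keep = attentional_pool(features, attn, k=2)
restored = unpool(pooled, keep, n=len(features))
```

Attention layers that run between pooling and unpooling operate on k keypoints instead of n, which is where the FLOP savings of a U-Net style design come from.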

Implementation details

Pretrained on R1M dataset
Finetuned on MegaDepth dataset
AdamW optimizer used
Batch size of 8 and initial learning rate of 0.0001
ParaFormer outperforms outlier rejection methods and attention-based matchers
ParaFormer boosts performance by integrating self- and cross-attention
ParaFormer-U alleviates computational complexity while maintaining performance

Image matching

Methods are evaluated on 108 HPatches sequences
Baseline methods include learning-based descriptors and advanced matchers
A match is considered correct if its reprojection error is below the matching threshold
Mean matching accuracy (MMA) is the average percentage of correct matches per image
ParaFormer achieves the best overall performance at matching thresholds of 5 or more pixels
Detector-based methods are better at handling scenarios with large viewpoint changes, while detector-free methods are better suited to illumination changes
ParaFormer outperforms LoFTR in illumination change experiments
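
The MMA metric described above can be sketched as follows. The sketch assumes each image pair comes with a ground-truth 3x3 homography that maps points from the first image to the second, as HPatches provides; the function names and the data layout are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Match = Tuple[Point, Point]
Homography = List[List[float]]


def reprojection_error(pt_a: Point, pt_b: Point, h: Homography) -> float:
    """Distance between pt_b and pt_a warped into the second image by h."""
    x, y = pt_a
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return math.hypot(u / w - pt_b[0], v / w - pt_b[1])


def matching_accuracy(matches: List[Match], h: Homography, threshold: float) -> float:
    """Fraction of matches whose reprojection error is below the threshold."""
    if not matches:
        return 0.0
    correct = sum(1 for a, b in matches if reprojection_error(a, b, h) < threshold)
    return correct / len(matches)


def mean_matching_accuracy(pairs: List[Tuple[List[Match], Homography]],
                           threshold: float) -> float:
    """Average the per-image-pair accuracy over all pairs."""
    return sum(matching_accuracy(m, h, threshold) for m, h in pairs) / len(pairs)
```

Sweeping `threshold` over a range of pixel values produces the accuracy-versus-threshold curves on which the comparison at 5 or more pixels is read.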

ParaFormer structural study

An ablation study is conducted on the R1M dataset
FLOPs and runtimes are compared across methods
Parallel attention layer and Wave-PE used
Improved performance with weight sharing
Attentional pooling improved performance and saved parameters

Conclusion

The paper proposes a novel attention-based network called ParaFormer
Wave-PE fuses features and positions as amplitude and phase
ParaFormer uses parallel attention architecture
Weight sharing and attention weight sharing strategies save parameters and computations
ParaFormer-U uses U-Net architecture to reduce FLOPs
ParaFormer delivers state-of-the-art performance with remarkable efficiency
Ablation studies conducted to analyze main designs, weight sharing, attention weight sharing, and pooling
