Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Heavy computation is a bottleneck for deep-learning based feature matching algorithms.
  • Existing lightweight networks cannot address classical feature matching tasks.
  • This paper proposes two concepts: ParaFormer and a graph based U-Net architecture with attentional pooling.
  • ParaFormer fuses features and keypoint positions and integrates self- and cross-attention.
  • U-Net architecture and proposed attentional pooling reduce computational complexity.
  • Experiments demonstrate ParaFormer achieves state-of-the-art performance while maintaining high efficiency.
  • The ParaFormer-U variant achieves comparable performance with less than 50% of the FLOPs of existing attention-based models.

Paper Content

Introduction

  • Feature matching is a fundamental problem for many computer vision tasks
  • It is challenging to find invariance and get robust matches from two images
  • Feature matching pipelines can be categorized into detector-based and detector-free methods
  • Attention-based networks are dominant methods in both pipelines
  • We propose parallel attention to compute self- and cross-attention synchronously
  • We explore weight sharing and attention weight sharing strategies to reduce parameters and computations
  • We propose attentional pooling to reduce FLOPs with minimal performance loss
  • Classical feature matching is a detector-based pipeline
  • Handcrafted methods and learning-based detectors have been proposed to improve robustness
  • SuperGlue was the first to propose an attention-based feature matching network
  • OETR further constrains attention-based feature matching
  • LoFTR applies self- and cross-attention directly on feature maps
  • MatchFormer abandons the CNN backbone and uses a hierarchical framework
  • MatchFormer proposes an interleaving strategy for self- and cross-attention
  • Position encoder allows the network to sense the relative or absolute position of each vector
  • Position encoding methods use fixed sine and cosine functions or learnable parameters
  • Relative position encoding adjusts attention weights with relative position
  • Convolution-based position encoders use convolution to augment local features
  • SuperGlue uses an MLP to extend the coordinate vector so that it aligns with the descriptor
  • U-Net architecture consists of an encoder-decoder structure
  • Graph U-Nets propose the graph pooling layer to enable downsampling on graph data

Methodology

  • The method dynamically fuses positions and descriptors in an amplitude-and-phase manner
  • The parallel attention module computes self- and cross-attention synchronously
  • Utilizes global information to enhance representation capability of features
  • Enhanced descriptors are matched by optimal matching layer using Sinkhorn algorithm

Wave position encoder

  • MLP-PE has limited encoding capacity
  • Wave-PE is designed to dynamically adjust the relationship between descriptor and position
  • Wave-PE is represented as a wave with amplitude and phase information
  • Three learnable networks are used to estimate amplitude and phase and fuse real and imaginary parts into position encoding
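
A rough sketch of how the three networks could fit together is given below. It makes several assumptions the summary does not confirm: the amplitude is estimated from the descriptor, the phase from the keypoint position, each "network" is reduced to a single linear layer, and the fused encoding is added back to the descriptor. All names and weights are hypothetical.

```python
import math

def linear(x, W, b):
    """y = W x + b for one vector x, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def wave_pe(desc, pos, amp_net, phase_net, fuse_net):
    amplitude = linear(desc, *amp_net)   # assumption: amplitude from descriptor
    phase = linear(pos, *phase_net)      # assumption: phase from position
    # Euler's formula: A * exp(i * theta) = A cos(theta) + i A sin(theta).
    real = [a * math.cos(t) for a, t in zip(amplitude, phase)]
    imag = [a * math.sin(t) for a, t in zip(amplitude, phase)]
    # The third network fuses the real and imaginary parts.
    encoding = linear(real + imag, *fuse_net)
    # Assumption: the encoding is added to the descriptor as a residual.
    return [d + e for d, e in zip(desc, encoding)]

# Hypothetical fixed weights standing in for learned parameters.
amp_net = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
phase_net = ([[math.pi / 2, 0.0], [0.0, math.pi / 2]], [0.0, 0.0])
fuse_net = ([[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]], [0.0, 0.0])

desc = [1.0, 2.0]
out_origin = wave_pe(desc, [0.0, 0.0], amp_net, phase_net, fuse_net)
out_shifted = wave_pe(desc, [1.0, 1.0], amp_net, phase_net, fuse_net)
print(out_origin, out_shifted)
```

The same descriptor yields different outputs at different positions because the position changes the phase, which is the sense in which the encoding adjusts the descriptor-position relationship dynamically.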

Parallel attention

  • Linear projection of two sets of descriptors to form Q, K, V
  • Self-attention module uses standard attention computation
  • Cross-attention module uses weight sharing strategy to improve efficiency
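
The three bullets above can be sketched as follows. This is one possible reading, not the paper's implementation: it assumes a single attention head, that "weight sharing" means the same Q, K and V projections serve both branches, and that the self- and cross-attention outputs are simply concatenated. The names `parallel_attention`, `XA`, `XB`, `Wq`, `Wk` and `Wv` are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

def parallel_attention(XA, XB, Wq, Wk, Wv):
    # One set of projections is applied to both descriptor sets (assumption).
    QA, KA, VA = matmul(XA, Wq), matmul(XA, Wk), matmul(XA, Wv)
    QB, KB, VB = matmul(XB, Wq), matmul(XB, Wk), matmul(XB, Wv)
    # Self- and cross-attention are computed from the same Q, K, V.
    self_A, cross_A = attention(QA, KA, VA), attention(QA, KB, VB)
    self_B, cross_B = attention(QB, KB, VB), attention(QB, KA, VA)
    # Concatenate the two branches per keypoint (assumed fusion).
    out_A = [s + c for s, c in zip(self_A, cross_A)]
    out_B = [s + c for s, c in zip(self_B, cross_B)]
    return out_A, out_B

identity = [[1.0, 0.0], [0.0, 1.0]]
XA = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 descriptors in image A
XB = [[0.5, 0.5], [2.0, 0.0]]               # 2 descriptors in image B
out_A, out_B = parallel_attention(XA, XB, identity, identity, identity)
print(len(out_A), len(out_A[0]), len(out_B), len(out_B[0]))
```

Because both branches reuse the same projected queries, keys and values, the projections are computed once per image rather than once per attention type.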

U-Net architecture

  • ParaFormer-U is designed for efficiency
  • Spatial downsampling and upsampling are used to extract and recover information
  • Attentional pooling is proposed for downsampling
  • Attention map shows important context points in the image
  • Unpooling operation inserts current feature matrix into empty feature matrix
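
A minimal sketch of the pooling and unpooling steps follows. It assumes that a point's importance is the total attention it receives (the column sum of the attention map) and that the "empty feature matrix" is filled with zeros. The function names and the toy numbers are hypothetical.

```python
def attentional_pool(features, attn, k):
    """Keep the k points that receive the most attention.

    attn[i][j] is how strongly point i attends to point j.
    """
    importance = [sum(col) for col in zip(*attn)]
    ranked = sorted(range(len(features)), key=lambda j: -importance[j])
    keep = sorted(ranked[:k])            # keep original ordering of points
    return [features[j] for j in keep], keep

def unpool(pooled, keep, n_points):
    """Insert pooled rows into a zero matrix at their original indices."""
    dim = len(pooled[0])
    full = [[0.0] * dim for _ in range(n_points)]
    for row, j in zip(pooled, keep):
        full[j] = list(row)
    return full

features = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
attn = [[0.1, 0.6, 0.1, 0.2],
        [0.1, 0.5, 0.1, 0.3],
        [0.2, 0.5, 0.1, 0.2],
        [0.1, 0.4, 0.1, 0.4]]

pooled, keep = attentional_pool(features, attn, k=2)
restored = unpool(pooled, keep, n_points=len(features))
print(keep, restored)
```

Storing the kept indices during pooling is what lets the unpooling step place each feature back at its original position.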

Implementation details

  • Pretrained on R1M dataset
  • Finetuned on MegaDepth dataset
  • AdamW optimizer used
  • Batch size of 8 and initial learning rate of 0.0001
  • ParaFormer outperforms outlier rejection methods and attention-based matcher
  • ParaFormer boosts performance by integrating self- and cross-attention
  • ParaFormer-U alleviates computational complexity while maintaining performance

Image matching

  • Evaluated methods on 108 HPatches sequences
  • Baseline methods include learning-based descriptors and advanced matchers
  • A match is considered correct if its reprojection error is below the matching threshold
  • Mean matching accuracy (MMA) is the average percentage of correct matches per image
  • ParaFormer achieves best overall performance at matching thresholds of 5 or more pixels
  • Detector-based methods are better at handling large viewpoint changes; detector-free methods are better suited to illumination changes
  • ParaFormer outperforms LoFTR in illumination change experiments
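
The MMA definition above can be made concrete with a short sketch. It assumes the reprojection errors (in pixels) have already been computed for every match, treats "below" as a strict inequality, and skips images with no matches. The function name and the error values are hypothetical and are not results from the paper.

```python
def mean_matching_accuracy(errors_per_image, thresholds):
    """Average, over images, of the fraction of matches under each threshold."""
    mma = {}
    for t in thresholds:
        per_image = [
            sum(e < t for e in errors) / len(errors)
            for errors in errors_per_image
            if errors  # assumption: images without matches are skipped
        ]
        mma[t] = sum(per_image) / len(per_image)
    return mma

# Hypothetical reprojection errors for two image pairs.
errors_per_image = [[0.5, 2.0, 7.0, 12.0],
                    [1.0, 4.0]]
mma = mean_matching_accuracy(errors_per_image, thresholds=[1, 3, 5, 10])
print(mma)
```

Averaging per image first, rather than pooling all matches, keeps image pairs with many matches from dominating the score.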

ParaFormer structural study

  • Conducted ablation study on R1M dataset
  • Compared FLOPs and runtimes between methods
  • Parallel attention layer and Wave-PE used
  • Improved performance with weight sharing
  • Attentional pooling improved performance and saved parameters

Conclusion

  • Propose a novel attention-based network called ParaFormer
  • Wave-PE fuses features and positions in an amplitude-and-phase manner
  • ParaFormer uses parallel attention architecture
  • Weight sharing and attention weight sharing strategies save parameters and computations
  • ParaFormer-U uses U-Net architecture to reduce FLOPs
  • Delivers state-of-the-art performance with remarkable efficiency
  • Ablation studies conducted to analyze main designs, weight sharing, attention weight sharing, and pooling