Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Existing methods detect keypoints in a non-differentiable way
  • Proposed method is partially differentiable and outputs accurate sub-pixel keypoints
  • Reprojection loss and dispersity peak loss proposed to optimize and regularize keypoints
  • Descriptors extracted in a sub-pixel way and trained with neural reprojection error loss
  • Lightweight network designed for keypoint detection and descriptor extraction, runs at 95 frames per second

Paper Content

I. introduction

  • Keypoints and descriptors are compact representations used in real-time visual applications.
  • Early algorithms were based on limited human heuristics.
  • Neural networks are explored for keypoint detection and descriptor extraction.
  • Keypoints are detected from a score map, but the detection step is non-differentiable, so gradients can’t be back-propagated to the keypoint positions.
  • Differentiable Keypoint Detection (DKD) is proposed to optimize the position of detected keypoints.
  • Neural Reprojection Error (NRE) loss is used to train the estimated dense descriptor map.
  • Lightweight network is designed for efficient keypoint detection and descriptor extraction.
  • Experiments show comparable performance to state-of-the-art approaches.
  • Related work falls into three groups: patch-based, score map based, and description-and-detection deep learning methods.

A. patch-based methods

  • Early patch-based methods only extract descriptors from image patches
  • MatchNet estimates the similarity of descriptors and trains them with cross entropy loss
  • TFeat introduces triplet loss for patch descriptors and is widely used in later patch-based methods
  • L2-Net presents progressive sampling strategy for triplet sampling
  • LIFT mimics the SIFT pipeline with neural networks
  • HardNet and SOSNet introduce hardest negative triplet and second order similarity of descriptors
  • Patch-based methods focus only on descriptor extraction and have a limited receptive field

B. score map based methods

  • Estimate score map and descriptor map
  • Tilde [18] trains score map on webcam dataset
  • Quad-networks [19] trains score map by ranking scores
  • KeyNet [34] estimates score map with handcrafted and learned features
  • LFNet [17], SuperPoint [12], MLIFeat [31], SEKD [20] train on score map
  • R2D2 [13] identifies keypoints and trains reliability
  • HDD-Net [36] weights features with softargmax scores
  • DISK [37] and reinforced SP [38] relax keypoint detection and descriptor matching
  • DKD is similar to KeyNet [34] and utilizes softargmax
  • DKD does not require handcrafted features or pseudo keypoint annotations
  • NRE loss [29] used to train sub-pixel descriptors

C. description-and-detection methods

  • Score map based methods use detection-then-description, while description-and-detection methods recognize keypoints in the descriptor map.
  • D2Net, ASLFeat, UR2KiD and D2D all compute score maps differently.
  • All of these methods use non-differentiable NMS to detect keypoints.

III. methods

  • Network estimates score map and descriptor map from input image
  • Sub-pixel keypoints and descriptors are detected and sampled from score and descriptor maps

A. network architecture

  • Network is designed to be lightweight to improve running efficiency
  • Feature aggregation module assembles multi-level features to retain localization and representation capabilities
  • Feature extraction head estimates score map and descriptor map under original image resolution
  • Image feature encoder encodes input image to feature maps with four blocks

B. differentiable keypoint detection module

  • NMS is a widely used method to detect keypoints in score maps.
  • NMS finds the pixels with the maximum score within local windows.
  • Softargmax couples the keypoints with score map by extracting differentiable keypoints from local windows.
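
The two steps above can be sketched in plain Python: NMS picks integer peak pixels, then a softargmax over the local window around each peak gives a sub-pixel position. This is a minimal illustration, not the paper's implementation. The function names are hypothetical, the window radius of 2 corresponds to the 5 x 5 window and the temperature to t_det = 0.1 from the training details, and keeping only positive-score peaks is an assumption.

```python
import math

def nms_peaks(score, radius=2):
    """Return (row, col) pixels that hold the maximum of their local window."""
    h, w = len(score), len(score[0])
    peaks = []
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            window = [score[y + dy][x + dx]
                      for dy in range(-radius, radius + 1)
                      for dx in range(-radius, radius + 1)]
            # Assumption: ignore flat zero regions by requiring a positive score.
            if score[y][x] > 0 and score[y][x] >= max(window):
                peaks.append((y, x))
    return peaks

def softargmax_refine(score, peak, radius=2, temperature=0.1):
    """Sub-pixel keypoint: softmax-weighted mean offset inside the local window."""
    y0, x0 = peak
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    values = [score[y0 + dy][x0 + dx] for dy, dx in offsets]
    top = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    total = sum(weights)
    off_y = sum(wt * dy for wt, (dy, _) in zip(weights, offsets)) / total
    off_x = sum(wt * dx for wt, (_, dx) in zip(weights, offsets)) / total
    return (y0 + off_y, x0 + off_x)
```

The refined position is a smooth function of the scores in the window, which is what lets a loss on keypoint positions reach the score map when the same computation is written in an automatic differentiation framework. The NMS step itself stays non-differentiable.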

C. learning accurate keypoints

  • Reprojection loss optimizes the position of keypoints
  • Dispersity peak loss ensures that the score is maximal at the keypoint position
  • Keypoints are warped from one image to another using a differentiable warp function
  • Reprojection loss pulls the warped keypoint and its corresponding keypoint together
  • Dispersity peak loss regularizes the scores in the local window to be “peaky”
  • Dispersity peak loss takes the spatial distribution of scores into account
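
As a rough sketch of the two losses above: the reprojection term is a distance between a warped keypoint and its match, and the dispersity peak term penalizes score mass that lies far from the soft keypoint position in the window. The exact formulas, normalization, and symmetric handling of both images in the paper may differ. The function names and the temperature default are assumptions made for illustration.

```python
import math

def reprojection_distance(warped_kp, matched_kp):
    """Distance between a keypoint warped from image A and its match in image B."""
    return math.hypot(warped_kp[0] - matched_kp[0], warped_kp[1] - matched_kp[1])

def dispersity_peak_loss(patch, temperature=0.1):
    """Softmax-weighted mean distance of window pixels to the softargmax position.

    `patch` is a square window of scores with odd side length.
    """
    r = len(patch) // 2
    cells = [(dy, dx, patch[dy + r][dx + r])
             for dy in range(-r, r + 1)
             for dx in range(-r, r + 1)]
    top = max(v for _, _, v in cells)
    weights = [math.exp((v - top) / temperature) for _, _, v in cells]
    total = sum(weights)
    probs = [wt / total for wt in weights]
    # Soft keypoint position (offset from the window center).
    cy = sum(p * dy for p, (dy, _, _) in zip(probs, cells))
    cx = sum(p * dx for p, (_, dx, _) in zip(probs, cells))
    # Mass far from the soft position costs more, so a peaky window scores low.
    return sum(p * math.hypot(dy - cy, dx - cx)
               for p, (dy, dx, _) in zip(probs, cells))
```

A window with a single sharp peak yields a loss near zero, while a flat window yields a loss close to the mean pixel distance from the center.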

D. learning discriminative descriptor

  • Descriptors of the same keypoint should be identical, while descriptors of different keypoints should be distinct.
  • Triplet loss is commonly used to train discriminative descriptors.
  • Triplet loss only optimizes sparse descriptors, so dense descriptor map cannot be fully constrained.
  • NRE loss is used to train with dense descriptor map and provide comprehensive constraint.
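
To illustrate why a dense loss constrains the whole descriptor map, the sketch below builds a matching probability map by taking a softmax over every pixel of the other image and then penalizes low probability at the true correspondence. It is a simplified stand-in for the NRE loss. It assumes unit-length descriptors, uses the nearest pixel rather than sub-pixel sampling for the target location, and reuses t_des = 0.02 from the training details. The function names are hypothetical.

```python
import math

def matching_probability_map(desc, desc_map, temperature=0.02):
    """Softmax over all pixels of the similarity between `desc` and a dense map."""
    sims = [[sum(a * b for a, b in zip(desc, d)) for d in row] for row in desc_map]
    top = max(max(row) for row in sims)  # subtract the maximum for stability
    exps = [[math.exp((s - top) / temperature) for s in row] for row in sims]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def dense_matching_loss(desc, desc_map, target_rc, temperature=0.02):
    """Negative log matching probability at the true correspondence (row, col)."""
    prob = matching_probability_map(desc, desc_map, temperature)
    r, c = target_rc
    return -math.log(prob[r][c])
```

Because the softmax normalizes over every pixel, raising the probability at the correct location lowers it everywhere else, so descriptors at non-keypoint pixels are constrained too. A triplet loss only touches the sampled descriptors.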

E. learning reliable keypoint

  • Reprojection and dispersity peak loss provide accurate and repeatable keypoints
  • Spatial properties of descriptor map are not taken into account, leading to unreliable keypoints
  • Reliability loss based on matching probability map introduced to address this issue
  • Reliability of keypoint assessed by normalized similarity map and bilinear sampling
  • Reliability loss defined for all valid keypoints in image A
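
One plausible reading of the reliability bullets above is sketched here: similarities are normalized into (0, 1], the normalized map is bilinearly sampled at the warped keypoint position to get a reliability value, and keypoints with a high score but low reliability are penalized. The normalization formula, the score weighting, and all function names are assumptions for illustration. Only t_rel = 1 comes from the training details.

```python
import math

def normalized_similarity_map(sim_map, temperature=1.0):
    """Map cosine similarities in [-1, 1] to (0, 1] via exp((s - 1) / t)."""
    return [[math.exp((s - 1.0) / temperature) for s in row] for row in sim_map]

def bilinear_sample(grid, y, x):
    """Bilinearly interpolate a 2D grid (at least 2 x 2) at sub-pixel (y, x)."""
    h, w = len(grid), len(grid[0])
    y0 = min(max(int(math.floor(y)), 0), h - 2)
    x0 = min(max(int(math.floor(x)), 0), w - 2)
    fy, fx = y - y0, x - x0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x0 + 1] * fx
    bottom = grid[y0 + 1][x0] * (1 - fx) + grid[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bottom * fy

def reliability_loss(scores, reliabilities):
    """Score-weighted mean of (1 - reliability) over the valid keypoints."""
    total = sum(scores)
    return sum(s / total * (1.0 - r) for s, r in zip(scores, reliabilities))
```

With this weighting, the loss drops when the score map assigns low scores to keypoints whose descriptors match poorly, which ties keypoint scores to descriptor quality.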

A. datasets

  • MegaDepth dataset includes tourist photos and 3D maps
  • DISK image pairs used to train model: 135 scenes with 63k images in total
  • HPatches dataset used for homography estimation
  • IMW2020 used for camera pose estimation
  • Aachen Day-Night dataset used to evaluate descriptor effectiveness

B. training details

  • Used DKD with window size of 5 to detect 400 keypoints
  • Set normalization temperatures as t_det = 0.1, t_rel = 1, and t_des = 0.02

2) training setups:

  • Images were cropped and resized to 480 x 480
  • Network was trained using the Adam optimizer
  • Learning rate started at 0 and warmed up to 3e-3 in 500 steps
  • Batch size was set to 1, but gradients were accumulated over 16 batches
  • Model converged on NVIDIA Titan RTX in 2 days
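
The warm-up and gradient accumulation settings above can be illustrated with two small helpers. The source only gives the start value, the peak value, and the number of warm-up steps. The linear ramp, the constant rate after warm-up, and the averaging of accumulated gradients are assumptions, and the function names are hypothetical.

```python
def warmup_lr(step, peak_lr=3e-3, warmup_steps=500):
    """Linear ramp from 0 to peak_lr over warmup_steps, then constant (assumed)."""
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps

def accumulate_gradients(per_sample_grads, accumulation=16):
    """Average gradients over groups of `accumulation` samples.

    Each returned value corresponds to one optimizer update, so a batch size
    of 1 with 16 accumulation steps behaves like an effective batch of 16.
    Trailing samples that do not fill a whole group are dropped.
    """
    updates = []
    for start in range(0, len(per_sample_grads) - accumulation + 1, accumulation):
        group = per_sample_grads[start:start + accumulation]
        updates.append(sum(group) / accumulation)
    return updates
```

For example, 32 single-image steps produce two optimizer updates under this scheme.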

C. evaluation metrics

  • Assume image A and image B have N_A and N_B co-visible keypoints
  • Number of co-visible keypoints is defined as N_cov = (N_A + N_B) / 2
  • N_gt: ground truth keypoint pairs with reprojection distance less than 3 pixels
  • N_putative: putative matches obtained with mutual matching of descriptors
  • N_inlier: inlier matches whose reprojection distance is within a given pixel threshold
  • Metrics: Rep = N_gt / N_cov, MS = N_inlier / N_cov, MMA = N_inlier / N_putative, MHA = percentage of correct image corners with estimated homography matrix
  • Ablation studies on network architecture and loss functions
  • Proposed method compared to state-of-the-art methods on homography estimation, camera pose estimation, and visual (re-)localization tasks
  • Network complexity: number of parameters, GFLOPs, and inference FPS
  • Homography estimation: MMA and MHA at stricter thresholds
  • Camera pose estimation: mAA, GFLOPs, and PPC
  • Visual (re-)localization: performance with limited and unlimited features
  • Failure cases in image matching: extreme illumination changes and large viewpoint differences
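
The count-based metrics defined in this section (repeatability, matching score, and mean matching accuracy at one threshold) reduce to simple ratios. The helper below is a minimal sketch with a hypothetical function name. It takes the counts as given and does not compute the matches themselves.

```python
def matching_metrics(n_a, n_b, n_gt, n_putative, n_inlier):
    """Compute Rep, MS, and MMA from keypoint and match counts for one image pair."""
    n_cov = (n_a + n_b) / 2  # number of co-visible keypoints
    return {
        "Rep": n_gt / n_cov,           # repeatability
        "MS": n_inlier / n_cov,        # matching score
        "MMA": n_inlier / n_putative,  # mean matching accuracy at one threshold
    }
```

For an image pair with 100 and 60 co-visible keypoints, 40 ground truth pairs, 50 putative matches, and 30 inliers, this gives Rep = 0.5, MS = 0.375, and MMA = 0.6.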