Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Existing methods detect keypoints in a non-differentiable way
Proposed method is partially differentiable and outputs accurate sub-pixel keypoints
Reprojection loss and dispersity peak loss proposed to optimize and regularize keypoints
Descriptors extracted in a sub-pixel way and trained with neural reprojection error loss
Lightweight network designed for keypoint detection and descriptor extraction, runs at 95 frames per second

Paper Content

I. introduction

Keypoints and descriptors are compact representations used in real-time visual applications.
Early algorithms were based on limited human heuristics.
Neural networks are explored for keypoint detection and descriptor extraction.
Score map is used to detect keypoints, but gradients can’t be back-propagated.
Differentiable Keypoint Detection (DKD) is proposed to optimize the position of detected keypoints.
Neural Reprojection Error (NRE) loss is used to train the estimated dense descriptor map.
Lightweight network is designed for efficient keypoint detection and descriptor extraction.
Experiments show comparable performance to state-of-the-art approaches.

II. related works

Patch-based deep learning methods
Score map based deep learning methods
Description-and-detection deep learning methods

A. patch-based methods

Early patch-based methods only extract descriptors from image patches
MatchNet estimates the similarity of descriptors and trains them with cross-entropy loss
TFeat introduces triplet loss for patch descriptors, which is widely used in later patch-based methods
L2-Net presents a progressive sampling strategy for triplet sampling
LIFT mimics the SIFT pipeline (detection, orientation estimation, description) with learned components
HardNet and SOSNet introduce the hardest-negative triplet and the second-order similarity of descriptors, respectively
Patch-based methods focus only on descriptor extraction and have a limited receptive field

B. score map based methods

Estimate score map and descriptor map
TILDE [18] trains score map on webcam dataset
Quad-networks [19] trains score map by ranking scores
KeyNet [34] estimates score map with handcrafted and learned features
LFNet [17], SuperPoint [12], MLIFeat [31], SEKD [20] train on score map
R2D2 [13] identifies keypoints and trains reliability
HDD-Net [36] weights features with softargmax scores
DISK [37] and reinforced SP [38] relax keypoint detection and descriptor matching
DKD is similar to KeyNet [34] and utilizes softargmax
DKD does not require handcrafted features or pseudo keypoint annotations
NRE loss [29] used to train sub-pixel descriptors

C. description-and-detection methods

Score map based methods use detection-then-description, while description-and-detection methods detect keypoints directly from the dense descriptor (feature) map.
D2Net, ASLFeat, UR2KiD and D2D all compute score maps differently.
All of these methods use non-differentiable NMS to detect keypoints.

III. methods

Network estimates score map and descriptor map from input image
Sub-pixel keypoints and descriptors are detected and sampled from score and descriptor maps

A. network architecture

Network is designed to be lightweight to improve running efficiency
Feature aggregation module assembles multi-level features to retain localization and representation capabilities
Feature extraction head estimates score map and descriptor map under original image resolution
Image feature encoder encodes input image to feature maps with four blocks

B. differentiable keypoint detection module

NMS is a widely used method to detect keypoints in score maps.
NMS finds the pixels with the maximum score within local windows.
Softargmax couples the keypoints with score map by extracting differentiable keypoints from local windows.
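
A minimal sketch of the softargmax refinement described above, in plain Python. It assumes the integer peak (x, y) has already been found by NMS and lies at least `radius` pixels from the image border. The function names and the default temperature of 0.1 are illustrative choices, not the authors' code.

```python
import math


def soft_argmax_offset(window, temperature=0.1):
    """Return the (dx, dy) sub-pixel offset relative to the window centre.

    window: square list of lists of scores with odd side length,
            centred on the integer peak found by NMS.
    """
    radius = len(window) // 2
    s_max = max(max(row) for row in window)
    # Temperature softmax; subtracting the maximum keeps exp() stable.
    weights = [[math.exp((s - s_max) / temperature) for s in row]
               for row in window]
    total = sum(sum(row) for row in weights)
    # Expected column / row offset under the softmax distribution.
    dx = sum(w * (j - radius)
             for row in weights for j, w in enumerate(row)) / total
    dy = sum(w * (i - radius)
             for i, row in enumerate(weights) for w in row) / total
    return dx, dy


def refine_keypoint(score_map, x, y, radius=2, temperature=0.1):
    """Refine an integer NMS peak (x, y) to a sub-pixel position.

    Assumes the peak is at least `radius` pixels away from the border.
    """
    window = [row[x - radius:x + radius + 1]
              for row in score_map[y - radius:y + radius + 1]]
    dx, dy = soft_argmax_offset(window, temperature)
    return x + dx, y + dy
```

Because the offset is a softmax-weighted average of window coordinates, it is differentiable with respect to the scores, which the integer NMS peak is not.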

C. learning accurate keypoints

Reprojection loss optimizes the position of keypoints
Dispersity peak loss ensures that the score is maximal at the keypoint position
Keypoints are warped from one image to another using a differentiable warp function
Reprojection loss pulls the warped keypoint and its corresponding keypoints together
Dispersity peak loss regularizes the scores in the local window to be “peaky”
Dispersity peak loss takes the spatial distribution of scores into account
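
The two losses above can be sketched as follows. This is an illustration under stated assumptions: the warp is modelled as a 3x3 homography purely to keep the example small (the summary only says the warp function is differentiable), the dispersity term is written as the softmax-weighted mean distance to the soft peak with any constant normalization factor omitted, and all function names are hypothetical.

```python
import math


def warp_point(H, p):
    """Warp point p = (x, y) with a 3x3 homography H (assumed warp model)."""
    x, y = p
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def reprojection_loss(p_a, p_b, H_ab, H_ba):
    """Symmetric distance between each keypoint and its warped counterpart."""
    dist_ab = math.dist(warp_point(H_ab, p_a), p_b)
    dist_ba = math.dist(warp_point(H_ba, p_b), p_a)
    return 0.5 * (dist_ab + dist_ba)


def dispersity_peak_loss(window, temperature=0.1):
    """Softmax-weighted mean distance of window pixels to the soft peak.

    Small when the score mass sits on one pixel ("peaky"),
    large when the scores are spread over the window.
    """
    radius = len(window) // 2
    s_max = max(max(row) for row in window)
    weights = [[math.exp((s - s_max) / temperature) for s in row]
               for row in window]
    total = sum(sum(row) for row in weights)
    weights = [[w / total for w in row] for row in weights]
    # Soft peak position relative to the window centre.
    cx = sum(w * (j - radius) for row in weights for j, w in enumerate(row))
    cy = sum(w * (i - radius) for i, row in enumerate(weights) for w in row)
    return sum(w * math.hypot(j - radius - cx, i - radius - cy)
               for i, row in enumerate(weights) for j, w in enumerate(row))
```

A flat window gives a larger dispersity value than a window with a single dominant score, which is the behaviour the regularizer relies on.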

D. learning discriminative descriptor

Descriptors of the same keypoints should be identical, descriptors of different keypoints should be distinct.
Triplet loss is used to train descriptor discriminativeness.
Triplet loss only optimizes sparse descriptors, so dense descriptor map cannot be fully constrained.
NRE loss is used to train with dense descriptor map and provide comprehensive constraint.
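
To illustrate why a loss over the dense descriptor map constrains every pixel, here is a small sketch of an NRE-style cross-entropy. It assumes unit-length descriptors, a softmax over cosine similarities with temperature `t_des`, and a target at the nearest integer pixel rather than a sub-pixel location; these simplifications and the function name are assumptions, not the paper's implementation.

```python
import math


def nre_loss(descriptor, dense_map, target_xy, t_des=0.02):
    """Cross-entropy between a matching probability map and a one-hot target.

    descriptor: unit-length descriptor of a keypoint in image A
    dense_map:  H x W grid (list of rows) of unit-length descriptors of image B
    target_xy:  integer (x, y) pixel where the keypoint reprojects in image B
    """
    # Cosine similarity with every pixel of image B, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(descriptor, d)) / t_des for d in row]
              for row in dense_map]
    flat = [v for row in logits for v in row]
    # Stable log-sum-exp gives the softmax normalizer over the whole map.
    peak = max(flat)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in flat))
    x, y = target_xy
    # Negative log-probability of the true location.
    return log_norm - logits[y][x]
```

Every pixel of the dense map enters the softmax normalizer, so all descriptors receive gradient, whereas a triplet loss only touches the sampled anchor, positive, and negative descriptors.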

E. learning reliable keypoint

Reprojection and dispersity peak loss provide accurate and repeatable keypoints
Spatial properties of descriptor map are not taken into account, leading to unreliable keypoints
Reliability loss based on matching probability map introduced to address this issue
Reliability of keypoint assessed by normalized similarity map and bilinear sampling
Reliability loss defined for all valid keypoints in image A
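
The sketch below shows the bilinear sampling step used to read a reliability value at a sub-pixel location, followed by one plausible form of the loss: a score-weighted average of (1 - reliability) over the valid keypoints. That weighting and all names are assumptions for illustration; the summary does not give the exact formula.

```python
import math


def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at sub-pixel (x, y)."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the far neighbours so samples on the last row/column stay valid.
    x1 = min(x0 + 1, len(grid[0]) - 1)
    y1 = min(y0 + 1, len(grid) - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def reliability_loss(scores, reliabilities):
    """Assumed form: keypoints with high score but low reliability cost most.

    scores:        detection score of each valid keypoint in image A
    reliabilities: sampled reliability in [0, 1] of each of those keypoints
    """
    total = sum(scores)
    return sum((s / total) * (1.0 - r) for s, r in zip(scores, reliabilities))
```

Under this form, lowering the score of a keypoint whose descriptor matches poorly reduces the loss, which ties the score map to the descriptor map.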

IV. experiments

A. datasets

MegaDepth dataset includes tourist photos and 3D maps
Model is trained on the MegaDepth image pairs from DISK: 135 scenes with 63k images in total
HPatches dataset used to evaluate homography estimation
IMW2020 used for camera pose estimation
Aachen Day-Night dataset used to evaluate descriptor effectiveness

B. training details

Used DKD with window size of 5 to detect 400 keypoints
Set normalization temperatures as t_det = 0.1, t_rel = 1, and t_des = 0.02

2) training setups:

Images were cropped and resized to 480 x 480
Network was trained using ADAM optimizer
Learning rate started at 0 and warmed up to 3e-3 in 500 steps
Batch size was set to 1, but gradient was accumulated over 16 batches
Model converged on NVIDIA Titan RTX in 2 days
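
A small sketch of the schedule and accumulation described above. It assumes the warm-up is linear and that accumulated gradients are averaged rather than summed; the summary states neither, and the function names are hypothetical.

```python
def warmup_lr(step, peak_lr=3e-3, warmup_steps=500):
    """Learning rate that rises from 0 to peak_lr, then stays constant.

    Linear warm-up is an assumption; the source only says "warmed up".
    """
    return peak_lr * min(step, warmup_steps) / warmup_steps


def accumulate_gradients(per_batch_grads, accumulation=16):
    """Yield one averaged gradient for every `accumulation` batches.

    With a batch size of 1 this mimics an effective batch of 16 samples.
    """
    buffer = 0.0
    for count, grad in enumerate(per_batch_grads, start=1):
        buffer += grad / accumulation
        if count % accumulation == 0:
            yield buffer  # the optimizer step would happen here
            buffer = 0.0
```

Accumulating over 16 single-image batches keeps memory use at one image pair per forward pass while each optimizer step still sees 16 samples.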

C. evaluation metrics

Assume image A and image B have N_A and N_B co-visible keypoints
Number of co-visible keypoints is defined as N_cov = (N_A + N_B)/2
N_gt: number of ground truth keypoint pairs with reprojection distance less than 3 pixels
N_putative: number of putative matches obtained with mutual matching of descriptors
N_inlier: number of inlier matches whose reprojection distance is within a given pixel threshold
Metrics: Rep = N_gt / N_cov, MS = N_inlier / N_cov, MMA = N_inlier / N_putative, MHA = percentage of correct image corners with estimated homography matrix
Ablation studies on network architecture and loss functions
Proposed method compared to state-of-the-art methods on homography estimation, camera pose estimation, and visual (re-)localization tasks
Network complexity: number of parameters, GFLOPs, and inference FPS
Homography estimation: MMA and MHA at stricter thresholds
Camera pose estimation: mAA, GFLOPs, and PPC
Visual (re-)localization: performance with limited and unlimited features
Failure cases in image matching: extreme illumination changes and large viewpoint differences
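
The ratio metrics defined in this section (Rep, MS, MMA) and the mutual matching step can be sketched in a few lines. The function names and the similarity-matrix input are illustrative assumptions; MHA is omitted because it requires a homography estimator.

```python
def mutual_matches(similarity):
    """Mutual nearest-neighbour matching.

    similarity[i][j] is the similarity between descriptor i of image A and
    descriptor j of image B. A pair (i, j) is kept only when each is the
    other's best match.
    """
    n_a, n_b = len(similarity), len(similarity[0])
    best_b = [max(range(n_b), key=lambda j: similarity[i][j])
              for i in range(n_a)]
    best_a = [max(range(n_a), key=lambda i: similarity[i][j])
              for j in range(n_b)]
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]


def matching_metrics(n_a, n_b, n_gt, n_putative, n_inlier):
    """Repeatability, matching score and mean matching accuracy for one pair."""
    n_cov = (n_a + n_b) / 2
    return {
        "Rep": n_gt / n_cov,
        "MS": n_inlier / n_cov,
        # Guard against image pairs with no putative matches.
        "MMA": n_inlier / n_putative if n_putative else 0.0,
    }
```

Rep and MS share the denominator N_cov, so MS can never exceed the fraction of co-visible keypoints that were matched correctly, while MMA measures precision among the matches that were actually proposed.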
