Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Existing methods detect keypoints in a non-differentiable way
- Proposed method is partially differentiable and outputs accurate sub-pixel keypoints
- Reprojection loss and dispersity peak loss proposed to optimize and regularize keypoints
- Descriptors extracted in a sub-pixel way and trained with neural reprojection error loss
- Lightweight network designed for keypoint detection and descriptor extraction, runs at 95 frames per second
Paper Content
I. introduction
- Keypoints and descriptors are compact representations used in real-time visual applications.
- Early algorithms were based on limited human heuristics.
- Neural networks are explored for keypoint detection and descriptor extraction.
- Score map is used to detect keypoints, but gradients can’t be back-propagated.
- Differentiable Keypoint Detection (DKD) is proposed to optimize the position of detected keypoints.
- Neural Reprojection Error (NRE) loss is used to train the estimated dense descriptor map.
- Lightweight network is designed for efficient keypoint detection and descriptor extraction.
- Experiments show comparable performance to state-of-the-art approaches.
Ii. related works
- Patch-based deep learning methods
- Score map based deep learning methods
- Description-and-detection deep learning methods
A. patch-based methods
- Early patch-based methods only extract descriptors from image patches
- Matchnet estimates similarity of descriptors and trains them with cross entropy loss
- TFeat introduces triplet loss for patch descriptors and is widely used in later patch-based methods
- L2-Net presents progressive sampling strategy for triplet sampling
- LIFT mimics SIFT
- HardNet and SOSNet introduce hardest negative triplet and second order similarity of descriptors
- Patch-based methods only focus on descriptors extraction and have limited receptive field
B. score map based methods
- Estimate score map and descriptor map
- Tilde [18] trains score map on webcam dataset
- Quad-networks [19] trains score map by ranking scores
- KeyNet [34] estimates score map with handcrafted and learned features
- LFNet [17], SuperPoint [12], MLIFeat [31], SEKD [20] train on score map
- R2D2 [13] identifies keypoints and trains reliability
- HDD-Net [36] weights features with softargmax scores
- DISK [37] and reinforced SP [38] relax keypoint detection and descriptor matching
- DKD is similar to KeyNet [34] and utilizes softargmax
- DKD does not require handcrafted features or pseudo keypoint annotations
- NRE loss [29] used to train sub-pixel descriptors
C. description-and-detection methods
- Score map based methods use detection-then-description, while description-and-detection methods recognize keypoints in an image.
- D2Net, ASLFeat, UR2KiD and D2D all compute score maps differently.
- All of these methods use non-differentiable NMS to detect keypoints.
Iii. methods
- Network estimates score map and descriptor map from input image
- Sub-pixel keypoints and descriptors are detected and sampled from score and descriptor maps
A. network architecture
- Network is designed to be lightweight to improve running efficiency
- Feature aggregation module assembles multi-level features to retain localization and representation capabilities
- Feature extraction head estimates score map and descriptor map under original image resolution
- Image feature encoder encodes input image to feature maps with four blocks
B. differentiable keypoint detection module
- NMS is a widely used method to detect keypoints in score maps.
- NMS finds the pixels with the maximum score within local windows.
- Softargmax couples the keypoints with score map by extracting differentiable keypoints from local windows.
C. learning accurate keypoints
- Reprojection loss optimizes the position of keypoints
- Dispersity peak loss ensures that the score is maximal at the keypoint position
- Keypoints are warped from one image to another using a differentiable warp function
- Reprojection loss pulls the warped keypoint and its corresponding keypoints together
- Dispersity peak loss regularizes the scores in the local window to be “peaky”
- Dispersity peak loss takes the spatial distribution of scores into account
D. learning discriminative descriptor
- Descriptors of the same keypoints should be identical, descriptors of different keypoints should be distinct.
- Triplet loss is used to train descriptor discriminativeness.
- Triplet loss only optimizes sparse descriptors, so dense descriptor map cannot be fully constrained.
- NRE loss is used to train with dense descriptor map and provide comprehensive constraint.
E. learning reliable keypoint
- Reprojection and dispersity peak loss provide accurate and repeatable keypoints
- Spatial properties of descriptor map are not taken into account, leading to unreliable keypoints
- Reliability loss based on matching probability map introduced to address this issue
- Reliability of keypoint assessed by normalized similarity map and bilinear sampling
- Reliability loss defined for all valid keypoints in image A
A. datasets
- MegaDepth dataset includes tourist photos and 3D maps
- DISK image pairs used to train model
- 135 scenes with 63k images in total from HPatches dataset
- IMW2020 used for camera pose estimation
- Aachen Day-Night dataset used to evaluate descriptor effectiveness
B. training details
- Used DKD with window size of 5 to detect 400 keypoints
- Set normalization temperatures as t det = 0.1, t rel = 1, and t des = 0.02
2) training setups:
- Images were cropped and resized to 480 x 480
- Network was trained using ADAM optimizer
- Learning rate started at 0 and warmed up to 3e-3 in 500 steps
- Batch size was set to 1, but gradient was accumulated over 16 batches
- Model converged on NVIDIA Titan RTX in 2 days
C. evaluation metrics
- Assume image A and image B have N A and N B co-visible keypoints
- Number of co-visible keypoints is defined as N cov = (N A + N B )/2
- N gt ground truth keypoint pairs with reprojection distance less than 3 pixels
- N putative matches obtained with mutual matching of descriptors
- N inlier inlier matches acquired by assessing reprojection distance within different pixels threshold
- Metrics: Rep = N gt /N cov, M S = N inlier /N cov, M M A = N inlier /N putative, M HA = percentage of correct image corners with estimated homography matrix
- Ablation studies on network architecture and loss functions
- Proposed methods compared to state-of-the-arts on homography estimation, camera pose estimation, and visual (re-)localization tasks
- Network complexity: number of parameters, GFLOPs, and inference FPS
- Homography estimation: MMA and MHA at stricter thresholds
- Camera pose estimation: mAA, GFLOPs, and PPC
- Visual (re-)localization: performance with limited and unlimited features
- Failure cases in image matching: extreme illumination changes and large viewpoint differences