Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Advanced visual localization techniques require extracting global and local features from input images.
- SuperGF is a transformer-based aggregation model that unifies local and global features for visual localization.
- SuperGF is evaluated in terms of accuracy and efficiency, and is shown to be better than other methods.
- SuperGF can be implemented using various types of local features.
Paper Content
Introduction
- Visual localization is a key component in computer vision tasks
- Visual localization is the problem of estimating the 6 Degree-of-Freedom (DoF) camera pose from which a given image was taken relative to a reference scene representation
- Advanced visual localization approaches are hierarchical, encapsulating image retrieval problems and 6-DoF camera pose estimation
- Two types of image features are needed to perform a hierarchical visual localization approach: global image features for retrieval and local image features for image-matching
- Existing studies struggle to unify these two types of image features
- A recent study unifies global and local features for visual localization using multitask distillation
- A transformer is adopted to perform feature aggregation to bridge the feature-level gap between the two tasks
- Experiments show the advantage of the proposed model compared to existing methods
Related work
Approaches for visual localization
- Previous works on visual localization rely on estimating correspondences between 2D keypoints and 3D points in a sparse model.
- Visual localization in large-scale urban environments is approached as an image retrieval problem.
- Hierarchical localization divides the problem into a global, coarse search followed by a fine pose estimation.
- HF-Net integrates learning-based models of image retrieval and image-matching to predict keypoints and descriptors for accurate 6-DoF localization.
Global and local image features
- Hand-crafted local features were widely used in computer vision fields before the emergence of deep learning.
- Traditional aggregation methods were developed to generate global image features for image retrieval.
- Features emerging from convolutional neural networks (CNN) are robust and low-cost, but task-specific.
- Transformers have been adopted for feature extraction in computer vision fields and achieved state-of-the-art performances.
Overview
- SuperGF is a feature aggregation tool used for image matching and retrieval.
- It works with both hand-crafted and learning-based descriptors.
- It processes local features into tokens and performs position embedding.
Local feature processing
- Input image is denoted as I ∈ R H×W
- Feature map of local descriptors is denoted as F ∈ ξ
- Output of tokenizer is denoted as T raw
- Local features of input image are denoted as d i ∈ R i×d , p i ∈ R i×2 , and r i ∈ R i×1
- Initial representation of each keypoint is t i ∈ R i×d
- Clustering module is represented by function Φ(•) : R i×d → R N ×D
- Core function of module is composed of dot-product attention function ψ and MLP
Aggregate to global image features
- SuperGF uses vision transformer encoders to combine global contextual information and local features.
- Skip-connections are added between two transformer encoders.
- Output tokens are processed by MLP and layer normalization.
Training strategy
- VPR is usually treated as a binary problem
- Existing studies suggest that the similarity between images in VPR cannot be strictly defined in a binary fashion
- VPR is a problem of ranking, not classification
- AP loss is used to train the model by optimizing ranking results directly
- Attention decorrelation loss is used to reduce the spatial correlation between attention maps
Experiments
- Evaluated proposed SuperGF on benchmark datasets
- Evaluated on two tasks: VPR and visual localization
- Examined performance of global image features generated by SuperGF
Implementation details
- Latent embedding dimension of transformer encoders is 512
- Hidden dimension of MLP blocks is 1024
- Trained on MSLS training set with settings reported in section
- 10,000 query samples randomly selected for one epoch, 350 epochs in total
- AdamW optimizer, weight decay = 10-4
- Initial learning rate 10-4, final learning rate 10-6
- Input image size 640x480
Evaluation datasets
- Evaluated method on public benchmark datasets: MSLS, Pitts30k, Nordland, Tokyo247 for VPR and RobotCar Seasons dataset for localization
- Datasets contain challenging appearance variations such as day-night, weather, and season
- Images resized to 640x480 for evaluation
Metrics
- Used Recall@N metric to measure success of query images in VPR datasets
- Query image considered retrieved successfully if at least one of top N ranked reference images is within threshold distance from ground truth location
- Default threshold definitions used for all datasets
- Pose recall at position and orientation thresholds different for each sequence in RobotCar Seasons dataset
Compared methods
- Compared SuperGF to several models on VPR and visual localization tasks
- Experimented with single-pass and two-stage retrieval
- Compared with NetVLAD, SFRS, ResNet50-GeM-GCL, GeM-AP, SOLAR, TransVPR, DELG, Patch-NetVLAD, and TransVPR
- Compared with NV-SP-SG, AS, CSL, NV+SIFT, NV+SP, and HF-Net
Results and discussion
Single-pass retrieval
- Performance of single-pass retrieval based on global image features is highly relevant to the dataset.
- Global features are a compact representation of the entire image, but not suitable for high-precision retrievals.
- Transformer in feature aggregation can generate global image features on par with state-of-the-art retrieval models.
- Learned local features show better results than hand-craft local features.
Two-stage retrieval
- Robustness is improved by re-ranking based on geometry verification
- Spatial information is important for image retrieval
- Sparse local features designed for image matching perform on par or better than task-specific local features
- SuperGlue matcher shows state-of-the-art performance
Visual localization
- Pose recall is measured at different thresholds for each sequence.
- Structure-based methods (AS and CSL) perform better on easier sequences.
- Hierarchical methods perform better on more challenging sequences.
- NV+SIFT performs better than NV+SP on easier sequences, but NV+SP performs better on more challenging sequences.
Latency and memory
- Latency and scalability are important factors for real-time visual localization.
- Table 4 shows computational time and memory requirements for compared methods of extracting glocal image features.
- SuperGF shows advantages in terms of latency and scalability compared to other methods.
- Combining results of previous sections, SuperGF uses minimal resources and generates SOTA-level global image features.
- Sparse version of SuperGF causes a loss of retrieval accuracy, but has advantages in terms of feature extraction efficiency.
- SuperGF is affected by the adopted local descriptor significantly.
Summary and conclusion
- SuperGF is a transformer-based method for 6-DoF localization
- Different versions of SuperGF are available, with learning-based and hand-craft descriptors
- SuperPoint is recommended to be used with SuperGF
- Dense and sparse versions of SuperGF are available
- Training data includes one query image, one positive sample, and α + β negatives
- Results show accuracy and efficiency of SuperGF