Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Advanced visual localization techniques require extracting global and local features from input images.
SuperGF is a transformer-based aggregation model that unifies local and global features for visual localization.
SuperGF is evaluated for both accuracy and efficiency, and is reported to outperform the compared methods.
SuperGF can be implemented using various types of local features.

Paper Content

Introduction

Visual localization is a key component in computer vision tasks
Visual localization is the problem of estimating the 6 Degree-of-Freedom (DoF) camera pose from which a given image was taken relative to a reference scene representation
Advanced visual localization approaches are hierarchical, encapsulating image retrieval problems and 6-DoF camera pose estimation
Two types of image features are needed to perform a hierarchical visual localization approach: global image features for retrieval and local image features for image-matching
Existing studies struggle to unify these two types of image features
A recent study unifies global and local features for visual localization using multitask distillation
A transformer is adopted to perform feature aggregation to bridge the feature-level gap between the two tasks
Experiments show the advantage of the proposed model compared to existing methods

Related work

Approaches for visual localization

Previous works on visual localization rely on estimating correspondences between 2D keypoints and 3D points in a sparse model.
Visual localization in large-scale urban environments is approached as an image retrieval problem.
Hierarchical localization divides the problem into a global, coarse search followed by a fine pose estimation.
HF-Net integrates learning-based models of image retrieval and image-matching to predict keypoints and descriptors for accurate 6-DoF localization.

Global and local image features

Hand-crafted local features were widely used in computer vision fields before the emergence of deep learning.
Traditional aggregation methods were developed to generate global image features for image retrieval.
Features emerging from convolutional neural networks (CNN) are robust and low-cost, but task-specific.
Transformers have been adopted for feature extraction in computer vision fields and achieved state-of-the-art performances.

Overview

SuperGF is a feature aggregation tool used for image matching and retrieval.
It works with both hand-crafted and learning-based descriptors.
It converts local features into tokens and applies positional embedding.

Local feature processing

Input image is denoted as $I \in \mathbb{R}^{H \times W}$
Feature map of local descriptors is denoted as $F \in \xi$
Output of tokenizer is denoted as $T_{\text{raw}}$
Local features of input image are denoted as $d_i \in \mathbb{R}^{i \times d}$, $p_i \in \mathbb{R}^{i \times 2}$, and $r_i \in \mathbb{R}^{i \times 1}$
Initial representation of each keypoint is $t_i \in \mathbb{R}^{i \times d}$
Clustering module is represented by function $\Phi(\cdot) : \mathbb{R}^{i \times d} \to \mathbb{R}^{N \times D}$
Core function of module is composed of dot-product attention function $\psi$ and MLP
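
The clustering step above maps a variable number of keypoint tokens to a fixed number of output tokens. A minimal sketch of that idea, assuming the N outputs come from N query vectors attending over the tokens with scaled dot-product attention, might look like the following. The names `attention_pool` and `queries` are hypothetical, the MLP is omitted, and the output dimension is kept equal to the input dimension for brevity, so this is an illustration rather than the paper's module.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(queries, tokens):
    """Pool i tokens (i x d) into N tokens (N x d) with dot-product attention.

    `queries` stands in for learned cluster queries (an assumption).
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        # Weighted average of the tokens, one value per feature dimension.
        pooled.append([sum(w * t[k] for w, t in zip(weights, tokens)) for k in range(d)])
    return pooled

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # i = 3 keypoint tokens, d = 2
queries = [[5.0, 0.0], [0.0, 5.0]]              # N = 2 hypothetical cluster queries
clusters = attention_pool(queries, tokens)      # N x d output
```

Because the output size depends only on the number of queries, images with different keypoint counts yield tokens of the same shape.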

Aggregate to global image features

SuperGF uses vision transformer encoders to combine global contextual information and local features.
Skip-connections are added between two transformer encoders.
Output tokens are processed by MLP and layer normalization.
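
As a rough illustration of the last step, the sketch below applies layer normalization to each output token and concatenates the results into one vector. The MLP is omitted, the layer normalization has no learnable scale or shift, and the final L2 normalization is an assumption that is common for retrieval descriptors but is not stated in this summary. The function names are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    # Normalize one token to zero mean and unit variance across its features.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def to_global_descriptor(tokens):
    # Concatenate the normalized tokens into a single flat vector.
    flat = [v for tok in tokens for v in layer_norm(tok)]
    # L2-normalize so that dot products act as cosine similarities (assumption).
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

desc = to_global_descriptor([[1.0, 2.0, 3.0], [4.0, 0.0, -4.0]])
```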

Training strategy

Visual place recognition (VPR) is usually treated as a binary problem
Existing studies suggest that the similarity between images in VPR cannot be strictly defined in a binary fashion
VPR is a problem of ranking, not classification
AP loss is used to train the model by optimizing ranking results directly
Attention decorrelation loss is used to reduce the spatial correlation between attention maps
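
To make the two objectives concrete, the sketch below computes plain average precision for one ranked list and a simple decorrelation penalty. Both are illustrations under assumptions: a trainable AP loss needs a differentiable approximation of this quantity, and the penalty here is the mean pairwise cosine similarity between flattened attention maps, which may differ from the paper's exact formulation. The function names are hypothetical.

```python
import math

def average_precision(scores, labels):
    # Rank references by descending similarity to the query.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def decorrelation_penalty(attn_maps):
    # Mean cosine similarity over all pairs of flattened attention maps.
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    pairs = [(i, j) for i in range(len(attn_maps)) for j in range(i + 1, len(attn_maps))]
    return sum(cosine(attn_maps[i], attn_maps[j]) for i, j in pairs) / len(pairs)

ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
overlap = decorrelation_penalty([[1.0, 0.0], [0.0, 1.0]])
```

A ranking objective rewards placing every positive ahead of the negatives, while the penalty is lowest when the attention maps focus on different image regions.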

Experiments

Evaluated proposed SuperGF on benchmark datasets
Evaluated on two tasks: VPR and visual localization
Examined performance of global image features generated by SuperGF

Implementation details

Latent embedding dimension of transformer encoders is 512
Hidden dimension of MLP blocks is 1024
Trained on MSLS training set with settings reported in section
10,000 query samples randomly selected for one epoch, 350 epochs in total
AdamW optimizer, weight decay = $10^{-4}$
Initial learning rate $10^{-4}$, final learning rate $10^{-6}$
Input image size 640x480
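
The summary gives the initial and final learning rates and the number of epochs, but not the shape of the decay. The sketch below assumes a cosine decay purely for illustration. The schedule type and the names `EPOCHS`, `LR_START`, `LR_END`, and `learning_rate` are assumptions, not details taken from the paper.

```python
import math

EPOCHS = 350       # total epochs, as listed above
LR_START = 1e-4    # initial learning rate, as listed above
LR_END = 1e-6      # final learning rate, as listed above

def learning_rate(epoch):
    # Cosine decay from LR_START at epoch 0 to LR_END at the last epoch (assumed shape).
    progress = epoch / (EPOCHS - 1)
    return LR_END + 0.5 * (LR_START - LR_END) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(EPOCHS)]
```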

Evaluation datasets

Evaluated method on public benchmark datasets: MSLS, Pitts30k, Nordland, and Tokyo 24/7 for VPR, and the RobotCar Seasons dataset for localization
Datasets contain challenging appearance variations such as day-night, weather, and season
Images resized to 640x480 for evaluation

Metrics

Used Recall@N metric to measure success of query images in VPR datasets
Query image considered retrieved successfully if at least one of top N ranked reference images is within threshold distance from ground truth location
Default threshold definitions used for all datasets
Pose recall at position and orientation thresholds different for each sequence in RobotCar Seasons dataset
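
The Recall@N definition above can be sketched as follows, assuming positions are 2D coordinates in meters and each query comes with its retrieved reference positions in ranked order. The function name, data layout, and the 25 m threshold in the example are illustrative assumptions, since each dataset defines its own threshold.

```python
import math

def recall_at_n(retrieved, truths, n, threshold):
    """Fraction of queries with at least one top-n reference within `threshold`."""
    hits = 0
    for refs, truth in zip(retrieved, truths):
        # A query succeeds if any of its first n references is close enough.
        if any(math.dist(ref, truth) <= threshold for ref in refs[:n]):
            hits += 1
    return hits / len(truths)

retrieved = [
    [(100.0, 0.0), (3.0, 4.0)],    # second result is 5 m from the ground truth
    [(50.0, 50.0), (60.0, 60.0)],  # no result near the ground truth
]
truths = [(0.0, 0.0), (0.0, 0.0)]
r1 = recall_at_n(retrieved, truths, n=1, threshold=25.0)
r2 = recall_at_n(retrieved, truths, n=2, threshold=25.0)
```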

Compared methods

Compared SuperGF to several models on VPR and visual localization tasks
Experimented with single-pass and two-stage retrieval
Compared with NetVLAD, SFRS, ResNet50-GeM-GCL, GeM-AP, SOLAR, TransVPR, DELG, and Patch-NetVLAD
Compared with NV-SP-SG, AS, CSL, NV+SIFT, NV+SP, and HF-Net

Results and discussion

Single-pass retrieval

Performance of single-pass retrieval based on global image features depends strongly on the dataset.
Global features are a compact representation of the entire image, but not suitable for high-precision retrievals.
Transformer in feature aggregation can generate global image features on par with state-of-the-art retrieval models.
Learned local features show better results than hand-crafted local features.

Two-stage retrieval

Robustness is improved by re-ranking based on geometry verification
Spatial information is important for image retrieval
Sparse local features designed for image matching perform on par or better than task-specific local features
SuperGlue matcher shows state-of-the-art performance
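
Two-stage retrieval can be sketched as a shortlist by global-feature similarity followed by re-ranking with a geometric score. In the sketch below, `count_inliers` is a hypothetical callable that returns the number of geometrically verified local matches for a database index. It stands in for a real matcher plus verification step and is not an API from the paper or from any library.

```python
def two_stage_retrieve(query_global, db_globals, count_inliers, shortlist_size):
    # Stage 1: rank the database by dot-product similarity of global features.
    sims = [sum(a * b for a, b in zip(query_global, g)) for g in db_globals]
    order = sorted(range(len(db_globals)), key=lambda i: sims[i], reverse=True)
    shortlist = order[:shortlist_size]
    # Stage 2: re-rank only the shortlist by verified local-match count.
    return sorted(shortlist, key=count_inliers, reverse=True)

db_globals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
inliers = {0: 12, 1: 40, 2: 99}   # made-up inlier counts for illustration
reranked = two_stage_retrieve([1.0, 0.0], db_globals, lambda i: inliers[i], shortlist_size=2)
```

Index 2 has the most inliers in this toy data but never reaches the second stage, which shows why the quality of the global features bounds the final result.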

Visual localization

Pose recall is measured at different thresholds for each sequence.
Structure-based methods (AS and CSL) perform better on easier sequences.
Hierarchical methods perform better on more challenging sequences.
NV+SIFT performs better than NV+SP on easier sequences, but NV+SP performs better on more challenging sequences.
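
Pose recall counts a query as localized when both its position error and its orientation error fall under a threshold pair. The sketch below assumes each pose is a translation tuple plus a 3x3 rotation matrix, and uses the standard relation between the trace of the relative rotation and its angle. The function names and the threshold values in the example are illustrative assumptions.

```python
import math

def rotation_error_deg(r_est, r_gt):
    # trace(R_est^T R_gt) equals the sum of elementwise products of the two matrices.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def pose_recall(poses_est, poses_gt, max_pos_m, max_rot_deg):
    ok = 0
    for (t_est, r_est), (t_gt, r_gt) in zip(poses_est, poses_gt):
        pos_ok = math.dist(t_est, t_gt) <= max_pos_m
        rot_ok = rotation_error_deg(r_est, r_gt) <= max_rot_deg
        ok += pos_ok and rot_ok
    return ok / len(poses_gt)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c, s = math.cos(math.radians(10.0)), math.sin(math.radians(10.0))
rot_z_10 = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]   # 10 degree rotation about z

poses_gt = [((0.0, 0.0, 0.0), identity), ((0.0, 0.0, 0.0), identity)]
poses_est = [((0.0, 0.0, 0.0), identity), ((0.1, 0.0, 0.0), rot_z_10)]
strict = pose_recall(poses_est, poses_gt, max_pos_m=0.25, max_rot_deg=2.0)
loose = pose_recall(poses_est, poses_gt, max_pos_m=5.0, max_rot_deg=15.0)
```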

Latency and memory

Latency and scalability are important factors for real-time visual localization.
Table 4 shows computational time and memory requirements for the compared methods of extracting global image features.
SuperGF shows advantages in terms of latency and scalability compared to other methods.
Combining results of previous sections, SuperGF uses minimal resources and generates SOTA-level global image features.
Sparse version of SuperGF causes a loss of retrieval accuracy, but has advantages in terms of feature extraction efficiency.
The performance of SuperGF depends significantly on the local descriptor adopted.

Summary and conclusion

SuperGF is a transformer-based method for 6-DoF localization
Different versions of SuperGF are available, with learning-based and hand-crafted descriptors
SuperPoint is the recommended local descriptor to use with SuperGF
Dense and sparse versions of SuperGF are available
Training data includes one query image, one positive sample, and α + β negatives
Results show accuracy and efficiency of SuperGF
