Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Advanced visual localization techniques require extracting global and local features from input images.
  • SuperGF is a transformer-based aggregation model that unifies local and global features for visual localization.
  • SuperGF is evaluated in terms of accuracy and efficiency, and is shown to be better than other methods.
  • SuperGF can be implemented using various types of local features.

Paper Content

Introduction

  • Visual localization is a key component in computer vision tasks
  • Visual localization is the problem of estimating the 6 Degree-of-Freedom (DoF) camera pose from which a given image was taken relative to a reference scene representation
  • Advanced visual localization approaches are hierarchical, encapsulating image retrieval problems and 6-DoF camera pose estimation
  • Two types of image features are needed to perform a hierarchical visual localization approach: global image features for retrieval and local image features for image-matching
  • Existing studies struggle to unify these two types of image features
  • A recent study unifies global and local features for visual localization using multitask distillation
  • A transformer is adopted to perform feature aggregation to bridge the feature-level gap between the two tasks
  • Experiments show the advantage of the proposed model compared to existing methods

Approaches for visual localization

  • Previous works on visual localization rely on estimating correspondences between 2D keypoints and 3D points in a sparse model.
  • Visual localization in large-scale urban environments is approached as an image retrieval problem.
  • Hierarchical localization divides the problem into a global, coarse search followed by a fine pose estimation.
  • HF-Net integrates learning-based models of image retrieval and image-matching to predict keypoints and descriptors for accurate 6-DoF localization.
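
The coarse stage of a hierarchical pipeline can be sketched as a nearest-neighbour search over global descriptors. This is a minimal illustration, not HF-Net's or SuperGF's implementation: the function names `cosine_similarity` and `retrieve_top_k` and the toy 3-D descriptors are hypothetical. The fine stage (local feature matching followed by 6-DoF pose estimation) is indicated only by a comment.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_desc, reference_descs, k):
    """Coarse stage: rank reference images by global-descriptor similarity."""
    scored = [(cosine_similarity(query_desc, d), idx)
              for idx, d in enumerate(reference_descs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy global descriptors for four reference images (hypothetical values).
reference_descs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_desc = [1.0, 0.0, 0.0]

candidates = retrieve_top_k(query_desc, reference_descs, k=2)
# Fine stage (not shown): match local features between the query and each
# candidate, then estimate the camera pose from the 2D-3D correspondences.
```

Restricting the expensive local matching to the `k` retrieved candidates is what makes the hierarchical approach scale to large reference sets.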

Global and local image features

  • Hand-crafted local features were widely used in computer vision fields before the emergence of deep learning.
  • Traditional aggregation methods were developed to generate global image features for image retrieval.
  • Features emerging from convolutional neural networks (CNN) are robust and low-cost, but task-specific.
  • Transformers have been adopted for feature extraction in computer vision fields and achieved state-of-the-art performances.

Overview

  • SuperGF is a feature aggregation tool used for image matching and retrieval.
  • It works with both hand-crafted and learning-based descriptors.
  • It processes local features into tokens and performs position embedding.

Local feature processing

  • Input image is denoted as $I \in \mathbb{R}^{H \times W}$
  • Feature map of local descriptors is denoted as F ∈ ξ
  • Output of tokenizer is denoted as $T_{raw}$
  • Local features of input image are denoted as $d_i \in \mathbb{R}^{i \times d}$, $p_i \in \mathbb{R}^{i \times 2}$, and $r_i \in \mathbb{R}^{i \times 1}$
  • Initial representation of each keypoint is $t_i \in \mathbb{R}^{i \times d}$
  • Clustering module is represented by function $\Phi(\cdot) : \mathbb{R}^{i \times d} \to \mathbb{R}^{N \times D}$
  • Core function of module is composed of dot-product attention function $\psi$ and MLP
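
To make the role of dot-product attention in the clustering step concrete, the sketch below reduces a variable number of keypoint tokens to a fixed number N of output tokens. It rests on stated assumptions rather than the paper's exact design: the standard scaled form softmax(QK^T / sqrt(d))V is used, the N cluster queries are hypothetical fixed vectors (a trained model would learn them), and the MLP that maps dimension d to D is omitted, so here D = d.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three keypoint tokens of dimension d = 2 (toy values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two hypothetical cluster queries (N = 2); assumed learned in practice.
cluster_queries = [[1.0, 0.0], [0.0, 1.0]]

# Each output token is a weighted average of the input tokens.
clustered = dot_product_attention(cluster_queries, tokens, tokens)
```

Because the number of output tokens is set by the number of queries, the same module can accept images with different keypoint counts.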

Aggregate to global image features

  • SuperGF uses vision transformer encoders to combine global contextual information and local features.
  • Skip-connections are added between two transformer encoders.
  • Output tokens are processed by MLP and layer normalization.

Training strategy

  • Visual place recognition (VPR) is usually treated as a binary problem
  • Existing studies suggest that the similarity between images in VPR cannot be strictly defined in a binary fashion
  • VPR is a problem of ranking, not classification
  • AP loss is used to train the model by optimizing ranking results directly
  • Attention decorrelation loss is used to reduce the spatial correlation between attention maps
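
The quantity behind an AP loss is the average precision of a ranked list, sketched below. This is the plain, non-differentiable metric, shown only to illustrate what "optimizing ranking results directly" targets; a training loss would need a differentiable approximation, which is not reproduced here. The function name `average_precision` is illustrative.

```python
def average_precision(ranked_relevance):
    """AP of a ranked list.

    ranked_relevance[i] is 1 if the i-th ranked result is a true match
    for the query and 0 otherwise.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

# Matches ranked first give a perfect score; matches ranked last do not.
good = average_precision([1, 1, 0, 0])
bad = average_precision([0, 0, 1, 1])
```

Unlike a binary classification loss, AP changes whenever the relative order of matches and non-matches changes, which is why it suits a ranking formulation.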

Experiments

  • Evaluated proposed SuperGF on benchmark datasets
  • Evaluated on two tasks: VPR and visual localization
  • Examined performance of global image features generated by SuperGF

Implementation details

  • Latent embedding dimension of transformer encoders is 512
  • Hidden dimension of MLP blocks is 1024
  • Trained on MSLS training set with settings reported in section
  • 10,000 query samples randomly selected for one epoch, 350 epochs in total
  • AdamW optimizer, weight decay = $10^{-4}$
  • Initial learning rate $10^{-4}$, final learning rate $10^{-6}$
  • Input image size 640x480
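
The learning-rate range above can be turned into a per-epoch schedule. The summary gives only the start value, the end value, and the number of epochs; the geometric (exponential) decay used below is an assumption made for illustration, and `lr_at_epoch` is a hypothetical helper, not part of any library.

```python
def lr_at_epoch(epoch, total_epochs=350, lr_start=1e-4, lr_end=1e-6):
    """Geometric interpolation from lr_start (epoch 0) to lr_end (last epoch).

    The decay shape is assumed; only the endpoints and epoch count
    come from the reported settings.
    """
    progress = epoch / (total_epochs - 1)
    return lr_start * (lr_end / lr_start) ** progress

# One learning rate per epoch, for 350 epochs.
schedule = [lr_at_epoch(e) for e in range(350)]
```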

Evaluation datasets

  • Evaluated method on public benchmark datasets: MSLS, Pitts30k, Nordland, Tokyo247 for VPR and RobotCar Seasons dataset for localization
  • Datasets contain challenging appearance variations such as day-night, weather, and season
  • Images resized to 640x480 for evaluation

Metrics

  • Used Recall@N metric to measure success of query images in VPR datasets
  • Query image considered retrieved successfully if at least one of top N ranked reference images is within threshold distance from ground truth location
  • Default threshold definitions used for all datasets
  • For the RobotCar Seasons dataset, pose recall is reported at position and orientation thresholds that differ for each sequence
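
The Recall@N definition above can be written out directly. In this sketch the 25 m threshold and the distances are illustrative values, not the datasets' official settings, and `recall_at_n` is a hypothetical helper name.

```python
def recall_at_n(ranked_distances, n, threshold):
    """Fraction of queries with at least one top-n reference within threshold.

    ranked_distances[q][k] is the ground-truth distance (e.g. in metres)
    between query q and its k-th ranked reference image.
    """
    successes = sum(
        1 for dists in ranked_distances
        if any(d <= threshold for d in dists[:n])
    )
    return successes / len(ranked_distances)

# Toy retrieval results for three queries (illustrative distances).
ranked_distances = [
    [3.0, 80.0, 120.0],   # correct match at rank 1
    [60.0, 10.0, 200.0],  # correct match at rank 2
    [90.0, 70.0, 150.0],  # no correct match in the top 3
]

r1 = recall_at_n(ranked_distances, 1, 25.0)
r2 = recall_at_n(ranked_distances, 2, 25.0)
r3 = recall_at_n(ranked_distances, 3, 25.0)
```

Recall@N can only stay the same or grow as N increases, since a query that succeeds within the top N also succeeds within the top N+1.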

Compared methods

  • Compared SuperGF to several models on VPR and visual localization tasks
  • Experimented with single-pass and two-stage retrieval
  • Compared with NetVLAD, SFRS, ResNet50-GeM-GCL, GeM-AP, SOLAR, DELG, Patch-NetVLAD, and TransVPR
  • Compared with NV-SP-SG, AS, CSL, NV+SIFT, NV+SP, and HF-Net

Results and discussion

Single-pass retrieval

  • Performance of single-pass retrieval based on global image features depends strongly on the dataset.
  • Global features are a compact representation of the entire image, but not suitable for high-precision retrieval.
  • Using a transformer for feature aggregation can generate global image features on par with state-of-the-art retrieval models.
  • Learned local features show better results than hand-crafted local features.

Two-stage retrieval

  • Robustness is improved by re-ranking based on geometry verification
  • Spatial information is important for image retrieval
  • Sparse local features designed for image matching perform on par with or better than task-specific local features
  • SuperGlue matcher shows state-of-the-art performance
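
Two-stage retrieval can be sketched as a re-ordering of the first-stage candidates by the strength of their geometric verification. The sketch assumes the number of verified inlier matches is used as the re-ranking score; the image identifiers and inlier counts are hypothetical, and the matcher that would produce the counts is not shown.

```python
def rerank(candidates, inlier_counts):
    """Re-order retrieval candidates by geometrically verified match count.

    sorted() is stable, so candidates with equal counts keep the order
    given by the first-stage global retrieval.
    """
    return sorted(candidates, key=lambda c: inlier_counts[c], reverse=True)

# Candidate order from global-feature retrieval (hypothetical identifiers).
candidates = ["ref_07", "ref_12", "ref_03"]
# Inlier matches per candidate after geometric verification (hypothetical).
inlier_counts = {"ref_07": 14, "ref_12": 95, "ref_03": 41}

reranked = rerank(candidates, inlier_counts)
```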

Visual localization

  • Pose recall is measured at different thresholds for each sequence.
  • Structure-based methods (AS and CSL) perform better on easier sequences.
  • Hierarchical methods perform better on more challenging sequences.
  • NV+SIFT performs better than NV+SP on easier sequences, but NV+SP performs better on more challenging sequences.
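
Pose recall at a set of thresholds can be computed as below. The three threshold pairs, (0.25 m, 2°), (0.5 m, 5°), and (5 m, 10°), are those commonly used with the RobotCar Seasons benchmark; they are an assumption here rather than values taken from the text, and the pose errors are made up for illustration.

```python
def pose_recall(errors, thresholds):
    """Fraction of queries localized within each (position, orientation) bound.

    errors: list of (position_error_m, orientation_error_deg) per query.
    thresholds: list of (max_position_m, max_orientation_deg) pairs.
    """
    recalls = []
    for max_pos, max_rot in thresholds:
        ok = sum(1 for pos, rot in errors
                 if pos <= max_pos and rot <= max_rot)
        recalls.append(ok / len(errors))
    return recalls

# Assumed benchmark thresholds, from strictest to loosest.
thresholds = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]
# Hypothetical pose errors for four query images.
errors = [(0.1, 1.0), (0.4, 3.0), (2.0, 8.0), (12.0, 30.0)]

recalls = pose_recall(errors, thresholds)
```

A query counts as localized only if both its position error and its orientation error fall within the bound.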

Latency and memory

  • Latency and scalability are important factors for real-time visual localization.
  • Table 4 shows computational time and memory requirements for the compared methods of extracting global image features.
  • SuperGF shows advantages in terms of latency and scalability compared to other methods.
  • Taken together with the results of the previous sections, SuperGF uses minimal resources while generating SOTA-level global image features.
  • Sparse version of SuperGF causes a loss of retrieval accuracy, but has advantages in terms of feature extraction efficiency.
  • The performance of SuperGF depends significantly on the adopted local descriptor.

Summary and conclusion

  • SuperGF is a transformer-based method for 6-DoF localization
  • Different versions of SuperGF are available, with learning-based and hand-crafted descriptors
  • SuperPoint is the recommended local feature to use with SuperGF
  • Dense and sparse versions of SuperGF are available
  • Training data includes one query image, one positive sample, and α + β negatives
  • Results show accuracy and efficiency of SuperGF