Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Handcrafted descriptors are rotation invariant, but deep matchers are not.
  • Deep matchers use data augmentation to obtain rotation invariance, but this is not always effective.
  • RoITr is a Rotation-Invariant Transformer to cope with pose variations in point cloud matching.
  • RoITr uses an attention mechanism with PPF-based coordinates to create a pose-invariant geometry.
  • RoITr also uses a global transformer with rotation-invariant cross-frame spatial awareness.
  • RoITr outperforms existing methods in low-overlapping scenarios.

Paper Content

Introduction

  • Correspondence estimation between partially-overlapping point clouds is a core computer vision task
  • Geometry is encoded into descriptors and correspondences are established by matching descriptors
  • Pose-invariance is key to success in point cloud matching
  • Handcrafted local descriptors were designed to be rotation-invariant
  • Deep neural models for 3D point analysis have been developed
  • Augmented training is used to ensure pose-invariance
  • Rotation-Invariant Transformer (RoITr) proposed to tackle point cloud matching under arbitrary pose variations
  • RoITr uses Point Pair Features (PPFs) as local coordinate representation
  • Local attention mechanism to learn pure local geometry regardless of poses
  • Attention-based layers to compose encoder-decoder architecture for rotation-invariant geometry encoding
  • Global transformer with rotation-invariant cross-frame position awareness to enhance feature distinctiveness
  • Mainstream deep learning-based point cloud matching approaches are rotation-sensitive
  • Pioneering methods learn to describe local patches from rotation-variant input
  • Some methods use handcrafted descriptors to align input to a canonical representation
  • Other deep learning-based methods are designed to be intrinsically rotation-invariant
  • A common problem of rotation-invariant methods is that their features are less distinctive

Method

  • RoITr is an encoder-decoder architecture for geometry encoding
  • RoITr uses a stack of global transformers for global context aggregation
  • Correspondences are extracted by matching features in a coarse-to-fine manner

Problem statement

  • Problem of matching two partially overlapping point clouds
  • Aim to extract a correspondence set minimizing a certain equation
  • Euclidean norm and set cardinality used
  • Ground-truth mapping function maps points in P to corresponding positions in Q
  • Rigid scenarios use transformation T in SE(3)
  • Non-rigid cases use per-point flow f_i ∈ R^3
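The summary above does not spell out the objective, so the following is only a sketch under an assumed form: the mean Euclidean distance between mapped source points and their matched target points, divided by the set cardinality. The function name `mean_correspondence_error` and the toy data are hypothetical and used for illustration only.

```python
import numpy as np

def mean_correspondence_error(mapped_P, Q, corr):
    """Mean Euclidean residual over a correspondence set.

    mapped_P: (n, 3) source points after the ground-truth mapping,
    Q: (m, 3) target points, corr: list of index pairs (i, j).
    """
    corr = np.asarray(corr)
    residuals = np.linalg.norm(mapped_P[corr[:, 0]] - Q[corr[:, 1]], axis=1)
    return residuals.sum() / len(corr)  # divide by the set cardinality

# Rigid case: the mapping is a single transformation (R, t) in SE(3).
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])  # 90 degree rotation about z
t = np.array([0.5, 0.0, 0.0])
P = np.eye(3)
Q = P @ R.T + t
mapped_rigid = P @ R.T + t

# Non-rigid case: the mapping adds a per-point flow vector f_i.
flow = np.full((3, 3), 0.1)
mapped_nonrigid = P + flow

good = [(0, 0), (1, 1), (2, 2)]
bad = [(0, 1), (1, 2), (2, 0)]
err_good = mean_correspondence_error(mapped_rigid, Q, good)
err_bad = mean_correspondence_error(mapped_rigid, Q, bad)
```

Under this assumed objective, a correct correspondence set drives the error to zero, while mismatched pairs leave a residual on the order of the point spacing.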

PPFTrans for geometry description

  • PPFTrans consumes a triplet P = (P, N, X), with P a point cloud, N the normals estimated from P, and X the initial point features.
  • Output is a superpoint triplet P′ = (P′, N′, X′) with n′ superpoints and X′ ∈ R^{n′×c′}.
  • The Anchor triplet is P^A = (P^A, N^A, X^A) with n^A points and X^A ∈ R^{n^A×c^A}; the Supporter triplet is P^S = (P^S, N^S, X^S) with n^S points and X^S ∈ R^{n^S×c^S}.
  • Constructs pose-agnostic local coordinate representation based on PPFs.
  • PPF Attention Module (PAM) generates Anchor features X^A by aggregating pose-agnostic local geometry and learned context from P^S.
  • Encoder consists of an Attentional Abstraction Layer (AAL) followed by e × PPF Attention Layers (PALs).
  • Decoder consists of a Transition Up Layer (TUL) followed by d × PALs.
  • The encoder output is the Anchor triplet P^A; the decoder output is the Supporter triplet P^S.
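To illustrate why PPF-based local coordinates are pose-agnostic, here is a minimal sketch of the classic four-dimensional Point Pair Feature (distance plus three angles) between an anchor and its neighbors. Whether PPFTrans uses exactly this parameterization is an assumption; the function names `angle` and `point_pair_features` and the random test data are hypothetical.

```python
import numpy as np

def angle(u, v):
    """Angle in [0, pi] between row vectors, computed stably with arctan2."""
    cross = np.linalg.norm(np.cross(u, v), axis=-1)
    dot = np.sum(u * v, axis=-1)
    return np.arctan2(cross, dot)

def point_pair_features(anchor, anchor_normal, neighbors, neighbor_normals):
    """Return (k, 4) features: (||d||, angle(n_a, d), angle(n_k, d), angle(n_a, n_k))."""
    d = neighbors - anchor                      # offsets from anchor to neighbors
    n_a = np.broadcast_to(anchor_normal, d.shape)
    return np.stack([
        np.linalg.norm(d, axis=-1),
        angle(n_a, d),
        angle(neighbor_normals, d),
        angle(n_a, neighbor_normals),
    ], axis=-1)

rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 3))
nrm = rng.normal(size=(6, 3))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)

# Random rigid motion: orthogonal matrix from QR, forced to det = +1.
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(rot) < 0:
    rot[:, 0] *= -1.0
shift = np.array([1.0, -2.0, 0.5])
pts_moved = pts @ rot.T + shift
nrm_moved = nrm @ rot.T                         # normals rotate but do not translate

ppf_before = point_pair_features(pts[0], nrm[0], pts[1:], nrm[1:])
ppf_after = point_pair_features(pts_moved[0], nrm_moved[0], pts_moved[1:], nrm_moved[1:])
```

Because the features depend only on lengths and angles, `ppf_before` and `ppf_after` agree up to floating-point error, whereas raw relative coordinates `neighbors - anchor` would change with the rotation.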

Global transformer for context aggregation

  • Previous works have disentangled self- and cross-attention as individual modules
  • We couple them together as a global transformer
  • Input is a pair of triplets P and Q
  • Geometry-Aware Self-Attention Module mines geometric cues and aggregates global context
  • Position-Aware Cross-Attention Module incorporates rotation-invariant spatial representation
  • Attention matrix is computed via row-wise softmax
  • Output is a pair of triplets P and Q with enhanced features
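The row-wise softmax step can be sketched with generic scaled dot-product cross-attention. This is not the paper's Position-Aware Cross-Attention Module: the rotation-invariant spatial term is omitted, and the projection matrices `W_q`, `W_k`, `W_v` and all shapes are hypothetical.

```python
import numpy as np

def row_softmax(scores):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)

def cross_attention(X_p, X_q, W_q, W_k, W_v):
    """Features of frame P attend to features of frame Q."""
    queries = X_p @ W_q
    keys = X_q @ W_k
    values = X_q @ W_v
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    attn = row_softmax(scores)        # each row is a distribution over Q's superpoints
    return attn @ values, attn

rng = np.random.default_rng(0)
X_p = rng.normal(size=(4, 16))        # 4 superpoints in P, 16-dim features
X_q = rng.normal(size=(6, 16))        # 6 superpoints in Q
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out, attn = cross_attention(X_p, X_q, W_q, W_k, W_v)
```

Each row of `attn` sums to one, so every superpoint in P receives a convex combination of the value vectors from Q.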

Point matching and loss function

  • Superpoint Matching takes two triplets of points as input
  • Coarse-to-fine matching strategy is used
  • Features are normalized and similarity is measured using a Gaussian correlation matrix
  • Point-to-node strategy is used to assign points to superpoints
  • Similarity between feature groups is calculated
  • Sinkhorn Algorithm is used to obtain normalized similarity matrix
  • Loss function is used to collect final correspondence set
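A minimal sketch of two of the steps above: a Gaussian correlation matrix on normalized features, followed by Sinkhorn normalization. The exact correlation formula is an assumption (one common form, exp of the negative squared distance), the slack rows and columns often used for unmatched points are omitted, and the one-hot toy features are hypothetical.

```python
import numpy as np

def gaussian_correlation(X, Y):
    """exp(-||x - y||^2) between L2-normalized feature rows of X and Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sq_dist = 2.0 - 2.0 * (Xn @ Yn.T)   # squared distance between unit vectors
    return np.exp(-np.maximum(sq_dist, 0.0))

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn(K, n_iters=20):
    """Alternate row and column normalization of a positive matrix in log space."""
    log_k = np.log(K)
    for _ in range(n_iters):
        log_k = log_k - logsumexp(log_k, axis=1)   # rows sum to 1
        log_k = log_k - logsumexp(log_k, axis=0)   # columns sum to 1
    return np.exp(log_k)

# Toy features: Y holds the same three descriptors as X in a different order.
X = 2.0 * np.eye(3)
Y = X[[2, 0, 1]]
K = gaussian_correlation(X, Y)
assignment = sinkhorn(K)
matches = assignment.argmax(axis=1)   # best target index for each source row
```

In this toy case `matches` recovers the permutation, since each source descriptor has exactly one identical counterpart in `Y`.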

Experiment

  • Evaluated RoITr on rigid and non-rigid benchmarks
  • Used RANSAC for pose estimation in rigid matching
  • Implementation details in Appendix
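As a rough illustration of RANSAC-based pose estimation from putative correspondences, the sketch below fits a rigid transform to random 3-point samples with the Kabsch (SVD) solution and keeps the hypothesis with the most inliers. It is a generic textbook variant, not the evaluation code of the paper; the iteration count, the inlier threshold, and the synthetic data are assumptions.

```python
import numpy as np

def kabsch(src, tgt):
    """Least-squares rigid transform (R, t) with tgt ≈ src @ R.T + t."""
    src_c, tgt_c = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - src_c).T @ (tgt - tgt_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, tgt_c - R @ src_c

def ransac_pose(src, tgt, n_iters=200, inlier_thresh=0.05, seed=0):
    """src[i] and tgt[i] are putative correspondences, possibly with outliers."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)
        R, t = kabsch(src[idx], tgt[idx])
        inliers = np.linalg.norm(src @ R.T + t - tgt, axis=1) < inlier_thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("no consistent hypothesis found")
    R, t = kabsch(src[best], tgt[best])         # refit on all inliers
    return R, t, best

# Synthetic test: 20 exact correspondences plus 5 gross outliers.
rng = np.random.default_rng(1)
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.3, -0.2, 1.0])
src = rng.normal(size=(25, 3))
tgt = src @ R_true.T + t_true
tgt[20:] += 5.0                                  # corrupt the last five matches
R_est, t_est, inliers = ransac_pose(src, tgt)
```

With a high inlier ratio, a few hundred iterations are usually more than enough; lower inlier ratios require many more samples, which is one reason a high Inlier Ratio matters for registration.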

Rigid indoor scenes: 3DMatch & 3DLoMatch

  • Dataset: 3DMatch with 46 scenes for training, 8 for validation, and 8 for testing; rotated versions of 3DMatch and 3DLoMatch are obtained by adding full-range rotations to the point cloud pairs
  • Metrics: Inlier Ratio (IR), Feature Matching Recall (FMR), and Registration Recall (RR), which counts the fraction of point cloud pairs correctly registered (RMSE < 0.2 m)
  • Comparison with State-of-the-Art: RoITr outperforms all others by a large margin on IR; its FMR significantly surpasses all others on 3DLoMatch; its RR is comparable to GeoTrans and Lepard on 3DMatch
  • Non-rigid metrics: Inlier Ratio (IR) with a threshold of 0.04 m; Non-rigid Feature Matching Recall (NFMR)
  • Non-rigid comparison with State-of-the-Art: RoITr outperforms all others in the non-rigid matching task
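The rigid metrics can be sketched as follows. The 0.2 m RMSE threshold for RR comes from the summary above; the default inlier distance of 0.1 m and the 5% inlier-ratio threshold for FMR are assumptions based on common 3DMatch practice, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def inlier_ratio(src, tgt, R, t, tau=0.1):
    """Fraction of correspondences whose residual under the ground-truth pose is below tau."""
    residuals = np.linalg.norm(src @ R.T + t - tgt, axis=1)
    return float(np.mean(residuals < tau))

def feature_matching_recall(inlier_ratios, min_ratio=0.05):
    """Fraction of point cloud pairs whose inlier ratio exceeds min_ratio."""
    return float(np.mean(np.asarray(inlier_ratios) > min_ratio))

def registration_recall(rmses, max_rmse=0.2):
    """Fraction of point cloud pairs registered with RMSE below max_rmse (meters)."""
    return float(np.mean(np.asarray(rmses) < max_rmse))

# Toy pair: four correspondences, one of which is off by 1 m.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.1, 0.2, 0.3])
src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt = src @ R.T + t
tgt[3] += np.array([1.0, 0.0, 0.0])

ir = inlier_ratio(src, tgt, R, t)
fmr = feature_matching_recall([ir, 0.02, 0.30])
rr = registration_recall([0.05, 0.50])
```

IR is computed per pair, while FMR and RR aggregate over all pairs in a benchmark, which is why a method can lead on IR and still be only comparable on RR.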

Ablation study

  • Point-Transformer (PT) leads to a sharp performance drop when replacing PPFTrans.
  • Embedding PPF-based local coordinates into PT boosts performance and makes it rotation-invariant.
  • Relative coordinates fail to work in PAM due to more efficient attention mechanism.

Conclusion

  • We introduce RoITr, a rotation-invariant model for point cloud matching
  • We propose PAM, AAL, PAL, and TUL to compose PPFTrans for geometry description
  • We enhance features with a global transformer architecture for rotation-invariant cross-frame spatial awareness
  • Experiments show superiority of our approach, especially robustness against arbitrary rotations
  • Limitations include lack of explicit occlusion handling and inability to match symmetric structures
  • Implemented with PyTorch, trained on 4 Nvidia 3090 GPUs
  • Evaluation metrics include Inlier Ratio, Feature Matching Recall, Registration Recall, and Non-Rigid Feature Matching Recall
  • Runtime analysis shows RoITr is the fastest in data preparation and in overall runtime, but the slowest in model inference