Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Handcrafted descriptors are rotation invariant, but deep matchers are not.
Deep matchers use data augmentation to obtain rotation invariance, but this is not always effective.
RoITr is a Rotation-Invariant Transformer to cope with pose variations in point cloud matching.
RoITr uses an attention mechanism with PPF-based coordinates to describe geometry in a pose-invariant way.
RoITr also uses a global transformer with rotation-invariant cross-frame spatial awareness.
RoITr outperforms existing methods in low-overlap scenarios.

Paper Content

Introduction

Correspondence estimation between partially-overlapping point clouds is a core computer vision task
Geometry is encoded into descriptors and correspondences are established by matching descriptors
Pose-invariance is key to success in point cloud matching
Handcrafted local descriptors were designed to be rotation-invariant
Deep neural models for 3D point analysis have been developed
Augmented training is used to ensure pose-invariance
Rotation-Invariant Transformer (RoITr) is proposed to tackle point cloud matching under arbitrary pose variations
RoITr uses Point Pair Features (PPFs) as local coordinate representation
Local attention mechanism to learn pure local geometry regardless of poses
Attention-based layers to compose encoder-decoder architecture for rotation-invariant geometry encoding
Global transformer with rotation-invariant cross-frame position awareness to enhance feature distinctiveness

Related work

Mainstream deep learning-based point cloud matching approaches are rotation-sensitive
Pioneering works learn to describe local patches from a rotation-variant input
Some methods use handcrafted descriptors to align the input to a canonical representation
Other deep learning-based methods are designed to be rotation-invariant
A common problem of rotation-invariant methods is less distinctive features

Method

RoITr is an encoder-decoder architecture for geometry encoding
RoITr uses a stack of global transformers for global context aggregation
Correspondences are extracted by matching features in a coarse-to-fine manner

Problem statement

The problem is to match two partially overlapping point clouds P and Q
The aim is to extract a correspondence set that minimizes the mean distance between each mapped source point and its matched target point
The objective uses the Euclidean norm and is normalized by the set cardinality
A ground-truth mapping function maps points in P to their corresponding positions in Q
Rigid scenarios use a transformation T ∈ SE(3)
Non-rigid cases use a per-point flow f_i ∈ R^3
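
The objective described above can be written out as a reconstruction from the listed ingredients (Euclidean norm, set cardinality, ground-truth mapping). The symbols C for the correspondence set and M* for the ground-truth mapping are assumed here for illustration and may differ from the paper's notation:

```latex
\mathcal{C}^{*} = \arg\min_{\mathcal{C}} \;
  \frac{1}{|\mathcal{C}|} \sum_{(\mathbf{p},\, \mathbf{q}) \in \mathcal{C}}
  \bigl\lVert \mathcal{M}^{*}(\mathbf{p}) - \mathbf{q} \bigr\rVert_{2},
\qquad
\mathcal{M}^{*}(\mathbf{p}_i) =
\begin{cases}
  \mathbf{T}(\mathbf{p}_i), \;\; \mathbf{T} \in SE(3) & \text{(rigid)} \\
  \mathbf{p}_i + \mathbf{f}_i, \;\; \mathbf{f}_i \in \mathbb{R}^{3} & \text{(non-rigid)}
\end{cases}
```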

PPFTrans for geometry description

PPFTrans consumes a triplet P = (P, N, X), with P a point cloud, N the normals estimated from P, and X the initial point features.
Output is a superpoint triplet P′ = (P′, N′, X′) with n′ superpoints and X′ ∈ R^(n′×c′).
The Anchor triplet is P^A = (P^A, N^A, X^A) with n^A points and X^A ∈ R^(n^A×c^A), and the Supporter triplet is P^S = (P^S, N^S, X^S) with n^S points and X^S ∈ R^(n^S×c^S).
Constructs pose-agnostic local coordinate representation based on PPFs.
PPF Attention Module (PAM) generates Anchor features X A by aggregating pose-agnostic local geometry and learned context from P S .
Encoder consists of an Attentional Abstraction Layer (AAL) followed by e × PPF Attention Layers (PALs).
Decoder consists of a Transition Up Layer (TUL) followed by d × PALs.
The encoder outputs its final Anchor triplet P^A as the superpoint triplet; the decoder outputs its final triplet P^S.
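
To make the pose-agnostic local coordinates concrete, here is a minimal NumPy sketch of a Point Pair Feature between one Anchor point and its Supporter points. It assumes the standard four-component PPF (distance plus three angles); the function names `angle`, `ppf`, and `random_rotation` are illustrative, not taken from the paper's code.

```python
import numpy as np


def angle(u, v):
    """Angle in [0, pi] between corresponding rows of u and v (stable atan2 form)."""
    cross = np.linalg.norm(np.cross(u, v), axis=-1)
    dot = np.sum(u * v, axis=-1)
    return np.arctan2(cross, dot)


def ppf(anchor_p, anchor_n, support_p, support_n):
    """Point Pair Features between one anchor and k supporter points.

    Returns a (k, 4) array: [distance, angle(n_a, d), angle(n_s, d), angle(n_a, n_s)],
    where d is the offset from the anchor to each supporter.
    """
    d = support_p - anchor_p                      # (k, 3) offsets
    n_a = np.broadcast_to(anchor_n, d.shape)      # repeat anchor normal per supporter
    dist = np.linalg.norm(d, axis=-1)
    return np.stack(
        [dist, angle(n_a, d), angle(support_n, d), angle(n_a, support_n)], axis=-1
    )


def random_rotation(rng):
    """Random proper rotation matrix (det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q
```

Because every component is a length or an angle, applying the same rotation and translation to the points (and the same rotation to the normals) leaves the output unchanged, which is the property the attention module relies on.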

Global transformer for context aggregation

Previous works have disentangled self- and cross-attention as individual modules
We couple them together as a global transformer
Input is a pair of triplets P and Q
Geometry-Aware Self-Attention Module mines geometric cues and aggregates global context
Position-Aware Cross-Attention Module incorporates rotation-invariant spatial representation
Attention matrix is computed via row-wise softmax
Output is a pair of triplets P and Q with enhanced features
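
The row-wise softmax step can be illustrated with a generic scaled dot-product cross-attention between the superpoint features of the two frames. This is a simplified sketch: it omits the geometric and positional embeddings of the paper's modules, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical stand-ins for learned weights.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(x_p, x_q, w_q, w_k, w_v):
    """Features of frame P attend to features of frame Q.

    x_p: (n, c) features of P, x_q: (m, c) features of Q.
    Returns the updated (n, c_out) features of P and the (n, m) attention matrix.
    """
    queries = x_p @ w_q
    keys = x_q @ w_k
    values = x_q @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = softmax_rows(scores)   # each row sums to 1 over the points of Q
    return attn @ values, attn
```

Swapping the roles of `x_p` and `x_q` gives the update for the other frame, so both triplets end up with features that are aware of the opposite point cloud.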

Point matching and loss function

Superpoint Matching takes two triplets of points as input
Coarse-to-fine matching strategy is used
Features are normalized and similarity is measured using a Gaussian correlation matrix
Point-to-node strategy is used to assign points to superpoints
Similarity between feature groups is calculated
Sinkhorn Algorithm is used to obtain normalized similarity matrix
Loss function is used to collect final correspondence set
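
The similarity and normalization steps listed above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a Gaussian correlation of the form exp(-||a - b||²) on unit-normalized features, a plain Sinkhorn iteration without slack entries, and mutual nearest-neighbor selection; all function names are hypothetical.

```python
import numpy as np


def gaussian_correlation(feat_p, feat_q):
    """Gaussian correlation exp(-||a - b||^2) between L2-normalized features."""
    feat_p = feat_p / np.linalg.norm(feat_p, axis=1, keepdims=True)
    feat_q = feat_q / np.linalg.norm(feat_q, axis=1, keepdims=True)
    # For unit vectors, ||a - b||^2 = 2 - 2 * <a, b>.
    sq_dist = np.maximum(2.0 - 2.0 * feat_p @ feat_q.T, 0.0)
    return np.exp(-sq_dist)


def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))


def sinkhorn(log_scores, n_iters=100):
    """Alternate row and column normalization in log space.

    Takes log-similarities and returns a matrix whose columns sum to 1
    and whose rows approach sum 1 (for a square input).
    """
    log_s = np.array(log_scores, dtype=float)
    for _ in range(n_iters):
        log_s = log_s - _logsumexp(log_s, axis=1)   # normalize rows
        log_s = log_s - _logsumexp(log_s, axis=0)   # normalize columns
    return np.exp(log_s)


def mutual_matches(sim):
    """Index pairs (i, j) where i and j are each other's best match."""
    row_best = sim.argmax(axis=1)
    col_best = sim.argmax(axis=0)
    rows = np.arange(sim.shape[0])
    keep = col_best[row_best] == rows
    return np.stack([rows[keep], row_best[keep]], axis=1)
```

In a coarse-to-fine pipeline, a correlation like `gaussian_correlation` would be applied to superpoint features first, and a normalization like `sinkhorn` to the point features inside each matched pair of patches.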

Experiment

Evaluated RoITr on rigid and non-rigid benchmarks
Used RANSAC for pose estimation in rigid matching
Implementation details in Appendix
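
For readers unfamiliar with the pose-estimation step, the following is a minimal sketch of RANSAC over putative correspondences with a Kabsch (SVD) rigid fit. The threshold, iteration count, and function names are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (r, t) such that dst ≈ src @ r.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))        # guard against reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = dst_c - r @ src_c
    return r, t


def ransac_pose(src, dst, threshold=0.05, n_iters=500, seed=0):
    """Estimate a rigid pose from correspondences src[i] <-> dst[i] with outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        r, t = kabsch(src[idx], dst[idx])
        residuals = np.linalg.norm(src @ r.T + t - dst, axis=1)
        inliers = residuals < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("RANSAC found fewer than 3 inliers")
    # Refit on all inliers of the best hypothesis.
    r, t = kabsch(src[best_inliers], dst[best_inliers])
    return r, t, best_inliers
```

A higher inlier ratio among the input correspondences raises the chance that a minimal sample contains only inliers, which is why correspondence quality directly affects registration.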

Rigid indoor scenes: 3DMatch & 3DLoMatch

Dataset: 3DMatch with 46 scenes for training, 8 for validation, and 8 for testing; rotated versions of 3DMatch and 3DLoMatch add full-range rotations to the point cloud pairs
Metrics: Inlier Ratio (IR); Feature Matching Recall (FMR); Registration Recall (RR), which counts the fraction of point cloud pairs correctly registered (RMSE < 0.2 m)
Comparison with State-of-the-Art: RoITr outperforms all others by a large margin on IR; its FMR significantly surpasses all others on 3DLoMatch; its RR is comparable to GeoTrans and Lepard on 3DMatch
Non-rigid metrics: Inlier Ratio (IR) with a threshold of 0.04 m; Non-rigid Feature Matching Recall (NFMR)
Comparison with State-of-the-Art (non-rigid): RoITr outperforms all others in the non-rigid matching task
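
The rigid metrics can be sketched in a few lines. This is a hedged illustration of their usual definitions: the inlier threshold `tau` is left as a parameter, the 5% inlier-ratio cutoff for FMR is an assumed default that is not stated in this summary, and the function names are hypothetical.

```python
import numpy as np


def inlier_ratio(src_corr, dst_corr, rotation, translation, tau):
    """Fraction of correspondences whose residual under the ground-truth pose is below tau."""
    residuals = np.linalg.norm(src_corr @ rotation.T + translation - dst_corr, axis=1)
    return float(np.mean(residuals < tau))


def feature_matching_recall(inlier_ratios, min_ratio=0.05):
    """Fraction of point cloud pairs whose inlier ratio exceeds min_ratio (assumed cutoff)."""
    return float(np.mean(np.asarray(inlier_ratios) > min_ratio))


def registration_recall(rmses, max_rmse=0.2):
    """Fraction of point cloud pairs registered with RMSE below max_rmse (in meters)."""
    return float(np.mean(np.asarray(rmses) < max_rmse))
```

IR is computed per point cloud pair, while FMR and RR aggregate over all pairs in the benchmark.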

Ablation study

Replacing PPFTrans with Point-Transformer (PT) leads to a sharp performance drop.
Embedding PPF-based local coordinates into PT boosts performance and makes it rotation-invariant.
Relative coordinates fail to work in PAM, which is attributed to its more efficient attention mechanism.

Conclusion

We introduce RoITr, a rotation-invariant model for point cloud matching
We propose PAM, AAL, PAL, and TUL to compose PPFTrans for geometry description
We enhance features with a global transformer architecture for rotation-invariant cross-frame spatial awareness
Experiments show superiority of our approach, especially robustness against arbitrary rotations
Limitations include lack of explicit occlusion handling and inability to match symmetric structures
Implemented with PyTorch, trained on 4 Nvidia 3090 GPUs
Evaluation metrics include Inlier Ratio, Feature Matching Recall, Registration Recall, and Non-Rigid Feature Matching Recall
Runtime analysis shows RoITr has highest data preparation and overall speed, but lowest model speed
