Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Handcrafted descriptors are rotation invariant, but deep matchers are not.
- Deep matchers use data augmentation to obtain rotation invariance, but this is not always effective.
- RoITr is a Rotation-Invariant Transformer to cope with pose variations in point cloud matching.
- RoITr uses an attention mechanism with PPF-based coordinates to create a pose-invariant geometry.
- RoITr also uses a global transformer with rotation-invariant cross-frame spatial awareness.
- RoITr outperforms existing methods in low-overlapping scenarios.
Paper Content
Introduction
- Correspondence estimation between partially-overlapping point clouds is a core computer vision task
- Geometry is encoded into descriptors and correspondences are established by matching descriptors
- Pose-invariance is key to success in point cloud matching
- Handcrafted local descriptors were designed to be rotation-invariant
- Deep neural models for 3D point analysis have been developed
- Augmented training is used to ensure pose-invariance
- Rotation-Invariant Transformer (RoITr) proposed to tackle point cloud matching under arbitrary pose variations
- RoITr uses Point Pair Features (PPFs) as local coordinate representation
- Local attention mechanism to learn pure local geometry regardless of poses
- Attention-based layers to compose encoder-decoder architecture for rotation-invariant geometry encoding
- Global transformer with rotation-invariant cross-frame position awareness to enhance feature distinctiveness
Related work
- Mainstream deep learning-based point cloud matching approaches are rotation-sensitive
- Pioneers learn to describe local patches from a rotation-variant input
- Some methods use handcrafted descriptors to align input to a canonical representation
- Deep learning-based methods are designed to be rotation-invariant
- Common problem of rotation-invariant methods is less distinctive features
Method
- RoITr is an encoder-decoder architecture for geometry encoding
- RoITr uses a stack of global transformers for global context aggregation
- Correspondences are extracted by matching features in a coarse-to-fine manner
Problem statement
- Problem of matching two partially overlapping point clouds
- Aim to extract a correspondence set minimizing a certain equation
- Euclidean norm and set cardinality used
- Ground-truth mapping function maps points in P to corresponding positions in Q
- Rigid scenarios use transformation T in SE(3)
- Non-rigid cases use per-point flow f i in R 3
Ppftrans for geometry description
- PPFTrans consumes a triplet P = (P, N, X) with P a points cloud, N the normals estimated from P, and X the initial point features.
- Output is a superpoint triplet P = (P , N , X ) with n superpoints and X ∈ R n ×c .
- Anchor triplet is P A = (P A , N A , X A ) with n A points and X A ∈ R n A ×c A , and the Supporter triplet P S = (P S , N S , X S ) with n S points and X S ∈ R n S ×c S .
- Constructs pose-agnostic local coordinate representation based on PPFs.
- PPF Attention Module (PAM) generates Anchor features X A by aggregating pose-agnostic local geometry and learned context from P S .
- Encoder consists of Attentional Abstraction Layer (AAL) followed by e×PPF Attention Layers (PALs).
- Decoder consists of Transition Up Layer (TUL) followed by d× PAL.
- Output of encoder is P := P A , output of decoder is P := P S .
Global transformer for context aggregation
- Previous works have disentangled self- and cross-attention as individual modules
- We couple them together as a global transformer
- Input is a pair of triplets P and Q
- Geometry-Aware Self-Attention Module mines geometric cues and aggregates global context
- Position-Aware Cross-Attention Module incorporates rotation-invariant spatial representation
- Attention matrix is computed via row-wise softmax
- Output is a pair of triplets P and Q with enhanced features
Point matching and loss funcion
- Superpoint Matching takes two triplets of points as input
- Coarse-to-fine matching strategy is used
- Features are normalized and similarity is measured using a Gaussian correlation matrix
- Point-to-node strategy is used to assign points to superpoints
- Similarity between feature groups is calculated
- Sinkhorn Algorithm is used to obtain normalized similarity matrix
- Loss function is used to collect final correspondence set
Experiment
- Evaluated RoITr on rigid and non-rigid benchmarks
- Used RANSAC for pose estimation in rigid matching
- Implementation details in Appendix
Rigid indoor scenes: 3dmatch & 3dlomatch
- Dataset: 3DMatch with 46 for training, 8 for validation, 8 for testing; 3DMatch and 3DLoMatch with full-range rotations added to point cloud pairs
- Metrics: Inlier Ratio (IR), Feature Matching Recall (FMR), counts fraction of point cloud pairs correctly registered (RMSE < 0.2m)
- Comparison with State-of-the-Art: RoITr outperforms all others with large margin on IR; FMR significantly surpasses all others on 3DLoMatch; RR comparable performance with GeoTrans and Lepard on 3DMatch
- Metrics: Inlier Ratio (IR) with threshold 0.04m; Non-rigid Feature Matching Recall (NFMR)
- Comparison with State-of-the-Art: RoITr outperforms all others in non-rigid matching task
Ablation study
- Point-Transformer (PT) leads to a sharp performance drop when replacing PPFTrans.
- Embedding PPF-based local coordinates into PT boosts performance and makes it rotation-invariant.
- Relative coordinates fail to work in PAM due to more efficient attention mechanism.
Conclusion
- We introduce RoITr, a rotation-invariant model for point cloud matching
- We proposed PAM, AAL, PAL, and TUL to compose PPFTrans for geometry description
- We enhanced features with a global transformer architecture for rotation-invariant cross-frame spatial awareness
- Experiments show superiority of our approach, especially robustness against arbitrary rotations
- Limitations include lack of explicit occlusion handling and inability to match symmetric structures
- Implemented with PyTorch, trained on 4 Nvidia 3090 GPUs
- Evaluation metrics include Inlier Ratio, Feature Matching Recall, Registration Recall, and Non-Rigid Feature Matching Recall
- Runtime analysis shows RoITr has highest data preparation and overall speed, but lowest model speed