Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Point Transformer V2 model proposed to overcome limitations of previous work
- Group vector attention proposed, more effective than previous version
- Grouped weight encoding layer and position encoding multiplier proposed
- Lightweight partition-based pooling methods designed for better spatial alignment and efficient sampling
- Model achieves state-of-the-art performance on 3D point cloud understanding benchmarks
Paper Content
Introduction
- Point Transformer (PTv1) introduces self-attention networks to 3D point cloud understanding
- Point Transformer V2 (PTv2) improves upon PTv1 with novel designs
- Grouped vector attention with improved position encoding
- Efficient partition-based pooling scheme
- Improved position encoding scheme to utilize point cloud coordinates better
Related works
- Image transformers use scaled dot-product self-attention and multi-head self-attention theory from NLP
- Point cloud understanding methods can be projection-based, voxel-based, or point-based
- Point cloud transformers use attention to process point clouds
- Point Transformer [1] performs local attention between each point and its adjacent points
- Point Transformer V2 proposed to improve effectiveness and efficiency of Point Transformer
Point transformer v2
Problem formulation and background
- Problem formulation: 3D point cloud scene contains points with positions and features
- Goal of point cloud semantic segmentation: predict class label for each point
- Goal of scene classification: predict class label for each scene
- Local attention: attention works within subset of points, i.e. reference point set
- Scalar attention: attention weights are scalars computed from scaled dot-product between query and key vectors
- Vector attention: attention weights are vectors that modulate individual feature channels
Grouped vector attention
- Vector attention has a large parameter size which restricts its efficiency and generalization ability.
- Grouped vector attention (GVA) divides channels of the value vector into groups and shares the same scalar attention weight from the grouped attention vector.
- GVA reduces the number of parameters compared to vector attention, leading to a more powerful and efficient model.
Position encoding multipler
- 3D point clouds have a more complicated spatial relationship than 2D images.
- Vector attention in PTv1 has a generalization limitation.
- PTv2 uses grouped vector attention to reduce overfitting and enhance generalization.
- Position encoding is improved with an additional multiplier to the relation vector.
Partition-based pooling
- Traditional sampling-based pooling procedures use a combination of sampling and query methods.
- Proposed partition-based pooling approach uses non-overlapping partitions to separate the space.
- Pooling process fuses each subset of points from a single partition.
- Unpooling process maps point feature to all points from the same subset.
Experiments
- Conducted experimental evaluations on ScanNet v2, S3DIS, and ModelNet40
- Implementation details available in appendix
- Results: SegGCN (63.5, 66.6), RandLA-Net (-58.9), KPConv (69.2, 68.6), JSENet (-69.9), FusionNet (-68.8), SparseConvNet (69.3, 72.5), MinkUNet (72.2, 73.6), PTv1
Ablation study
- Conducted ablation studies to examine effectiveness of each module in design
- Examined effects of different attention designs
- Compared grouped vector attention (GVA) with multi-head self-attention (MSA)
- Studied effects of different weight encoding functions
- Examined superiority of partition-based pooling over sampling-based pooling
- Ablated different modules in PTv2
- Baseline result increased from 70.6% to 75.4%
- Each component added to design increased mIOU
Model complexity and latency
- Conducted model complexity and latency studies to examine efficiency of design
- Compared pooling methods, found grid pooling faster and higher mIoU
- Compared model complexity, time consumption, and evaluation performances
- Found GVA with grouped weight encoding improved model performance and slightly reduced execution time
- Found grid pooling significantly sped up network and enhanced generalization ability
- Position encoding multiplier increased number of model parameters but improved performance
Conclusion
- Propose Point Transformer V2 (PTv2) for 3D point cloud understanding
- Make non-trivial improvements on Point Transformer V1
- Grouped vector attention, improved position encoding, and partition-based pooling
- State-of-the-art performance on point cloud classification and semantic segmentation
- Ablation experiments on depths of decoder blocks with different unpooling methods
- Adding attention blocks in each decoding stage improves performance
- Deeper decoders do not improve performance
- Ablation study on PE Multiplier
- Group vector attention reduces overfitting and enhances generalization
- Feature-level pooling methods: FPS-kNN, Grid-kNN, and partition-based pooling
- Benchmark with synthetic data
- Attention type, pooling method, weight encoding, and module design ablation
- Model performance and amortized latency with different pooling methods
- Data augmentation and training setting