Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Point Transformer V2 model proposed to overcome limitations of previous work
Group vector attention proposed, more effective than previous version
Grouped weight encoding layer and position encoding multiplier proposed
Lightweight partition-based pooling methods designed for better spatial alignment and efficient sampling
Model achieves state-of-the-art performance on 3D point cloud understanding benchmarks

Paper Content

Introduction

Point Transformer (PTv1) introduces self-attention networks to 3D point cloud understanding
Point Transformer V2 (PTv2) improves upon PTv1 with novel designs
Grouped vector attention with improved position encoding
Efficient partition-based pooling scheme
Improved position encoding scheme to utilize point cloud coordinates better

Image transformers use scaled dot-product self-attention and multi-head self-attention theory from NLP
Point cloud understanding methods can be projection-based, voxel-based, or point-based
Point cloud transformers use attention to process point clouds
Point Transformer [1] performs local attention between each point and its adjacent points
Point Transformer V2 proposed to improve effectiveness and efficiency of Point Transformer

Point transformer v2

Problem formulation and background

Problem formulation: 3D point cloud scene contains points with positions and features
Goal of point cloud semantic segmentation: predict class label for each point
Goal of scene classification: predict class label for each scene
Local attention: attention works within subset of points, i.e. reference point set
Scalar attention: attention weights are scalars computed from scaled dot-product between query and key vectors
Vector attention: attention weights are vectors that modulate individual feature channels

Grouped vector attention

Vector attention has a large parameter size which restricts its efficiency and generalization ability.
Grouped vector attention (GVA) divides channels of the value vector into groups and shares the same scalar attention weight from the grouped attention vector.
GVA reduces the number of parameters compared to vector attention, leading to a more powerful and efficient model.

Position encoding multipler

3D point clouds have a more complicated spatial relationship than 2D images.
Vector attention in PTv1 has a generalization limitation.
PTv2 uses grouped vector attention to reduce overfitting and enhance generalization.
Position encoding is improved with an additional multiplier to the relation vector.

Partition-based pooling

Traditional sampling-based pooling procedures use a combination of sampling and query methods.
Proposed partition-based pooling approach uses non-overlapping partitions to separate the space.
Pooling process fuses each subset of points from a single partition.
Unpooling process maps point feature to all points from the same subset.

Experiments

Conducted experimental evaluations on ScanNet v2, S3DIS, and ModelNet40
Implementation details available in appendix
Results: SegGCN (63.5, 66.6), RandLA-Net (-58.9), KPConv (69.2, 68.6), JSENet (-69.9), FusionNet (-68.8), SparseConvNet (69.3, 72.5), MinkUNet (72.2, 73.6), PTv1

Ablation study

Conducted ablation studies to examine effectiveness of each module in design
Examined effects of different attention designs
Compared grouped vector attention (GVA) with multi-head self-attention (MSA)
Studied effects of different weight encoding functions
Examined superiority of partition-based pooling over sampling-based pooling
Ablated different modules in PTv2
Baseline result increased from 70.6% to 75.4%
Each component added to design increased mIOU

Model complexity and latency

Conducted model complexity and latency studies to examine efficiency of design
Compared pooling methods, found grid pooling faster and higher mIoU
Compared model complexity, time consumption, and evaluation performances
Found GVA with grouped weight encoding improved model performance and slightly reduced execution time
Found grid pooling significantly sped up network and enhanced generalization ability
Position encoding multiplier increased number of model parameters but improved performance

Conclusion

Propose Point Transformer V2 (PTv2) for 3D point cloud understanding
Make non-trivial improvements on Point Transformer V1
Grouped vector attention, improved position encoding, and partition-based pooling
State-of-the-art performance on point cloud classification and semantic segmentation
Ablation experiments on depths of decoder blocks with different unpooling methods
Adding attention blocks in each decoding stage improves performance
Deeper decoders do not improve performance
Ablation study on PE Multiplier
Group vector attention reduces overfitting and enhances generalization
Feature-level pooling methods: FPS-kNN, Grid-kNN, and partition-based pooling
Benchmark with synthetic data
Attention type, pooling method, weight encoding, and module design ablation
Model performance and amortized latency with different pooling methods
Data augmentation and training setting

Link to paper#

Abstract#

Paper Content#

Introduction#

Related works#

Point transformer v2#

Problem formulation and background#

Grouped vector attention#

Position encoding multipler#

Partition-based pooling#

Experiments#

Ablation study#

Model complexity and latency#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related works

Point transformer v2

Problem formulation and background

Grouped vector attention

Position encoding multipler

Partition-based pooling

Experiments

Ablation study

Model complexity and latency

Conclusion