Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Point Transformer V2 model proposed to overcome limitations of previous work
  • Group vector attention proposed, more effective than previous version
  • Grouped weight encoding layer and position encoding multiplier proposed
  • Lightweight partition-based pooling methods designed for better spatial alignment and efficient sampling
  • Model achieves state-of-the-art performance on 3D point cloud understanding benchmarks

Paper Content

Introduction

  • Point Transformer (PTv1) introduces self-attention networks to 3D point cloud understanding
  • Point Transformer V2 (PTv2) improves upon PTv1 with novel designs
  • Grouped vector attention with improved position encoding
  • Efficient partition-based pooling scheme
  • Improved position encoding scheme to utilize point cloud coordinates better
  • Image transformers use scaled dot-product self-attention and multi-head self-attention theory from NLP
  • Point cloud understanding methods can be projection-based, voxel-based, or point-based
  • Point cloud transformers use attention to process point clouds
  • Point Transformer [1] performs local attention between each point and its adjacent points
  • Point Transformer V2 proposed to improve effectiveness and efficiency of Point Transformer

Point transformer v2

Problem formulation and background

  • Problem formulation: 3D point cloud scene contains points with positions and features
  • Goal of point cloud semantic segmentation: predict class label for each point
  • Goal of scene classification: predict class label for each scene
  • Local attention: attention works within subset of points, i.e. reference point set
  • Scalar attention: attention weights are scalars computed from scaled dot-product between query and key vectors
  • Vector attention: attention weights are vectors that modulate individual feature channels

Grouped vector attention

  • Vector attention has a large parameter size which restricts its efficiency and generalization ability.
  • Grouped vector attention (GVA) divides channels of the value vector into groups and shares the same scalar attention weight from the grouped attention vector.
  • GVA reduces the number of parameters compared to vector attention, leading to a more powerful and efficient model.

Position encoding multipler

  • 3D point clouds have a more complicated spatial relationship than 2D images.
  • Vector attention in PTv1 has a generalization limitation.
  • PTv2 uses grouped vector attention to reduce overfitting and enhance generalization.
  • Position encoding is improved with an additional multiplier to the relation vector.

Partition-based pooling

  • Traditional sampling-based pooling procedures use a combination of sampling and query methods.
  • Proposed partition-based pooling approach uses non-overlapping partitions to separate the space.
  • Pooling process fuses each subset of points from a single partition.
  • Unpooling process maps point feature to all points from the same subset.

Experiments

  • Conducted experimental evaluations on ScanNet v2, S3DIS, and ModelNet40
  • Implementation details available in appendix
  • Results: SegGCN (63.5, 66.6), RandLA-Net (-58.9), KPConv (69.2, 68.6), JSENet (-69.9), FusionNet (-68.8), SparseConvNet (69.3, 72.5), MinkUNet (72.2, 73.6), PTv1

Ablation study

  • Conducted ablation studies to examine effectiveness of each module in design
  • Examined effects of different attention designs
  • Compared grouped vector attention (GVA) with multi-head self-attention (MSA)
  • Studied effects of different weight encoding functions
  • Examined superiority of partition-based pooling over sampling-based pooling
  • Ablated different modules in PTv2
  • Baseline result increased from 70.6% to 75.4%
  • Each component added to design increased mIOU

Model complexity and latency

  • Conducted model complexity and latency studies to examine efficiency of design
  • Compared pooling methods, found grid pooling faster and higher mIoU
  • Compared model complexity, time consumption, and evaluation performances
  • Found GVA with grouped weight encoding improved model performance and slightly reduced execution time
  • Found grid pooling significantly sped up network and enhanced generalization ability
  • Position encoding multiplier increased number of model parameters but improved performance

Conclusion

  • Propose Point Transformer V2 (PTv2) for 3D point cloud understanding
  • Make non-trivial improvements on Point Transformer V1
  • Grouped vector attention, improved position encoding, and partition-based pooling
  • State-of-the-art performance on point cloud classification and semantic segmentation
  • Ablation experiments on depths of decoder blocks with different unpooling methods
  • Adding attention blocks in each decoding stage improves performance
  • Deeper decoders do not improve performance
  • Ablation study on PE Multiplier
  • Group vector attention reduces overfitting and enhances generalization
  • Feature-level pooling methods: FPS-kNN, Grid-kNN, and partition-based pooling
  • Benchmark with synthetic data
  • Attention type, pooling method, weight encoding, and module design ablation
  • Model performance and amortized latency with different pooling methods
  • Data augmentation and training setting