Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Modern methods for autonomous driving perception use a bird’s-eye-view (BEV) representation to describe a 3D scene.
  • A tri-perspective view (TPV) representation is proposed to better describe the 3D structure of a scene.
  • A transformer-based TPV encoder (TPVFormer) is used to lift image features to the 3D TPV space.
  • Experiments show that the model can effectively predict the semantic occupancy for all voxels.
  • The model achieves performance comparable to LiDAR-based methods on the LiDAR segmentation task.

Paper Content

Introduction

  • Perceiving 3D surroundings is important for autonomous driving systems.
  • Vision-based 3D perception is a promising alternative to LiDAR-based perception.
  • Conventional methods split 3D space into voxels and assign each a vector.
  • Modern methods focus on the ground plane (bird’s-eye view), where information varies the most.
  • Objects with various 3D structures are difficult to encode using a flattened vector.
  • Proposed tri-perspective view (TPV) representation to describe 3D scene.
  • 3D space is discretized into voxels for 3D semantic occupancy prediction
  • BEV-based methods encode height information in each BEV grid for a more compact representation
  • BEV-based methods project image features into 3D space followed by BEV pooling
  • Implicit representations learn a continuous function to output representations of points
  • Hybrid explicit-implicit representations combine computation-efficient architecture of implicit representations and better spatial awareness of explicit representations

Proposed approach

Generalizing BEV to TPV

  • Autonomous driving perception requires expressive and efficient representation of 3D scenes.
  • Voxel representation describes 3D scene with dense cubic features.
  • Bird’s-Eye-View (BEV) models 3D scene with 2D feature map.
  • Tri-Perspective View (TPV) representation models 3D space at full scale without suppressing any axes.
  • TPV representation has storage and computation complexity of O(HW + DH + WD).
  • TPV generalizes BEV from single top view to complementary and orthogonal top, side and front views.
  • TPV offers more comprehensive and fine-grained understanding of 3D surroundings while remaining efficient.
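
To make the complexity comparison above concrete, the following minimal sketch counts feature cells for a dense voxel grid, a single BEV plane, and the three TPV planes. The function names and the resolution used in the example are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def voxel_cells(h: int, w: int, d: int) -> int:
    """Cells in a dense voxel grid: O(HWD)."""
    return h * w * d


def bev_cells(h: int, w: int) -> int:
    """Cells in a single top-view (BEV) plane: O(HW)."""
    return h * w


def tpv_cells(h: int, w: int, d: int) -> int:
    """Cells in the three TPV planes (top, side, front): O(HW + DH + WD)."""
    return h * w + d * h + w * d


if __name__ == "__main__":
    H, W, D = 100, 100, 8  # hypothetical grid resolution
    print("voxel:", voxel_cells(H, W, D))
    print("bev:  ", bev_cells(H, W))
    print("tpv:  ", tpv_cells(H, W, D))
```

For this hypothetical resolution, the TPV planes need only slightly more cells than a single BEV plane, while the dense voxel grid needs several times more.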

TPVFormer

  • 2D backbone is used to obtain image features before feeding them into a specific encoder
  • Transformer-based TPV encoder (TPVFormer) is presented to lift image features to the TPV planes
  • TPV queries are used to encode view-specific information from the corresponding pillar region
  • Cross-view hybrid-attention enables direct interactions among TPV queries from the same or different views
  • Image cross-attention is used to lift multi-scale and multi-camera image features to the TPV planes
  • Deformable attention is used to reduce computation
  • TPV queries are enhanced with raw visual information from image features in HCAB blocks
  • HAB blocks specialize in contextual information encoding
  • TPV queries are initialized as learnable parameters and 3D positional embedding is added
  • Reference points are sampled uniformly along the direction perpendicular to the top plane
  • Offsets and attention weights are calculated through linear layers
  • Sampled features are weighted by their attention score
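
The last three points can be illustrated with a simplified sketch. It assumes a top-plane query at grid cell (h, w), places reference points at bin centers along the height axis, and combines already-sampled feature vectors using softmax-normalized attention scores. The bin-center placement, the softmax normalization, and all function names are assumptions for illustration. The learned offsets, the projection of reference points into camera images, and the linear layers of the actual model are omitted.

```python
import math


def pillar_reference_points(h, w, depth, num_points):
    """Return num_points 3D points (h, w, z) spaced uniformly along the
    axis perpendicular to the top plane (assumed bin-center placement)."""
    step = depth / num_points
    return [(h, w, (i + 0.5) * step) for i in range(num_points)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(features, scores):
    """Combine sampled feature vectors using softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(wt * f[c] for wt, f in zip(weights, features))
            for c in range(dim)]


if __name__ == "__main__":
    print(pillar_reference_points(2, 3, depth=8.0, num_points=4))
    print(weighted_sum([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
```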

Applications of TPV

  • TPV planes encode 3D scene information
  • TPV planes can be converted to point and voxel features
  • Point features are obtained by projecting points onto TPV planes
  • Voxel features are obtained by broadcasting TPV planes
  • Lightweight MLP is used to predict semantic labels
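
The conversion from TPV planes to voxel and point features can be sketched with NumPy broadcasting. This sketch assumes the three plane features are combined by summation and that points are looked up by integer cell index rather than by interpolation. The array layouts `(H, W, C)`, `(D, H, C)`, `(W, D, C)` and the function names are illustrative choices, not the paper's code.

```python
import numpy as np


def voxel_features(t_hw, t_dh, t_wd):
    """Broadcast three TPV planes into a dense (H, W, D, C) voxel tensor.

    t_hw: (H, W, C) top plane, t_dh: (D, H, C) side plane,
    t_wd: (W, D, C) front plane.
    """
    hw = t_hw[:, :, None, :]                      # (H, W, 1, C)
    dh = t_dh.transpose(1, 0, 2)[:, None, :, :]   # (H, 1, D, C)
    wd = t_wd[None, :, :, :]                      # (1, W, D, C)
    return hw + dh + wd                           # summation is an assumption


def point_feature(t_hw, t_dh, t_wd, h, w, d):
    """Feature of one point: project onto each plane and sum the lookups."""
    return t_hw[h, w] + t_dh[d, h] + t_wd[w, d]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, D, C = 4, 5, 3, 2  # hypothetical sizes
    t_hw = rng.normal(size=(H, W, C))
    t_dh = rng.normal(size=(D, H, C))
    t_wd = rng.normal(size=(W, D, C))
    vox = voxel_features(t_hw, t_dh, t_wd)
    print(vox.shape)
```

Under these assumptions, a voxel feature at index (h, w, d) equals the point feature for a point falling in that cell, so the same lightweight prediction head can serve both cases.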

Experiments

Task descriptions

  • 3 experiments conducted: 3D semantic occupancy prediction, LiDAR segmentation, and semantic scene completion
  • All experiments use RGB images as input
  • 3D semantic occupancy prediction is a practical yet challenging task
  • LiDAR segmentation task does not use point clouds as input
  • Semantic scene completion uses RGB images as input and predicts occupancy and semantic label of each voxel
  • Evaluation metrics: IoU of occupied voxels for SC task, mIoU of all semantic classes for SSC task
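
The two metrics can be sketched as follows. The sketch assumes integer label arrays in which a designated label marks empty voxels. The `EMPTY` id, the function names, and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration. Benchmark implementations may accumulate counts over the whole dataset rather than per scene.

```python
import numpy as np

EMPTY = 0  # hypothetical label id for empty voxels


def scene_completion_iou(pred, target, empty_label=EMPTY):
    """IoU between predicted and ground-truth occupied voxels (SC)."""
    pred_occ = pred != empty_label
    target_occ = target != empty_label
    inter = np.logical_and(pred_occ, target_occ).sum()
    union = np.logical_or(pred_occ, target_occ).sum()
    return float(inter) / float(union) if union > 0 else float("nan")


def mean_iou(pred, target, class_ids):
    """Mean of per-class IoU over the given semantic classes (SSC)."""
    ious = []
    for c in class_ids:
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both arrays
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


if __name__ == "__main__":
    pred = np.array([0, 1, 1, 2])
    target = np.array([0, 1, 2, 2])
    print(scene_completion_iou(pred, target))
    print(mean_iou(pred, target, class_ids=[1, 2]))
```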

Implementation details

  • Constructed two versions of TPVFormer for different performance/efficiency trade-offs
  • Used ResNet101-DCN and ResNet-50 for the two versions
  • Employed cross entropy loss and lovasz-softmax loss to optimize network
  • Generated pseudo-per-voxel labels from sparse point cloud
  • Used voxel predictions as input to both lovasz-softmax and cross-entropy losses
  • Adopted 2D UNet based on pretrained EfficientNetB7 as 2D backbone for SSC task
  • Employed losses from MonoScene except for relation loss
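
One way to picture the pseudo-per-voxel label generation is the sketch below. It assumes that points are already expressed in grid-aligned coordinates with the origin at a grid corner, that voxels containing no points receive an `EMPTY` label, and that voxels containing several points take the majority label. These choices, along with all names, are assumptions for illustration and not a description of the authors' exact procedure.

```python
from collections import Counter, defaultdict

EMPTY = 0  # hypothetical label id for voxels that contain no points


def pseudo_voxel_labels(points, labels, voxel_size, grid_shape):
    """Build an H x W x D nested list of labels from labeled 3D points."""
    votes = defaultdict(Counter)
    for (x, y, z), label in zip(points, labels):
        idx = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        # Ignore points that fall outside the grid.
        if all(0 <= i < n for i, n in zip(idx, grid_shape)):
            votes[idx][label] += 1

    h, w, d = grid_shape
    grid = [[[EMPTY for _ in range(d)] for _ in range(w)] for _ in range(h)]
    for (i, j, k), counter in votes.items():
        grid[i][j][k] = counter.most_common(1)[0][0]  # majority vote
    return grid


if __name__ == "__main__":
    pts = [(0.2, 0.2, 0.2), (0.4, 0.1, 0.3), (0.3, 0.3, 0.3), (1.5, 0.5, 0.5)]
    lbl = [3, 3, 5, 7]
    print(pseudo_voxel_labels(pts, lbl, voxel_size=1.0, grid_shape=(2, 2, 1)))
```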

3D semantic occupancy prediction results

  • TPV representation is effective in modeling 3D scenes and predicting semantic occupancy
  • Querying LiDAR points results in close predictions to ground truth
  • TPVFormer predicts correctly while Cylinder3D fails in some cases
  • Resolution of TPV planes can be adjusted at test time without retraining the network

LiDAR segmentation results

  • TPVFormer is the first vision-based method for LiDAR segmentation task.
  • TPVFormer achieves an mIoU (∼70%) comparable to most LiDAR-based methods.

Semantic scene completion results

  • TPVFormer outperforms all other methods in both IoU and mIoU
  • TPVFormer has fewer parameters and lower computation than MonoScene

Ablation study

  • TPVFormer is tested on two datasets for LiDAR segmentation and semantic scene completion
  • Two loss functions are used: cross entropy and lovasz-softmax
  • Results show that using both voxel and point predictions as input to the loss functions yields high mIoUs
  • TPVFormer performs better than BEVFormer in all configurations
  • Increasing the number of HCAB blocks improves IoU, while a moderate number of HCAB and HAB blocks yields the best semantic prediction

Conclusion

  • Proposed TPV representation to describe 3D scene structures
  • TPVFormer model based on attention mechanism
  • Visualization results show consistent semantic voxel occupancy prediction
  • Comparable performance with LiDAR-based methods on nuScenes LiDAR segmentation task
  • ResNet101-DCN and ResNet-50 used for TPVFormer-Base and TPVFormer-Small respectively
  • AdamW optimizer with initial learning rate 2e-4 and weight decay 0.01
  • Cosine learning rate scheduler with linear warmup over the first 500 iterations
  • Image augmentation strategy same as BEVFormer
  • Trained for 24 epochs with batch size 8 on 8 A100 GPUs
  • 2D UNet based on pretrained EfficientNetB7 used for 2D backbone
  • TPV resolution 128x128x16 to generate 3D voxel feature tensor
  • AdamW optimizer with learning rate 2e-4, weight decay 0.01 and cosine scheduler
  • Comparable performance with LiDAR-based methods on nuScenes validation set
  • Outperforms other methods in mIoU with clear margin for semantic scene completion
  • Video demo and visualizations of 3D semantic occupancy prediction
  • Tables 6 and 7 show performance of TPVFormer on nuScenes and SemanticKITTI datasets
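
The learning rate schedule listed among the training details (cosine decay with linear warmup over the first 500 iterations, initial rate 2e-4) can be sketched as a plain function of the iteration index. The warmup starting value, the minimum learning rate, and the `total_steps` parameter are assumptions, since the summary does not specify them.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_steps=500, min_lr=0.0):
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr at the last warmup step (assumed).
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 10_000  # hypothetical number of training iterations
    for s in (0, 499, 500, 5_250, 10_000):
        print(s, lr_at(s, total))
```

In practice, a schedule like this would typically be attached to an AdamW optimizer through the training framework's scheduler utilities rather than computed by hand.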