Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Modern methods for autonomous driving perception use a bird’s-eye-view (BEV) representation to describe a 3D scene.
A tri-perspective view (TPV) representation is proposed to better describe the 3D structure of a scene.
A transformer-based TPV encoder (TPVFormer) is used to lift image features to the 3D TPV space.
Experiments show that the model can effectively predict the semantic occupancy for all voxels.
The model can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task.

Paper Content

Introduction

Perceiving 3D surroundings is important for autonomous driving systems.
Vision-based 3D perception is a promising alternative to LiDAR-based one.
Conventional methods split 3D space into voxels and assign each a vector.
Modern methods focus on ground plane (bird’s-eye-view) where information varies the most.
Objects with various 3D structures are difficult to encode using a flattened vector.
Proposed tri-perspective view (TPV) representation to describe 3D scene.

3D space is discretized into voxels for 3D semantic occupancy prediction
BEV-based methods encode height information in each BEV grid for a more compact representation
BEV-based methods project image features into 3D space followed by BEV pooling
Implicit representations learn a continuous function to output representations of points
Hybrid explicit-implicit representations combine computation-efficient architecture of implicit representations and better spatial awareness of explicit representations

Proposed approach

Generalizing bev to tpv

Autonomous driving perception requires expressive and efficient representation of 3D scenes.
Voxel representation describes 3D scene with dense cubic features.
Bird’s-Eye-View (BEV) models 3D scene with 2D feature map.
Tri-Perspective View (TPV) representation models 3D space at full scale without suppressing any axes.
TPV representation has storage and computation complexity of O(HW + DH + W D).
TPV generalizes BEV from single top view to complementary and orthogonal top, side and front views.
TPV offers more comprehensive and fine-grained understanding of 3D surroundings while remaining efficient.

Tpvformer

2D backbone is used to obtain image features before feeding them into a specific encoder
Transformer-based TPV encoder (TPVFormer) is presented to lift image features to the TPV planes
TPV queries are used to encode view-specific information from the corresponding pillar region
Cross-view hybrid-attention enables direct interactions among TPV queries from the same or different views
Image cross-attention is used to lift multi-scale and multi-camera image features to the TPV planes
Deformable attention is used to reduce computation
TPV queries are enhanced with raw visual information from image features in HCAB blocks
HAB blocks specialize in contextual information encoding
TPV queries are initialized as learnable parameters and 3D positional embedding is added
Reference points are sampled uniformly along the direction perpendicular to the top plane
Offsets and attention weights are calculated through linear layers
Sampled features are weighted by their attention score

Applications of tpv

TPV planes encode 3D scene information
TPV planes can be converted to point and voxel features
Point features are obtained by projecting points onto TPV planes
Voxel features are obtained by broadcasting TPV planes
Lightweight MLP is used to predict semantic labels

Experiments

Task descriptions

3 experiments conducted: 3D semantic occupancy prediction, LiDAR segmentation, and semantic scene completion
All experiments use RGB images as input
3D semantic occupancy prediction is a practical yet challenging task
LiDAR segmentation task does not use point clouds as input
Semantic scene completion uses RGB images as input and predicts occupancy and semantic label of each voxel
Evaluation metrics: IoU of occupied voxels for SC task, mIoU of all semantic classes for SSC task

Implementation details

Constructed two versions of TPVFormer for different performance/efficiency trade-offs
Used ResNet101-DCN and ResNet-50 for the two versions
Employed cross entropy loss and lovasz-softmax loss to optimize network
Generated pseudo-per-voxel labels from sparse point cloud
Used voxel predictions as input to both lovasz-softmax and cross-entropy losses
Adopted 2D UNet based on pretrained EfficientNetB7 as 2D backbone for SSC task
Employed losses from MonoScene except for relation loss

3d semantic occupancy prediction results

TPV representation is effective in modeling 3D scenes and predicting semantic occupancy
Querying LiDAR points results in close predictions to ground truth
TPVFormer predicts correctly while Cylinder3D fails in some cases
Resolution of TPV planes can be adjusted at test time without retraining the network

Lidar segmentation results

TPVFormer is the first vision-based method for LiDAR segmentation task.
TPVFormer achieves comparable mIoU (∼ 70%) with most LiDAR-based methods.

Semantic scene completion results

TPVFormer outperforms all other methods in both IoU and mIoU
TPVFormer has fewer parameters and lower computation than MonoScene

Abation study

TPVFormer is tested on two datasets for LiDAR segmentation and semantic scene completion
Two loss functions are used: cross entropy and lovasz-softmax
Results show that using both voxel and point predictions as input to the loss functions yields high mIoUs
TPVFormer performs better than BEVFormer in all configurations
Increasing the number of HCAB blocks improves IoU, while a moderate number of HCAB and HAB blocks yields the best semantic prediction

Conclusion

Proposed TPV representation to describe 3D scene structures
TPVFormer model based on attention mechanism
Visualization results show consistent semantic voxel occupancy prediction
Comparable performance with LiDAR-based methods on nuScenes LiDAR segmentation task
ResNet101-DCN and ResNet-50 used for TPVFormer-Base and TPVFormer-Small respectively
AdamW optimizer with initial learning rate 2e-4 and weight decay 0.01
Cosine learning rate scheduler with linear warming up in first 500 iterations
Image augmentation strategy same as BEVFormer
Trained for 24 epochs with batch size 8 on 8 A100 GPUs
2D UNet based on pretrained EfficientNetB7 used for 2D backbone
TPV resolution 128x128x16 to generate 3D voxel feature tensor
AdamW optimizer with learning rate 2e-4, weight decay 0.01 and cosine scheduler
Comparable performance with LiDAR-based methods on nuScenes validation set
Outperforms other methods in mIoU with clear margin for semantic scene completion
Video demo and visualizations of 3D semantic occupancy prediction
Table 6 and 7 show performance of TPVFormer on nuScenes and SemanticKITTI datasets

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Proposed approach#

Generalizing bev to tpv#

Tpvformer#

Applications of tpv#

Experiments#

Task descriptions#

Implementation details#

3d semantic occupancy prediction results#

Lidar segmentation results#

Semantic scene completion results#

Abation study#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Proposed approach

Generalizing bev to tpv

Tpvformer

Applications of tpv

Experiments

Task descriptions

Implementation details

3d semantic occupancy prediction results

Lidar segmentation results

Semantic scene completion results

Abation study

Conclusion