Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Modern methods for autonomous driving perception use a bird’s-eye-view (BEV) representation to describe a 3D scene.
- A tri-perspective view (TPV) representation is proposed to better describe the 3D structure of a scene.
- A transformer-based TPV encoder (TPVFormer) is used to lift image features to the 3D TPV space.
- Experiments show that the model can effectively predict the semantic occupancy for all voxels.
- The model can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task.
Paper Content
Introduction
- Perceiving 3D surroundings is important for autonomous driving systems.
- Vision-based 3D perception is a promising alternative to LiDAR-based perception.
- Conventional methods split 3D space into voxels and assign each a vector.
- Modern methods focus on ground plane (bird’s-eye-view) where information varies the most.
- Objects with various 3D structures are difficult to encode using a flattened vector.
- Proposed tri-perspective view (TPV) representation to describe 3D scene.
Related work
- 3D space is discretized into voxels for 3D semantic occupancy prediction
- BEV-based methods encode height information in each BEV grid for a more compact representation
- BEV-based methods project image features into 3D space followed by BEV pooling
- Implicit representations learn a continuous function to output representations of points
- Hybrid explicit-implicit representations combine computation-efficient architecture of implicit representations and better spatial awareness of explicit representations
Proposed approach
Generalizing BEV to TPV
- Autonomous driving perception requires expressive and efficient representation of 3D scenes.
- Voxel representation describes 3D scene with dense cubic features.
- Bird’s-Eye-View (BEV) models 3D scene with 2D feature map.
- Tri-Perspective View (TPV) representation models 3D space at full scale without suppressing any axes.
- TPV representation has storage and computation complexity of O(HW + DH + WD).
- TPV generalizes BEV from single top view to complementary and orthogonal top, side and front views.
- TPV offers more comprehensive and fine-grained understanding of 3D surroundings while remaining efficient.
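
To make the complexity comparison above concrete, the short sketch below counts feature cells for the three representations. The grid size (H = W = 200, D = 16) is a hypothetical choice for illustration and is not taken from the paper.

```python
def voxel_cells(h: int, w: int, d: int) -> int:
    """Dense voxel grid: one feature vector per cell, O(HWD)."""
    return h * w * d


def bev_cells(h: int, w: int) -> int:
    """BEV: a single top-view plane, O(HW); the height axis is collapsed."""
    return h * w


def tpv_cells(h: int, w: int, d: int) -> int:
    """TPV: top (HW), side (DH) and front (WD) planes, O(HW + DH + WD)."""
    return h * w + d * h + w * d


# Hypothetical grid size, chosen only for illustration.
H, W, D = 200, 200, 16
print("voxel:", voxel_cells(H, W, D))  # 640000
print("bev:  ", bev_cells(H, W))       # 40000
print("tpv:  ", tpv_cells(H, W, D))    # 46400
```

With these sizes, TPV stores only slightly more cells than BEV while still keeping a feature plane for every axis pair.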
TPVFormer
- 2D backbone is used to obtain image features before feeding them into a specific encoder
- Transformer-based TPV encoder (TPVFormer) is presented to lift image features to the TPV planes
- TPV queries are used to encode view-specific information from the corresponding pillar region
- Cross-view hybrid-attention enables direct interactions among TPV queries from the same or different views
- Image cross-attention is used to lift multi-scale and multi-camera image features to the TPV planes
- Deformable attention is used to reduce computation
- TPV queries are enhanced with raw visual information from image features in HCAB blocks
- HAB blocks specialize in contextual information encoding
- TPV queries are initialized as learnable parameters and 3D positional embedding is added
- Reference points are sampled uniformly along the direction perpendicular to the top plane
- Offsets and attention weights are calculated through linear layers
- Sampled features are weighted by their attention score
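
The last few bullets (reference points along the pillar, attention weights from a linear layer, weighted sum of sampled features) can be sketched for a single top-plane query as follows. This is a simplified illustration, not the authors' implementation: the function names, shapes, and the single-head setup are assumptions, and the learned sampling offsets, the projection onto camera images, and bilinear feature sampling are all omitted.

```python
import numpy as np


def pillar_reference_points(h: int, w: int, num_ref: int, depth: float) -> np.ndarray:
    """Sample num_ref points uniformly along the axis perpendicular to the
    top plane, for the top-plane cell (h, w). Returns shape (num_ref, 3)."""
    z = (np.arange(num_ref) + 0.5) * depth / num_ref  # bin centers along the pillar
    hs = np.full(num_ref, float(h))
    ws = np.full(num_ref, float(w))
    return np.stack([hs, ws, z], axis=-1)


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(query: np.ndarray, sampled_feats: np.ndarray, w_attn: np.ndarray) -> np.ndarray:
    """Weight sampled features by attention scores.

    query:         (C,)  TPV query vector
    sampled_feats: (S, C) features gathered at S sampling locations
    w_attn:        (C, S) weights of a (hypothetical) linear layer that maps
                   the query to one attention logit per sampling location
    """
    scores = softmax(query @ w_attn)  # (S,), sums to 1
    return scores @ sampled_feats     # (C,), attention-weighted sum
```

In a full deformable-attention layer, a second linear layer would predict offsets around each reference point before features are sampled; here the sampled features are passed in directly to keep the sketch short.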
Applications of TPV
- TPV planes encode 3D scene information
- TPV planes can be converted to point and voxel features
- Point features are obtained by projecting points onto TPV planes
- Voxel features are obtained by broadcasting TPV planes
- Lightweight MLP is used to predict semantic labels
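
The conversion from TPV planes to point and voxel features described above can be sketched in NumPy. The plane layouts (H, W, C), (D, H, C) and (W, D, C), summation as the aggregation step, and nearest-cell lookup in place of interpolated sampling are assumptions made for this illustration.

```python
import numpy as np


def point_feature(t_hw, t_dh, t_wd, h: int, w: int, d: int) -> np.ndarray:
    """Project an (integer-indexed) point onto the three planes and sum
    the features found there. Returns shape (C,)."""
    return t_hw[h, w] + t_dh[d, h] + t_wd[w, d]


def voxel_features(t_hw, t_dh, t_wd) -> np.ndarray:
    """Broadcast each plane along its orthogonal axis and sum.
    Returns a dense tensor of shape (H, W, D, C)."""
    top = t_hw[:, :, None, :]                       # (H, W, 1, C)
    side = t_dh.transpose(1, 0, 2)[:, None, :, :]   # (H, 1, D, C)
    front = t_wd[None, :, :, :]                     # (1, W, D, C)
    return top + side + front


# Small random planes, for illustration only.
rng = np.random.default_rng(0)
H, W, D, C = 4, 5, 3, 8
t_hw = rng.normal(size=(H, W, C))
t_dh = rng.normal(size=(D, H, C))
t_wd = rng.normal(size=(W, D, C))

vox = voxel_features(t_hw, t_dh, t_wd)
print(vox.shape)  # (4, 5, 3, 8)
```

Each voxel feature equals the point feature at the same index, so a single lightweight prediction head can be applied to either form.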
Experiments
Task descriptions
- 3 experiments conducted: 3D semantic occupancy prediction, LiDAR segmentation, and semantic scene completion
- All experiments use RGB images as input
- 3D semantic occupancy prediction is a practical yet challenging task
- LiDAR segmentation task does not use point clouds as input
- Semantic scene completion uses RGB images as input and predicts occupancy and semantic label of each voxel
- Evaluation metrics: IoU of occupied voxels for the scene completion (SC) task, mIoU over all semantic classes for the semantic scene completion (SSC) task
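
The two metrics named above can be computed from integer label grids as in the sketch below. It assumes label 0 marks an empty voxel and that classes absent from both prediction and ground truth are skipped when averaging; benchmark toolkits may handle ignored labels differently.

```python
import numpy as np


def completion_iou(pred: np.ndarray, gt: np.ndarray, empty: int = 0) -> float:
    """IoU of occupied voxels, ignoring which semantic class they carry."""
    p, g = pred != empty, gt != empty
    union = np.logical_or(p, g).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(p, g).sum() / union)


def semantic_miou(pred: np.ndarray, gt: np.ndarray, classes) -> float:
    """Mean of per-class IoU over the given semantic classes."""
    ious = []
    for c in classes:
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both grids: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


pred = np.array([0, 1, 1, 2])
gt = np.array([0, 1, 2, 2])
print(completion_iou(pred, gt))         # 1.0
print(semantic_miou(pred, gt, [1, 2]))  # 0.5
```

The toy example shows why both metrics are reported: occupancy is predicted perfectly, yet one voxel carries the wrong class.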
Implementation details
- Constructed two versions of TPVFormer for different performance/efficiency trade-offs
- Used ResNet101-DCN and ResNet-50 for the two versions
- Employed cross entropy loss and lovasz-softmax loss to optimize network
- Generated pseudo-per-voxel labels from sparse point cloud
- Used voxel predictions as input to both lovasz-softmax and cross-entropy losses
- Adopted 2D UNet based on pretrained EfficientNetB7 as 2D backbone for SSC task
- Employed losses from MonoScene except for relation loss
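
One way to generate pseudo per-voxel labels from a sparse labeled point cloud, as mentioned above, is to voxelize the points and mark every voxel without a point as empty. The sketch below follows that idea; the function name, the reserved empty label 0, and the rule that the last point written to a voxel decides its label are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np


def pseudo_voxel_labels(points, labels, grid_shape, voxel_size, origin, empty_label=0):
    """Build a dense label grid from labeled points.

    points: (N, 3) float coordinates
    labels: (N,) integer semantic labels (assumed to differ from empty_label)
    Voxels that contain no point keep empty_label.
    """
    grid = np.full(grid_shape, empty_label, dtype=np.int64)
    idx = np.floor((points - origin) / voxel_size).astype(np.int64)
    # Drop points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, lab = idx[inside], labels[inside]
    # If several points share a voxel, one of them decides the label;
    # a majority vote would be a reasonable alternative.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = lab
    return grid


points = np.array([[0.5, 0.5, 0.5], [1.5, 0.2, 0.1], [9.0, 9.0, 9.0]])
labels = np.array([3, 5, 7])
grid = pseudo_voxel_labels(points, labels, (2, 2, 2), 1.0, np.zeros(3))
```

The third point lies outside the 2 x 2 x 2 grid and is discarded, so only two voxels receive a non-empty label.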
3D semantic occupancy prediction results
- TPV representation is effective in modeling 3D scenes and predicting semantic occupancy
- Querying LiDAR points results in close predictions to ground truth
- TPVFormer predicts correctly while Cylinder3D fails in some cases
- Resolution of TPV planes can be adjusted at test time without retraining the network
LiDAR segmentation results
- TPVFormer is the first vision-based method for LiDAR segmentation task.
- TPVFormer achieves comparable mIoU (~70%) with most LiDAR-based methods.
Semantic scene completion results
- TPVFormer outperforms all other methods in both IoU and mIoU
- TPVFormer has fewer parameters and lower computation than MonoScene
Ablation study
- TPVFormer is tested on two datasets for LiDAR segmentation and semantic scene completion
- Two loss functions are used: cross entropy and lovasz-softmax
- Results show that using both voxel and point predictions as input to the loss functions yields high mIoUs
- TPVFormer performs better than BEVFormer in all configurations
- Increasing the number of HCAB blocks improves IoU, while a moderate number of HCAB and HAB blocks yields the best semantic prediction
Conclusion
- Proposed TPV representation to describe 3D scene structures
- TPVFormer model based on attention mechanism
- Visualization results show consistent semantic voxel occupancy prediction
- Comparable performance with LiDAR-based methods on nuScenes LiDAR segmentation task
- ResNet101-DCN and ResNet-50 used for TPVFormer-Base and TPVFormer-Small respectively
- AdamW optimizer with initial learning rate 2e-4 and weight decay 0.01
- Cosine learning rate scheduler with linear warmup over the first 500 iterations
- Image augmentation strategy same as BEVFormer
- Trained for 24 epochs with batch size 8 on 8 A100 GPUs
- 2D UNet based on pretrained EfficientNetB7 used for 2D backbone
- TPV resolution 128x128x16 to generate 3D voxel feature tensor
- AdamW optimizer with learning rate 2e-4, weight decay 0.01 and cosine scheduler
- Comparable performance with LiDAR-based methods on nuScenes validation set
- Outperforms other methods in mIoU with clear margin for semantic scene completion
- Video demo and visualizations of 3D semantic occupancy prediction
- Tables 6 and 7 show performance of TPVFormer on nuScenes and SemanticKITTI datasets
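
The training schedule listed among the details above (initial learning rate 2e-4, cosine decay, linear warmup over the first 500 iterations) can be written as a function of the iteration index. The total number of iterations, a warmup that ramps up from near zero, and a final learning rate of zero are assumptions in this sketch; the summary does not state them.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float = 2e-4,
          warmup_steps: int = 500, min_lr: float = 0.0) -> float:
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical run length of 10,000 iterations.
for s in (0, 499, 500, 5250, 10000):
    print(s, lr_at(s, total_steps=10000))
```

Frameworks such as PyTorch offer built-in schedulers for this pattern; the plain function here only makes the shape of the schedule explicit.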