Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Semantic occupancy perception is essential for autonomous driving.
- Existing benchmarks lack diversity in urban scenes and only evaluate front-view predictions.
- OpenOccupancy is the first surrounding semantic occupancy perception benchmark.
- Annotations rely on LiDAR points superimposition, which can miss some occupancy labels.
- Augmenting And Purifying (AAP) pipeline is used to densify the annotations.
- Camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark.
- Cascade Occupancy Network (CONet) is proposed to refine the coarse prediction.
Paper Content
Introduction
- Accurately perceiving 3D structures of different objects and regions in urban scenes is important for safe driving
- There is growing interest in semantic occupancy perception
- Semantic occupancy perception is a challenging research direction
- Most relevant benchmarks are for indoor scenes
- SemanticKITTI extends occupancy perception to driving scenarios but is small in scale and limited in diversity
- OpenOccupancy is the first surrounding semantic occupancy perception benchmark
- nuScenes-Occupancy is the first dataset for surrounding semantic occupancy segmentation
- Established camera-based, LiDAR-based and multi-modal baselines in the OpenOccupancy benchmark
- Cascade Occupancy Network (CONet) proposed to alleviate computational burden of high-resolution occupancy predictions
- Comprehensive experiments conducted on proposed baselines, CONet, and modern occupancy perception approaches
Related work
- Semantic occupancy perception originates from SUNCG
- Various relevant benchmarks have been released
- Most existing occupancy perception methods rely on geometric inputs
- MonoScene is the first camera-based occupancy perception method
- TPVFormer proposes a tri-perspective view representation
The openoccupancy benchmark
- Introduced concept of surrounding semantic occupancy perception
- Introduced nuScenes-Occupancy dataset
- Presented evaluation protocol to assess surrounding occupancy perception algorithms
- Proposed camera-based, LiDAR-based and multi-modal baselines for OpenOccupancy Benchmark
Surrounding semantic occupancy perception
- Generating a 3D representation of volumetric occupancy and semantic labels for a scene
- Monocular paradigm focuses on front-view perception, surrounding occupancy perception algorithms target at producing semantic occupancy in surround-view driving scenarios
- Given 360-degree inputs, algorithms are required to predict surrounding occupancy labels
- Surround-view inputs cover 5x more perceptive range than front-view sensors
- Core challenge of surrounding occupancy perception is efficiently constructing high-resolution occupancy
Nuscenes-occupancy
- SemanticKITTI is the first dataset for outdoor occupancy perception, but it is limited.
- nuScenes-Occupancy is a large-scale surrounding occupancy perception dataset.
- Annotation is initialized by LiDAR points superimposition.
- Pseudo labels are produced by a pretrained model.
- Human effort is used to purify the augmented labels.
Evaluation protocol
- Evaluation range is set to [-51.2m, 51.2m] for X, Y axis and [-3m, 5m] for Z axis
- Voxel resolution is 0.2m, resulting in 40 x 512 x 512 voxels for occupancy prediction
- Utilize Intersection of Union (IoU) as geometric metric
- Calculate mean IoU (mIoU) of each class as semantic metric
- Noise class is ignored in evaluation
Openoccupancy baselines
- Existing occupancy perception methods are designed for front-view perception
- To extend these approaches to surrounding occupancy perception, each camera-view input is processed individually
- Inconsistency may exist in the overlap region of two adjacent outputs
- Establish baselines to learn surrounding semantic occupancy from 360-degree inputs
- Baselines include camera-based, LiDAR-based and multi-modal
- LiDAR-based baseline uses parameterized voxelization and 3D sparse convolutions
- Camera-based baseline uses 2D encoder and 2D to 3D view transform
- Multi-modal baseline uses adaptive fusion module to integrate LiDAR and camera voxel features
- Loss functions include cross-entropy, lovasz-softmax, affinity and explicit depth supervision
Cascade occupancy network
- The input of the surrounding occupancy perception covers 5x the perceptive range compared to front-view occupancy perception.
- Stride parameter S is set to 4 in the proposed baselines.
- Smaller stride parameter (e.g. S=2) enhances performance but increases GPU memory.
- Cascade Occupancy Network (CONet) is proposed for efficient yet accurate surrounding occupancy perception.
- CONet introduces a coarse-to-fine pipeline which can be built upon the proposed baselines.
- CONet can be generalized to camera-based and LiDAR-based baselines.
Openoccupancy experiment
- Experiment setup is given
- Discusses camera-based, LiDAR-based and multimodal methods
- Analyzes baseline performance under different settings
- Investigates efficiency and effectiveness of CONet
Experiment setup
- Evaluate surrounding semantic occupancy segmentation performance
- Use nuScenes-Occupancy for comprehensive experiments
- Single-view methods generate occupancy predictions for each view individually
- LiDAR-based methods use 10 LiDAR sweeps as input
- Camera-based methods use 1600 ร 900 image size
- Use ImageNet pretrained ResNet50 as backbone
- Utilize trilinear interpolation to upsample outputs
- Leverage AdamW optimizer with cosine learning rate scheduler
- Train models for 24 epochs with batch size of 8 on 8 A100 GPUs
Surrounding occupancy assessment
- Six modern approaches are analyzed for surrounding occupancy perception performance
- Single-view methods are outperformed by surrounding occupancy perception paradigms
- Proposed baselines show adaptability and scalability
- Multi-modal baseline significantly enhances performance
- CONet efficiently refines low-resolution predictions
Baselines under different settings
- Using a larger input size improves IoU and mIoU by 16% and 20%.
- Replacing ResNet50 with ResNet101 further enhances mIoU by 11%.
- Utilizing multi-sweeps as input improves single-sweep counterpart by 43% and 5% on IoU and mIoU.
- Adaptive fusion relatively enhances mIoU by 6% and 5%.
Efficiency and effectiveness of conet
- Proposed baselines generate low-resolution predictions
- Smaller stride parameter enhances performance but increases GPU memory and GFLOPs
- CONet achieves better performance than high-resolution baselines while reducing GPU memory and GFLOPs
- Ablation study shows combining 2D semantic and geometric features improves performance by 33%
Conclusion
- Propose OpenOccupancy benchmark for surrounding semantic occupancy perception in driving scenarios
- Introduce nuScenes-Occupancy dataset with dense semantic occupancy annotations
- Establish camera-based, LiDAR-based and multimodal baselines
- Propose CONet to alleviate computational burden of high-resolution occupancy predictions
- Results show camera-based and LiDAR-based baselines are complementary and multi-modal baseline enhances performance by 47% and 29%
- CONet improves baseline by โผ30% with minimal latency overhead