Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Semantic occupancy perception is essential for autonomous driving.
Existing benchmarks lack diversity in urban scenes and only evaluate front-view predictions.
OpenOccupancy is the first surrounding semantic occupancy perception benchmark.
Annotations rely on superimposing LiDAR points, which can miss some occupancy labels.
An Augmenting And Purifying (AAP) pipeline is used to densify the annotations.
Camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark.
Cascade Occupancy Network (CONet) is proposed to refine the coarse prediction.

Paper Content

Introduction

Accurately perceiving 3D structures of different objects and regions in urban scenes is important for safe driving
There is growing interest in semantic occupancy perception
Semantic occupancy perception is a challenging research direction
Most relevant benchmarks are for indoor scenes
SemanticKITTI extends occupancy perception to driving scenarios but is small in scale and limited in diversity
OpenOccupancy is the first surrounding semantic occupancy perception benchmark
nuScenes-Occupancy is the first dataset for surrounding semantic occupancy segmentation
Established camera-based, LiDAR-based and multi-modal baselines in the OpenOccupancy benchmark
Cascade Occupancy Network (CONet) proposed to alleviate computational burden of high-resolution occupancy predictions
Comprehensive experiments conducted on proposed baselines, CONet, and modern occupancy perception approaches

Related work

Semantic occupancy perception originates from SUNCG
Various relevant benchmarks have been released
Most existing occupancy perception methods rely on geometric inputs
MonoScene is the first camera-based occupancy perception method
TPVFormer proposes a tri-perspective view representation

The OpenOccupancy benchmark

Introduced concept of surrounding semantic occupancy perception
Introduced nuScenes-Occupancy dataset
Presented evaluation protocol to assess surrounding occupancy perception algorithms
Proposed camera-based, LiDAR-based and multi-modal baselines for OpenOccupancy Benchmark

Surrounding semantic occupancy perception

Generating a 3D representation of volumetric occupancy and semantic labels for a scene
The monocular paradigm focuses on front-view perception, whereas surrounding occupancy perception algorithms aim to produce semantic occupancy in surround-view driving scenarios
Given 360-degree inputs, algorithms are required to predict surrounding occupancy labels
Surround-view inputs cover 5x the perceptive range of front-view sensors
Core challenge of surrounding occupancy perception is efficiently constructing high-resolution occupancy

nuScenes-Occupancy

SemanticKITTI is the first dataset for outdoor occupancy perception, but it is limited.
nuScenes-Occupancy is a large-scale surrounding occupancy perception dataset.
Annotation is initialized by superimposing LiDAR points.
Pseudo labels are produced by a pretrained model.
Human effort is used to purify the augmented labels.
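
The augmenting step above can be sketched roughly as follows. This is a minimal illustration, not the dataset's actual tooling: it assumes that pseudo labels only fill voxels the LiDAR-derived annotation left empty, and the function name `augment_labels` and the `empty=0` label id are hypothetical. The human purifying step is not modelled.

```python
import numpy as np

def augment_labels(initial, pseudo, empty=0):
    """Fill voxels that the initial annotation left empty with pseudo labels.

    Assumption: labels from LiDAR superimposition take priority, so a pseudo
    label is only used where the initial annotation equals `empty`.
    """
    return np.where(initial == empty, pseudo, initial)

initial = np.array([0, 3, 0, 5])   # sparse initial annotation (hypothetical ids)
pseudo = np.array([2, 4, 0, 1])    # predictions from a pretrained model
augmented = augment_labels(initial, pseudo)
```

Only the first voxel changes in this toy case: it was empty in the initial annotation and receives the pseudo label, while already-labelled voxels are left untouched.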

Evaluation protocol

Evaluation range is set to [-51.2m, 51.2m] for X, Y axis and [-3m, 5m] for Z axis
Voxel resolution is 0.2m, resulting in 40 x 512 x 512 voxels for occupancy prediction
Utilize Intersection over Union (IoU) as the geometric metric
Calculate mean IoU (mIoU), the average of per-class IoU, as the semantic metric
Noise class is ignored in evaluation
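
As a rough sketch of the protocol above, the snippet below derives the voxel grid shape from the stated range and resolution, then computes a class-agnostic IoU and a per-class mIoU. The label ids (`empty=0`, `ignore=255`) and the function names are placeholders chosen for illustration, not the benchmark's actual encoding or evaluation code.

```python
import numpy as np

# Evaluation range (metres) and voxel size, as stated in the protocol.
X_RANGE = (-51.2, 51.2)
Y_RANGE = (-51.2, 51.2)
Z_RANGE = (-3.0, 5.0)
VOXEL_SIZE = 0.2

def grid_shape():
    """Return the (Z, X, Y) voxel counts implied by the range and voxel size."""
    return tuple(
        int(round((hi - lo) / VOXEL_SIZE)) for lo, hi in (Z_RANGE, X_RANGE, Y_RANGE)
    )

def geometric_iou(pred, gt, empty=0, ignore=255):
    """Class-agnostic IoU: occupied vs. empty, skipping ignored (noise) voxels."""
    valid = gt != ignore
    p = (pred != empty) & valid
    g = (gt != empty) & valid
    union = np.logical_or(p, g).sum()
    return float(np.logical_and(p, g).sum() / union) if union else float("nan")

def mean_iou(pred, gt, class_ids, ignore=255):
    """Average of per-class IoU over the semantic classes in class_ids."""
    valid = gt != ignore
    ious = []
    for c in class_ids:
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy example with five voxels; the last one is noise and is ignored.
gt = np.array([0, 1, 2, 2, 255])
pred = np.array([0, 1, 1, 2, 1])
iou = geometric_iou(pred, gt)
miou = mean_iou(pred, gt, class_ids=[1, 2])
```

In the toy example every occupied voxel is predicted as occupied, so the geometric IoU is perfect, while one voxel carries the wrong class and lowers the mIoU.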

OpenOccupancy baselines

Existing occupancy perception methods are designed for front-view perception
To extend these approaches to surrounding occupancy perception, each camera-view input is processed individually
Inconsistency may exist in the overlap region of two adjacent outputs
Establish baselines to learn surrounding semantic occupancy from 360-degree inputs
Baselines include camera-based, LiDAR-based and multi-modal
LiDAR-based baseline uses parameterized voxelization and 3D sparse convolutions
Camera-based baseline uses a 2D encoder and a 2D-to-3D view transform
Multi-modal baseline uses an adaptive fusion module to integrate LiDAR and camera voxel features
Loss functions include cross-entropy, Lovász-softmax, affinity and explicit depth supervision
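
One common way to realise an adaptive fusion of two voxel feature grids is a learned gate that blends them per voxel. The sketch below illustrates that idea only; the summary does not specify the module's internals, so the per-voxel linear projection, the function name `adaptive_fusion`, and the `weight` and `bias` parameters are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(lidar_feat, camera_feat, weight, bias):
    """Blend two voxel feature grids with a per-voxel, per-channel gate.

    lidar_feat, camera_feat: arrays of shape (D, H, W, C).
    weight: (2C, C) projection and bias: (C,), both hypothetical parameters
    that would be learned during training.
    """
    stacked = np.concatenate([lidar_feat, camera_feat], axis=-1)  # (D, H, W, 2C)
    gate = sigmoid(stacked @ weight + bias)                        # (D, H, W, C)
    # gate -> 1 favours LiDAR features, gate -> 0 favours camera features.
    return gate * lidar_feat + (1.0 - gate) * camera_feat

lidar = np.ones((2, 3, 4, 5))
camera = np.zeros((2, 3, 4, 5))
weight = np.zeros((10, 5))
bias = np.zeros(5)
fused = adaptive_fusion(lidar, camera, weight, bias)
```

With zero weights the gate sits at 0.5 everywhere, so the output is a plain average of the two inputs; training would move the gate towards whichever modality is more reliable at each voxel.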

Cascade occupancy network

The input of the surrounding occupancy perception covers 5x the perceptive range compared to front-view occupancy perception.
Stride parameter S is set to 4 in the proposed baselines.
A smaller stride parameter (e.g. S=2) enhances performance but increases GPU memory consumption.
Cascade Occupancy Network (CONet) is proposed for efficient yet accurate surrounding occupancy perception.
CONet introduces a coarse-to-fine pipeline which can be built upon the proposed baselines.
CONet can be generalized to camera-based and LiDAR-based baselines.
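
To make the stride trade-off concrete, the sketch below counts voxels at different strides for the 40 x 512 x 512 evaluation grid and shows one generic way to refine only the occupied coarse voxels. The refinement function is an illustration of the coarse-to-fine idea under the assumption that each occupied coarse voxel is split into `factor**3` children; it is not CONet's actual procedure, and the names `strided_shape` and `split_occupied` are hypothetical.

```python
import numpy as np

FULL_SHAPE = (40, 512, 512)  # full-resolution grid from the evaluation protocol

def strided_shape(stride):
    """Grid shape of a prediction made at the given stride."""
    return tuple(n // stride for n in FULL_SHAPE)

def split_occupied(coarse_occ, factor):
    """Return fine-grid indices of the children of every occupied coarse voxel."""
    coords = np.argwhere(coarse_occ)  # (N, 3) indices of occupied coarse voxels
    offsets = np.stack(
        np.meshgrid(*[np.arange(factor)] * 3, indexing="ij"), axis=-1
    ).reshape(-1, 3)                  # (factor**3, 3) child offsets
    return (coords[:, None, :] * factor + offsets[None, :, :]).reshape(-1, 3)

voxels_s4 = int(np.prod(strided_shape(4)))  # 10 x 128 x 128
voxels_s2 = int(np.prod(strided_shape(2)))  # 20 x 256 x 256

coarse = np.zeros((2, 2, 2), dtype=bool)
coarse[1, 0, 1] = True
fine = split_occupied(coarse, factor=2)
```

Halving the stride multiplies the number of dense voxels by eight, which is why dense high-resolution prediction is costly, whereas refining only occupied voxels scales with the number of occupied cells rather than with the full grid.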

OpenOccupancy experiment

Experiment setup is given
Discusses camera-based, LiDAR-based and multimodal methods
Analyzes baseline performance under different settings
Investigates efficiency and effectiveness of CONet

Experiment setup

Evaluate surrounding semantic occupancy segmentation performance
Use nuScenes-Occupancy for comprehensive experiments
Single-view methods generate occupancy predictions for each view individually
LiDAR-based methods use 10 LiDAR sweeps as input
Camera-based methods use input images of size 1600 × 900
Use ImageNet pretrained ResNet50 as backbone
Utilize trilinear interpolation to upsample outputs
Leverage AdamW optimizer with cosine learning rate scheduler
Train models for 24 epochs with batch size of 8 on 8 A100 GPUs
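
For readers unfamiliar with cosine scheduling, the snippet below shows the standard cosine annealing formula evaluated per epoch. The base learning rate, the absence of warm-up, and the per-epoch granularity are assumptions made for illustration; the summary does not state them.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given step (no warm-up assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

EPOCHS = 24        # from the experiment setup
BASE_LR = 1.0      # hypothetical value, used only to show the curve's shape
schedule = [cosine_lr(epoch, EPOCHS, BASE_LR) for epoch in range(EPOCHS + 1)]
```

The rate starts at `base_lr`, passes through half of it at the midpoint of training, and decays smoothly to `min_lr` at the end.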

Surrounding occupancy assessment

Six modern approaches are analyzed for surrounding occupancy perception performance
Single-view methods are outperformed by surrounding occupancy perception paradigms
Proposed baselines show adaptability and scalability
Multi-modal baseline significantly enhances performance
CONet efficiently refines low-resolution predictions

Baselines under different settings

Using a larger input size yields relative improvements of 16% and 20% in IoU and mIoU, respectively.
Replacing ResNet50 with ResNet101 further enhances mIoU by a relative 11%.
Using multiple LiDAR sweeps as input improves on the single-sweep counterpart by 43% and 5% in IoU and mIoU, respectively.
Adaptive fusion yields relative mIoU gains of 6% and 5%.

Efficiency and effectiveness of CONet

Proposed baselines generate low-resolution predictions
Smaller stride parameter enhances performance but increases GPU memory and GFLOPs
CONet achieves better performance than high-resolution baselines while reducing GPU memory and GFLOPs
Ablation study shows combining 2D semantic and geometric features improves performance by 33%

Conclusion

Propose OpenOccupancy benchmark for surrounding semantic occupancy perception in driving scenarios
Introduce nuScenes-Occupancy dataset with dense semantic occupancy annotations
Establish camera-based, LiDAR-based and multimodal baselines
Propose CONet to alleviate computational burden of high-resolution occupancy predictions
Results show camera-based and LiDAR-based baselines are complementary and multi-modal baseline enhances performance by 47% and 29%
CONet improves baseline by ∼30% with minimal latency overhead
