Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Semantic occupancy perception is essential for autonomous driving.
  • Existing benchmarks lack diversity in urban scenes and only evaluate front-view predictions.
  • OpenOccupancy is the first surrounding semantic occupancy perception benchmark.
  • Annotations rely on LiDAR points superimposition, which can miss some occupancy labels.
  • Augmenting And Purifying (AAP) pipeline is used to densify the annotations.
  • Camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark.
  • Cascade Occupancy Network (CONet) is proposed to refine the coarse prediction.

Paper Content

Introduction

  • Accurately perceiving 3D structures of different objects and regions in urban scenes is important for safe driving
  • There is growing interest in semantic occupancy perception
  • Semantic occupancy perception is a challenging research direction
  • Most relevant benchmarks are for indoor scenes
  • SemanticKITTI extends occupancy perception to driving scenarios but is small in scale and limited in diversity
  • OpenOccupancy is the first surrounding semantic occupancy perception benchmark
  • nuScenes-Occupancy is the first dataset for surrounding semantic occupancy segmentation
  • Established camera-based, LiDAR-based and multi-modal baselines in the OpenOccupancy benchmark
  • Cascade Occupancy Network (CONet) proposed to alleviate computational burden of high-resolution occupancy predictions
  • Comprehensive experiments conducted on proposed baselines, CONet, and modern occupancy perception approaches
  • Semantic occupancy perception originates from SUNCG
  • Various relevant benchmarks have been released
  • Most existing occupancy perception methods rely on geometric inputs
  • MonoScene is the first camera-based occupancy perception method
  • TPVFormer proposes a tri-perspective view representation

The openoccupancy benchmark

  • Introduced concept of surrounding semantic occupancy perception
  • Introduced nuScenes-Occupancy dataset
  • Presented evaluation protocol to assess surrounding occupancy perception algorithms
  • Proposed camera-based, LiDAR-based and multi-modal baselines for OpenOccupancy Benchmark

Surrounding semantic occupancy perception

  • Generating a 3D representation of volumetric occupancy and semantic labels for a scene
  • Monocular paradigm focuses on front-view perception, surrounding occupancy perception algorithms target at producing semantic occupancy in surround-view driving scenarios
  • Given 360-degree inputs, algorithms are required to predict surrounding occupancy labels
  • Surround-view inputs cover 5x more perceptive range than front-view sensors
  • Core challenge of surrounding occupancy perception is efficiently constructing high-resolution occupancy

Nuscenes-occupancy

  • SemanticKITTI is the first dataset for outdoor occupancy perception, but it is limited.
  • nuScenes-Occupancy is a large-scale surrounding occupancy perception dataset.
  • Annotation is initialized by LiDAR points superimposition.
  • Pseudo labels are produced by a pretrained model.
  • Human effort is used to purify the augmented labels.

Evaluation protocol

  • Evaluation range is set to [-51.2m, 51.2m] for X, Y axis and [-3m, 5m] for Z axis
  • Voxel resolution is 0.2m, resulting in 40 x 512 x 512 voxels for occupancy prediction
  • Utilize Intersection of Union (IoU) as geometric metric
  • Calculate mean IoU (mIoU) of each class as semantic metric
  • Noise class is ignored in evaluation

Openoccupancy baselines

  • Existing occupancy perception methods are designed for front-view perception
  • To extend these approaches to surrounding occupancy perception, each camera-view input is processed individually
  • Inconsistency may exist in the overlap region of two adjacent outputs
  • Establish baselines to learn surrounding semantic occupancy from 360-degree inputs
  • Baselines include camera-based, LiDAR-based and multi-modal
  • LiDAR-based baseline uses parameterized voxelization and 3D sparse convolutions
  • Camera-based baseline uses 2D encoder and 2D to 3D view transform
  • Multi-modal baseline uses adaptive fusion module to integrate LiDAR and camera voxel features
  • Loss functions include cross-entropy, lovasz-softmax, affinity and explicit depth supervision

Cascade occupancy network

  • The input of the surrounding occupancy perception covers 5x the perceptive range compared to front-view occupancy perception.
  • Stride parameter S is set to 4 in the proposed baselines.
  • Smaller stride parameter (e.g. S=2) enhances performance but increases GPU memory.
  • Cascade Occupancy Network (CONet) is proposed for efficient yet accurate surrounding occupancy perception.
  • CONet introduces a coarse-to-fine pipeline which can be built upon the proposed baselines.
  • CONet can be generalized to camera-based and LiDAR-based baselines.

Openoccupancy experiment

  • Experiment setup is given
  • Discusses camera-based, LiDAR-based and multimodal methods
  • Analyzes baseline performance under different settings
  • Investigates efficiency and effectiveness of CONet

Experiment setup

  • Evaluate surrounding semantic occupancy segmentation performance
  • Use nuScenes-Occupancy for comprehensive experiments
  • Single-view methods generate occupancy predictions for each view individually
  • LiDAR-based methods use 10 LiDAR sweeps as input
  • Camera-based methods use 1600 ร— 900 image size
  • Use ImageNet pretrained ResNet50 as backbone
  • Utilize trilinear interpolation to upsample outputs
  • Leverage AdamW optimizer with cosine learning rate scheduler
  • Train models for 24 epochs with batch size of 8 on 8 A100 GPUs

Surrounding occupancy assessment

  • Six modern approaches are analyzed for surrounding occupancy perception performance
  • Single-view methods are outperformed by surrounding occupancy perception paradigms
  • Proposed baselines show adaptability and scalability
  • Multi-modal baseline significantly enhances performance
  • CONet efficiently refines low-resolution predictions

Baselines under different settings

  • Using a larger input size improves IoU and mIoU by 16% and 20%.
  • Replacing ResNet50 with ResNet101 further enhances mIoU by 11%.
  • Utilizing multi-sweeps as input improves single-sweep counterpart by 43% and 5% on IoU and mIoU.
  • Adaptive fusion relatively enhances mIoU by 6% and 5%.

Efficiency and effectiveness of conet

  • Proposed baselines generate low-resolution predictions
  • Smaller stride parameter enhances performance but increases GPU memory and GFLOPs
  • CONet achieves better performance than high-resolution baselines while reducing GPU memory and GFLOPs
  • Ablation study shows combining 2D semantic and geometric features improves performance by 33%

Conclusion

  • Propose OpenOccupancy benchmark for surrounding semantic occupancy perception in driving scenarios
  • Introduce nuScenes-Occupancy dataset with dense semantic occupancy annotations
  • Establish camera-based, LiDAR-based and multimodal baselines
  • Propose CONet to alleviate computational burden of high-resolution occupancy predictions
  • Results show camera-based and LiDAR-based baselines are complementary and multi-modal baseline enhances performance by 47% and 29%
  • CONet improves baseline by โˆผ30% with minimal latency overhead