Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Vision-centric perception has been used in autonomous driving tasks.
  • Traditional benchmarks do not consider inference time delay.
  • ASAP is the first benchmark to evaluate online performance of vision-centric perception.
  • An annotation-extending pipeline is used to generate high-frame-rate labels.
  • The SPUR evaluation protocol is constructed to evaluate performance under different computational resources.
  • Model rankings change under different constraints.
  • Baselines for camera-based streaming 3D detection are established.

Paper Content

Introduction

  • ASAP benchmark proposed to evaluate accuracy-latency trade-off of camera-based perception methods
  • Annotation-extending pipeline proposed to annotate 12Hz raw images of nuScenes dataset
  • Simple baselines established in ASAP benchmark to alleviate influence of inference delay
  • SPUR evaluation protocol constructed to evaluate streaming performance of proposed baselines and 7 modern camera-based 3D detectors under various computational constraints
  • Baseline BEVDepth-Sv consistently improves streaming performance on different platforms
  • Model rank changes under different computational resources
  • Offline performance cannot serve as deterministic criterion for different approaches

Autonomous-driving benchmark

  • Last decade has seen progress in autonomous-driving perception
  • Several benchmarks focus on 2D annotation
  • Several benchmarks collect multi-modal data with 3D annotations
  • Surround-view image data boosts camera-based 3D perception
  • Vision-centric 3D perception trend shows promising accuracy
  • Benchmarks evaluate perception methods in an offline manner, neglecting inference time delay

Vision-centric driving perception

  • Cameras can be deployed with lower budgets than LiDAR
  • Cameras can extract rich semantic information from color and texture
  • BEV representation promotes development of vision-centric perception
  • BEV-based methods have achieved promising detection accuracy, approaching LiDAR-based counterparts
  • Runtime of most methods exceeds 300ms, not suitable for practical deployment

Streaming perception

  • Streaming perception is a concept proposed in [34] to evaluate accuracy-latency trade-off of 2D detectors.
  • Kalman filter, dynamic scheduling, and reinforcement learning are used to reduce problems caused by inference time delay.
  • [71] simplifies streaming perception to predicting the next frame with an efficient detector.
  • [25] and [17] use recurrent neural networks to process LiDAR slices.
  • [6] uses polar-pillar representation for LiDAR slices.
  • Streaming paradigm of vision-centric perception in autonomous driving is still under investigation.

The ASAP benchmark

  • ASAP concept introduced
  • Evaluating streaming algorithms on the original nuScenes dataset is difficult
  • High frame-rate nuScenes-H dataset introduced
  • SPUR evaluation protocol presented to assess streaming performance
  • Simple baselines proposed to alleviate inference time delay in streaming detection
  • ASAP benchmark evaluates most recent prediction if processing of current frame is not finished

Autonomous-driving streaming perception

  • Streaming paradigm evaluates performance at every input timestamp
  • Inputs are surround-view images at timestamp t_i
  • Predictions are not synchronized with input timestamps
  • Evaluation metric is L(·)
  • Evaluation is illustrated in Fig. 2
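
The matching rule described above (every input timestamp is scored against the most recent prediction that has already finished) can be sketched as a small simulation. This is a minimal illustration, not the benchmark's own simulator: the function name, the fixed per-frame latency, and the policy that an idle detector starts on the next arriving frame are all assumptions made for the example.

```python
from bisect import bisect_right

def simulate_streaming_matches(input_times, latency):
    """For each input timestamp, return the index of the frame whose
    prediction would be evaluated, or None if nothing has finished yet.

    Assumptions (hypothetical, for illustration only): the detector has a
    fixed latency and, when idle, starts on the next arriving frame.
    """
    finish_times, frame_ids = [], []
    busy_until = input_times[0] if input_times else 0
    for i, t in enumerate(input_times):
        if t >= busy_until:            # detector is idle: process this frame
            busy_until = t + latency
            finish_times.append(busy_until)
            frame_ids.append(i)
        # otherwise the frame is skipped because the detector is still busy

    matches = []
    for t in input_times:
        # number of predictions that finished at or before time t
        k = bisect_right(finish_times, t)
        matches.append(frame_ids[k - 1] if k > 0 else None)
    return matches

# Frames every 100 ms, detector latency 150 ms
print(simulate_streaming_matches([0, 100, 200, 300, 400, 500], 150))
# [None, None, 0, 0, 2, 2]
```

With a latency longer than the frame interval, every evaluated prediction is at least two frames old, which is the mismatch the streaming metric penalizes.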

nuScenes-H

  • The nuScenes dataset is a popular benchmark for autonomous driving perception.
  • The annotation frame rate of the original nuScenes dataset is 2Hz, which is lower than the inference rate of most camera-based 3D detectors.
  • An annotation-extending pipeline is proposed to annotate the 12Hz raw images.
  • Object interpolation is used to calculate intermediate annotations.
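
As a rough illustration of the interpolation step, the sketch below linearly interpolates an object's box center between two annotated keyframes. The summary does not say which interpolation scheme the pipeline uses or how size and orientation are handled, so linear interpolation of the center, along with the function and parameter names, is an assumption.

```python
def interpolate_center(t, t0, center0, t1, center1):
    """Linearly interpolate a box center at time t between two keyframes.

    Hypothetical helper: (t0, center0) and (t1, center1) are the annotated
    keyframes (2Hz in nuScenes), and t is an intermediate 12Hz timestamp.
    """
    if t0 == t1 or not (t0 <= t <= t1):
        raise ValueError("t must lie between two distinct keyframe times")
    w = (t - t0) / (t1 - t0)  # 0 at the first keyframe, 1 at the second
    return tuple(a + w * (b - a) for a, b in zip(center0, center1))

# Halfway between keyframes that are 0.5 s apart
print(interpolate_center(0.25, 0.0, (0.0, 0.0, 0.0), 0.5, (6.0, 3.0, 0.0)))
# (3.0, 1.5, 0.0)
```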

SPUR evaluation protocol

  • Designed the Streaming Perception Under constRained-computation (SPUR) evaluation protocol to investigate streaming performance of 3D detectors
  • Introduced streaming variants (e.g., mAP-S) of the nuScenes detection metrics, such as Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Attribute Error (AAE), NuScenes Detection Score (NDS) and mean Average Precision (mAP)
  • Investigated two computation-constrained evaluation protocols: varying platforms and sharing computational resources

ASAP baselines

  • ASAP benchmark evaluates most recent predictions if current computation is not finished, resulting in mismatch between previously processed observation and current one.
  • To mitigate mismatch problem, forecasting future state is a simple solution.
  • Establish velocity-based baseline to update future states by predicted object motion.
  • Investigate learning-based baseline to directly estimate future locations of objects.
  • Velocity-based updating strategy can be applied to any modern 3D detector.
  • Learning-based forecasting baseline directly forecasts future locations of objects.
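
The velocity-based idea can be sketched as a constant-velocity update: each predicted box center is shifted by the predicted velocity multiplied by the time elapsed since the processed frame. This is a minimal sketch under that assumption; the function name, the 2D ground-plane representation, and the units (meters, meters per second, seconds) are chosen for illustration and are not taken from the paper.

```python
def update_by_velocity(center_xy, velocity_xy, delay_s):
    """Shift a predicted box center forward in time under a
    constant-velocity assumption (hypothetical helper)."""
    x, y = center_xy
    vx, vy = velocity_xy
    return (x + vx * delay_s, y + vy * delay_s)

# An object moving at 8 m/s along x, evaluated 0.25 s after the processed frame
print(update_by_velocity((10.0, 2.0), (8.0, 0.0), 0.25))
# (12.0, 2.0)
```

The faster the object and the longer the delay, the larger the correction, which is consistent with high-speed objects benefiting most from this strategy.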

Streaming evaluation on ASAP benchmark

  • Experiment setup is given
  • Computation-constrained assessment is discussed
  • Streaming evaluation with platforms altering and resource sharing is discussed
  • Association between streaming performance and input size/backbone selection is analyzed

Experiment setup

  • ASAP benchmark uses camera-based 3D detection for vision-centric perception
  • nuScenes-H dataset is used to evaluate 3D detectors
  • Evaluation is conducted with hardware-dependent simulator
  • Inference times are measured with open-sourced code on a specific GPU
  • Batch size is set to 6 for monocular paradigms

Computation-constrained assessment

  • Seven modern 3D detectors and two proposed baselines are analyzed under three platforms
  • Compared to offline evaluation, all 3D detectors suffer from performance drops on the ASAP benchmark
  • Performance degrades as computation power is increasingly constrained
  • Model rankings change as the available computation varies
  • Future state estimation can compensate for inference time delay and improve streaming performance
  • High-speed objects particularly benefit from velocity-based updating strategy
  • Learning-based forecasting baseline built upon BEVDepth improves mAP-S, but not as much as velocity-based updating
  • Performance of 3D detectors drops when GPU is simultaneously processing classification tasks
  • Model latency and computation budget should be considered when optimizing practical deployment

Analysis on input size and backbone selection

  • Experiments were conducted to investigate the association between streaming performance and input size/backbone selection.
  • Results showed that BEVDepth@ResNet101 outperformed BEVDepth@ResNet50 by 3.4%.
  • High-resolution inputs and stronger backbones may hinder streaming performance due to high latency.
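
To see why latency matters at a 12Hz input rate, the arithmetic below counts how many input frames arrive before a prediction becomes available. It is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes a constant latency with processing starting exactly when the frame arrives.

```python
import math

def frames_until_available(latency_ms, frame_rate_hz=12):
    """Number of input frames after the processed one at which its
    prediction can first be evaluated (hypothetical helper)."""
    # latency expressed in units of the frame interval, rounded up
    return math.ceil(latency_ms * frame_rate_hz / 1000)

print(frames_until_available(300))  # 4
print(frames_until_available(50))   # 1
```

At 12Hz the frame interval is about 83 ms, so a 300 ms detector is first evaluated four frames after the image it processed, while a 50 ms detector is evaluated on the very next frame.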

Conclusion

  • Proposed ASAP benchmark to evaluate online performance of vision-centric driving perception approaches
  • Extended 12Hz raw images of nuScenes dataset to introduce nuScenes-H dataset for camera-based streaming 3D detection
  • Established SPUR protocol for computation-constrained evaluation
  • Proposed ASAP baselines to compensate for inference time delay
  • Analyzed streaming performance of seven modern camera-based 3D detectors and two proposed baselines under various computation constraints
  • Model latency and computation budget should be regarded as design choices for practical deployment
  • Evaluation conducted with modern GPUs (e.g., NVIDIA RTX3090, NVIDIA RTX2070S, NVIDIA GTX1060)
  • Future work to evaluate with system-on-a-chip platforms and 8-bit integer/floating-point precision acceleration
  • Future work to consider more autonomous-driving tasks in SPUR evaluation protocol
  • Velocity-based updating baseline to compensate for inference delay
  • Kalman filter refinement to benefit streaming perception