Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Vision-centric perception has been used in autonomous driving tasks.
- Traditional benchmarks do not consider inference time delay.
- ASAP is the first benchmark to evaluate online performance of vision-centric perception.
- Annotation-extending pipeline is used to generate high-frame-rate labels.
- SPUR evaluation protocol is constructed to evaluate performance under different computational resources.
- Model rank alters under different constraints.
- Baselines for camera-based streaming 3D detection are established.
Paper Content
Introduction
- ASAP benchmark proposed to evaluate accuracy-latency trade-off of camera-based perception methods
- Annotation-extending pipeline proposed to annotate 12Hz raw images of nuScenes dataset
- Simple baselines established in ASAP benchmark to alleviate influence of inference delay
- SPUR evaluation protocol constructed to evaluate streaming performance of proposed baselines and 7 modern camera-based 3D detectors under various computational constraints
- Baseline BEVDepth-Sv consistently improves streaming performance on different platforms
- Model rank changes under different computational resources
- Offline performance cannot serve as deterministic criterion for different approaches
Related work
Autonomous-driving benchmark
- Last decade has seen progress in autonomous-driving perception
- Several benchmarks focus on 2D annotation
- Several benchmarks collect multi-modal data with 3D annotations
- Surround-view image data boosts camera-based 3D perception
- Vision-centric 3D perception trend shows promising accuracy
- Benchmarks evaluate perception methods in an offline manner, neglecting inference time delay
Vision-centric driving perception
- Cameras can be deployed with lower budgets than LiDAR
- Cameras can extract rich semantic information from color and texture
- BEV representation promotes development of vision-centric perception
- BEV-based methods have achieved promising detection accuracy, approaching LiDAR-based counterparts
- Runtime of most methods exceeds 300ms, not suitable for practical deployment
Streaming perception
- Streaming perception is a concept proposed in [34] to evaluate accuracy-latency trade-off of 2D detectors.
- Kalman filter, dynamic scheduling, and reinforcement learning are used to reduce problems caused by inference time delay.
- [71] simplifies streaming perception to predicting the next frame with an efficient detector.
- [25] and [17] use recurrent neural networks to process LiDAR slices.
- [6] uses polar-pillar representation for LiDAR slices.
- Streaming paradigm of vision-centric perception in autonomous driving is still under investigation.
The asap benchmark
- ASAP concept introduced
- Difficulty to evaluate streaming algorithms on original nuScenes dataset
- High frame-rate nuScenes-H dataset introduced
- SPUR evaluation protocol presented to assess streaming performance
- Simple baselines proposed to alleviate inference time delay in streaming detection
- ASAP benchmark evaluates most recent prediction if processing of current frame is not finished
Autonomous-driving streaming perception
- Streaming paradigm evaluates performance at every input timestamp
- Inputs are surround-view images at timestamp t i
- Predictions are not synchronized with input timestamps
- Evaluation metric is L(โข)
- Evaluation is illustrated in Fig. 2
Nuscenes-h
- The nuScenes dataset is a popular benchmark for autonomous driving perception.
- Annotation frame rate of the original nuScenes dataset is 2Hz, which is slower than the inference speed of most camera-based 3D detectors.
- An annotation-extending pipeline is proposed to annotate the 12Hz raw images.
- An object interpolation is used to calculate intermediate annotations.
Spur evaluation protocol
- Designed the Streaming Perception Under constRained-computation (SPUR) evaluation protocol to investigate streaming performance of 3D detectors
- Introduced streaming metrics such as Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Attribute Error (AAE), NuScenes Detection Score (NDS) and mean Average Precision (mAP)
- Investigated two computation-constrained evaluation protocols: varying platforms and sharing computational resources
Asap baselines
- ASAP benchmark evaluates most recent predictions if current computation is not finished, resulting in mismatch between previously processed observation and current one.
- To mitigate mismatch problem, forecasting future state is a simple solution.
- Establish velocity-based baseline to update future states by predicted object motion.
- Investigate learning-based baseline to directly estimate future locations of objects.
- Velocity-based updating strategy is applied to any modern 3D detector.
- Learning-based forecasting baseline directly forecasts future locations of objects.
Streaming evaluation on asap benchmark
- Experiment setup is given
- Computation-constrained assessment is discussed
- Streaming evaluation with platforms altering and resource sharing is discussed
- Analysis of association between streaming performance and input size/backbone selection is done
Experiment setup
- ASAP benchmark uses camera-based 3D detection for vision-centric perception
- nuScenes-H dataset is used to evaluate 3D detectors
- Evaluation is conducted with hardware-dependent simulator
- Inference times are measured with open-sourced code on a specific GPU
- Batch size is set to 6 for monocular paradigms
Computation-constrained assessment
- Seven modern 3D detectors and two proposed baselines are analyzed under three platforms
- Compared to offline evaluation, all 3D detectors suffer from performance drops on the ASAP benchmark
- Performance degrades as computation power is increasingly constrained
- Model rank alters under different computation performances
- Future state estimation can compensate for inference time delay and improve streaming performance
- High-speed objects particularly benefit from velocity-based updating strategy
- Learning-based forecasting baseline built upon BEVDepth improves mAP-S, but not as much as velocity-based updating
- Performance of 3D detectors drops when GPU is simultaneously processing classification tasks
- Model latency and computation budget should be considered when optimizing practical deployment
Analysis on input size and backbone selection
- Experiments were conducted to investigate the association between streaming performance and input size/backbone selection.
- Results showed that BEVDepth@ResNet101 improved BEVDepth@ResNet50 by 3.4%.
- High-resolution inputs and stronger backbones may hinder streaming performance due to high latency.
Conclusion
- Proposed ASAP benchmark to evaluate online performance of vision-centric driving perception approaches
- Extended 12Hz raw images of nuScenes dataset to introduce nuScenes-H dataset for camera-based streaming 3D detection
- Established SPUR protocol for computation-constrained evaluation
- Proposed ASAP baselines to compensate for inference time delay
- Analyzed streaming performance of seven modern camera-based 3D detectors and two proposed baselines under various computation constraints
- Model latency and computation budget should be regarded as design choices for practical deployment
- Evaluation conducted with modern GPUs (e.g., NVIDIA RTX3090, NVIDIA RTX2070S, NVIDIA GTX1060)
- Future work to evaluate with system-on-a-chip and 8-bit int/floating point precision acceleration
- Future work to consider more autonomous-driving tasks in SPUR evaluation protocol
- Velocity-based updating baseline to compensate for inference delay
- Kalman filter refinement to benefit streaming perception