Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Autonomous driving systems are composed of modular tasks in sequential order.
- There is a trend to develop systems that can perform a wide variety of tasks.
- Contemporary approaches use either standalone models or multi-task paradigms.
- A favorable algorithm framework should be devised and optimized for planning.
- The key components of perception and prediction are analyzed and prioritized.
- Unified Autonomous Driving (UniAD) is a comprehensive framework that incorporates full-stack driving tasks.
- UniAD is proven to surpass previous state-of-the-arts by a large margin.
Paper Content
Introduction
- Deep learning has enabled the development of autonomous driving algorithms.
- Autonomous driving algorithms involve multiple tasks, including detection, tracking, mapping and motion/occupancy prediction.
- Most industry solutions use standalone models for each task.
- Multi-task learning (MTL) is a popular practice to incorporate multiple tasks into one model.
- UniAD is a goal-oriented system that prioritizes tasks in a hierarchical manner.
Methodology
- UniAD is a computer science pipeline comprised of four perception and prediction modules and one planner.
- Queries are used to connect the pipeline and model multi-agent and scene-level contexts.
- Features are transformed from perspective-view to a unified BEV feature.
- MotionFormer captures interactions among agents and maps and forecasts per-agent future trajectory.
- OccFormer predicts occupancy grid maps with agent identity preserved.
Perception: tracking and mapping
- TrackFormer performs joint detection and multiobject tracking without postprocessing
- TrackFormer introduces detection and track queries to model the life cycle of tracked agents
- TrackFormer contains N layers and the final output query provides rich historical knowledge of valid agents
- MapFormer is based on a 2D panoptic segmentation method
- MapFormer sets lanes, dividers, crossings as things queries and the drivable area as stuff query
- MapFormer has N stacked layers and only the updated query in the last layer is forwarded to MotionFormer
Prediction: motion forecasting
- Researchers have proven the effectiveness of transformer structure on motion task
- MotionFormer predicts all agents’ multimodal future movements in a scene-centric manner
- MotionFormer saves computational cost of aligning the whole scene to each agent’s coordinate
- MotionFormer preserves ego identity in the scene-level setting
- MotionFormer output is multiagent trajectories in the global frame in a single pass
- MotionFormer captures three types of interactions: agent-agent, agent-map and agent-goal point
- MotionFormer uses non-linear optimization to optimize target trajectories
Prediction: occupancy prediction
- Occupancy grid map is a discretized BEV representation
- Each cell holds a belief indicating whether it is occupied
- Occupancy prediction task is designed to discover how the grid map changes in the future
- Previous approaches use RNNs to temporally expand future predictions
- Generate per-agent occupancy masks through hand-crafted clustering postprocessing
- OccFormer incorporates both scene-level and agent-level semantics
- OccFormer is composed of T o sequential blocks
- Motion queries from Motion-Former are max-pooled in the modality dimension
- BEV feature is downscaled to 1/4 resolution
- Each block follows a downsample-upsample manner with an attention module
- Pixel-agent interaction unifies scene-and agent-level understanding
- Instance-level occupancy is generated via matrix multiplication
Planning
- Planning without HD maps or predefined routes requires a high-level command to indicate the direction to go.
- The ego-vehicle query is further endowed with the command intention.
- The final trajectory is predicted by attending to BEV features for road and traffic information.
- An optimization strategy based on Newton’s method is introduced during inference.
Joint learning
- UniAD is trained in two stages
- Shared matching policy is used for pairing predictions to ground truth set
- Bipartite matching is used in tracking and online mapping stage
- Matching results are used in motion and occupancy tasks to improve temporal consistency
Experiments
- Conducted experiments on nuScenes benchmark
- Details of experiments, visualizations, metrics and protocols provided in Supplementary
Main results
- UniAD yields a significant improvement in multi-object tracking compared to other methods
- UniAD shows excellent ability to segment road elements
- UniAD reduces errors in motion forecasting by 38.3% and 65.4% compared to other methods
- UniAD reduces planning L2 error and collision rate by 51.2% and 56.3% compared to ST-P3
Ablation study
- Goal-oriented design philosophy is validated through extensive ablations
- Motion forecasting and occupancy prediction modules are both necessary for safe planning
- Performance of both tasks is improved when they are closely integrated
- Perception modules contribute to motion forecasting performance
- Hierarchical design outperforms naive multi-task learning
Conclusion and future work
- Hierarchical, goal-oriented pipeline proposed for autonomous driving algorithm framework
- Perception and prediction modules analyzed
- Query-based design proposed to connect all nodes in UniAD
- Experiments verify proposed method in all aspects
- Limitations and future work discussed
- Detection and tracking discussed
- Online mapping discussed
- Motion forecasting discussed
- Occupancy prediction discussed
- Planning discussed
- End-to-end motion planning discussed
- Detection designs inherited from BEV-Former