Abstract

  • Autonomous driving systems are composed of modular tasks in sequential order.
  • There is a trend to develop systems that can perform a wide variety of tasks.
  • Contemporary approaches use either standalone models or multi-task paradigms.
  • A favorable algorithm framework should be devised and optimized for planning.
  • The key components of perception and prediction are analyzed and prioritized.
  • Unified Autonomous Driving (UniAD) is a comprehensive framework that incorporates full-stack driving tasks.
  • UniAD is shown to surpass previous state-of-the-art methods by a large margin.

Paper Content

Introduction

  • Deep learning has enabled the development of autonomous driving algorithms.
  • Autonomous driving algorithms involve multiple tasks, including detection, tracking, mapping and motion/occupancy prediction.
  • Most industry solutions use standalone models for each task.
  • Multi-task learning (MTL) is a popular practice to incorporate multiple tasks into one model.
  • UniAD is a goal-oriented system that prioritizes tasks in a hierarchical manner.

Methodology

  • UniAD is a pipeline comprising four perception and prediction modules and one planner.
  • Queries are used to connect the pipeline and model multi-agent and scene-level contexts.
  • Features are transformed from perspective-view to a unified BEV feature.
  • MotionFormer captures interactions among agents and the map, and forecasts each agent's future trajectory.
  • OccFormer predicts occupancy grid maps with agent identity preserved.
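
The data flow described above can be sketched as an ordered list of stages that read named inputs and write one named output. This is only an illustration of how queries connect the modules: the key names (`agent_queries`, `map_queries`, and so on) and the exact inputs listed for each stage are assumptions made for this sketch, and the stage bodies are placeholders rather than real networks.

```python
from typing import Dict, List, Tuple

# (stage name, inputs it reads, output it writes).
# Key names and input lists are assumptions chosen for this sketch.
STAGES: List[Tuple[str, Tuple[str, ...], str]] = [
    ("TrackFormer", ("bev",), "agent_queries"),
    ("MapFormer", ("bev",), "map_queries"),
    ("MotionFormer", ("bev", "agent_queries", "map_queries"), "motion_queries"),
    ("OccFormer", ("bev", "motion_queries"), "occupancy"),
    ("Planner", ("bev", "motion_queries", "command"), "plan"),
]


def run_pipeline(bev: str, command: str) -> Dict[str, str]:
    """Run the stages in order, passing named outputs forward in a shared dict."""
    context = {"bev": bev, "command": command}
    for name, inputs, output in STAGES:
        missing = [key for key in inputs if key not in context]
        if missing:
            raise KeyError(f"{name} is missing inputs: {missing}")
        # Placeholder for a real module: record which inputs were consumed.
        context[output] = f"{name}({', '.join(inputs)})"
    return context


result = run_pipeline(bev="BEV", command="turn left")
print(result["plan"])
```

Keeping the stage order in data rather than in nested calls makes the dependency of each module on upstream queries explicit, which is the point of the query-based design.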

Perception: tracking and mapping

  • TrackFormer performs joint detection and multi-object tracking without postprocessing
  • TrackFormer introduces detection and track queries to model the life cycle of tracked agents
  • TrackFormer contains N layers and the final output query provides rich historical knowledge of valid agents
  • MapFormer is based on a 2D panoptic segmentation method
  • MapFormer treats lanes, dividers and crossings as thing queries and the drivable area as a stuff query
  • MapFormer has N stacked layers and only the updated query in the last layer is forwarded to MotionFormer
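
The life cycle of tracked agents can be illustrated with a small bookkeeping sketch: detection queries give birth to new tracks, and track queries persist until their score stays low for too long. The thresholds, the miss counter and the `step` function are hypothetical choices for illustration; the summary above does not specify how TrackFormer decides births and removals.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BIRTH_THRESHOLD = 0.5  # assumed: detection score needed to start a track
KEEP_THRESHOLD = 0.3   # assumed: score needed to keep a track "seen"
MAX_MISSES = 2         # assumed: frames a track may stay below KEEP_THRESHOLD


@dataclass
class TrackQuery:
    track_id: int
    misses: int = 0


def step(tracks: List[TrackQuery], track_scores: Dict[int, float],
         detection_scores: List[float], next_id: int) -> Tuple[List[TrackQuery], int]:
    """Advance the set of track queries by one frame."""
    survivors = []
    for track in tracks:
        # Reset the miss counter when the agent is seen, otherwise count a miss.
        if track_scores.get(track.track_id, 0.0) >= KEEP_THRESHOLD:
            track.misses = 0
        else:
            track.misses += 1
        if track.misses <= MAX_MISSES:
            survivors.append(track)
    # Confident detection queries become new track queries.
    for score in detection_scores:
        if score >= BIRTH_THRESHOLD:
            survivors.append(TrackQuery(track_id=next_id))
            next_id += 1
    return survivors, next_id


tracks, next_id = [], 0
tracks, next_id = step(tracks, {}, [0.9, 0.2], next_id)   # one track is born
tracks, next_id = step(tracks, {0: 0.8}, [0.6], next_id)  # a second track is born
for _ in range(3):                                        # track 0 fades out
    tracks, next_id = step(tracks, {0: 0.1, 1: 0.9}, [], next_id)
print([t.track_id for t in tracks])
```

Because births and removals are handled inside the query set itself, no separate association or postprocessing step is needed in this sketch.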

Prediction: motion forecasting

  • Recent studies have shown the effectiveness of transformer structures for the motion forecasting task
  • MotionFormer predicts all agents’ multimodal future movements in a scene-centric manner
  • MotionFormer saves computational cost of aligning the whole scene to each agent’s coordinate
  • MotionFormer preserves ego identity in the scene-level setting
  • MotionFormer outputs multi-agent trajectories in the global frame in a single forward pass
  • MotionFormer captures three types of interactions: agent-agent, agent-map and agent-goal point
  • MotionFormer uses non-linear optimization to adjust the target trajectories
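
To make the cost of aligning the whole scene to each agent's coordinate frame concrete, the following sketch shows the rigid transform an agent-centric method would apply once per agent, and counts how many point transforms that implies. The pose format `(x, y, heading)` and the example values are assumptions for illustration; a scene-centric method keeps everything in one global frame and skips these per-agent copies.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Pose = Tuple[float, float, float]  # assumed format: (x, y, heading in radians)


def to_agent_frame(points: List[Point], pose: Pose) -> List[Point]:
    """Express global points in the frame of an agent located at `pose`."""
    ax, ay, heading = pose
    c, s = math.cos(heading), math.sin(heading)
    out = []
    for x, y in points:
        dx, dy = x - ax, y - ay
        # Translate to the agent, then rotate by -heading.
        out.append((c * dx + s * dy, -s * dx + c * dy))
    return out


def to_global_frame(points: List[Point], pose: Pose) -> List[Point]:
    """Inverse of to_agent_frame: rotate by +heading, then translate back."""
    ax, ay, heading = pose
    c, s = math.cos(heading), math.sin(heading)
    return [(ax + c * x - s * y, ay + s * x + c * y) for x, y in points]


scene_points = [(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)]
agent_poses = [(1.0, 2.0, 0.0), (4.0, -1.0, math.pi / 2)]

# Agent-centric: one transformed copy of the scene per agent.
per_agent_views = [to_agent_frame(scene_points, pose) for pose in agent_poses]
transforms_needed = len(agent_poses) * len(scene_points)
print(transforms_needed)
```

The number of transforms grows with the product of agents and scene elements, which is the overhead a scene-centric formulation avoids.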

Prediction: occupancy prediction

  • Occupancy grid map is a discretized BEV representation
  • Each cell holds a belief indicating whether it is occupied
  • Occupancy prediction task is designed to discover how the grid map changes in the future
  • Previous approaches use RNNs to temporally expand future predictions
  • They generate per-agent occupancy masks through hand-crafted clustering postprocessing
  • OccFormer incorporates both scene-level and agent-level semantics
  • OccFormer is composed of T_o sequential blocks
  • Motion queries from MotionFormer are max-pooled in the modality dimension
  • BEV feature is downscaled to 1/4 resolution
  • Each block follows a downsample-upsample manner with an attention module
  • Pixel-agent interaction unifies scene- and agent-level understanding
  • Instance-level occupancy is generated via matrix multiplication
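
Two of the steps above reduce to simple tensor operations: max-pooling motion queries over the modality dimension, and producing instance-level occupancy by a matrix product between agent features and per-pixel scene features. The sketch below uses plain nested lists and tiny made-up shapes; feeding the pooled queries directly into the product is a simplifying assumption, since any intermediate layers in OccFormer are not described in this summary.

```python
from typing import List

Matrix = List[List[float]]


def max_pool_modes(motion_queries: List[Matrix]) -> Matrix:
    """Collapse the modality dimension: [agent][mode][channel] -> [agent][channel]."""
    return [[max(values) for values in zip(*modes)] for modes in motion_queries]


def instance_occupancy(agent_features: Matrix, pixel_features: Matrix) -> Matrix:
    """Matrix product: [agent][channel] x [channel][pixel] -> [agent][pixel] logits."""
    n_pixels = len(pixel_features[0])
    return [
        [sum(a[c] * pixel_features[c][p] for c in range(len(a))) for p in range(n_pixels)]
        for a in agent_features
    ]


# 2 agents, 2 modes, 2 channels (made-up numbers).
motion_queries = [
    [[1.0, 0.0], [0.5, 2.0]],
    [[0.0, 1.0], [-1.0, 0.5]],
]
# 2 channels, 3 pixels (a flattened BEV grid, made-up numbers).
pixel_features = [
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 1.0],
]

pooled = max_pool_modes(motion_queries)
logits = instance_occupancy(pooled, pixel_features)
print(pooled)  # [[1.0, 2.0], [0.0, 1.0]]
print(logits)  # [[1.0, 2.0, 1.0], [0.0, 1.0, 1.0]]
```

Each row of `logits` is one agent's occupancy map over the grid, so agent identity is preserved without any clustering step.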

Planning

  • Planning without HD maps or predefined routes requires a high-level command to indicate the direction to go.
  • The ego-vehicle query is further endowed with the command intention.
  • The final trajectory is predicted by attending to BEV features for road and traffic information.
  • An optimization strategy based on Newton’s method is introduced during inference.
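
As an illustration of a Newton-style refinement at inference time, the sketch below adjusts a single 2D waypoint by minimizing a smooth cost that trades off closeness to the predicted waypoint against proximity to occupied cells. The cost function (squared distance plus Gaussian bumps around occupied cells), its weights and the function names are assumptions made for this example and should not be read as the exact objective used by UniAD.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def cost_grad_hess(p: Point, target: Point, occupied: List[Point],
                   sigma: float = 1.0, w_target: float = 1.0, w_obstacle: float = 1.0):
    """Cost, gradient and Hessian of the assumed planning cost at waypoint p."""
    dx, dy = p[0] - target[0], p[1] - target[1]
    cost = w_target * (dx * dx + dy * dy)
    gx, gy = 2 * w_target * dx, 2 * w_target * dy
    hxx = hyy = 2 * w_target
    hxy = 0.0
    for ox, oy in occupied:
        ex, ey = p[0] - ox, p[1] - oy
        bump = w_obstacle * math.exp(-(ex * ex + ey * ey) / (2 * sigma ** 2))
        cost += bump
        gx -= bump * ex / sigma ** 2
        gy -= bump * ey / sigma ** 2
        hxx += bump * (ex * ex / sigma ** 4 - 1 / sigma ** 2)
        hyy += bump * (ey * ey / sigma ** 4 - 1 / sigma ** 2)
        hxy += bump * ex * ey / sigma ** 4
    return cost, (gx, gy), (hxx, hxy, hyy)


def newton_refine(p: Point, target: Point, occupied: List[Point], steps: int = 10) -> Point:
    """Apply Newton updates p <- p - H^{-1} g for a fixed number of steps."""
    for _ in range(steps):
        _, (gx, gy), (hxx, hxy, hyy) = cost_grad_hess(p, target, occupied)
        det = hxx * hyy - hxy * hxy
        if det <= 0 or hxx <= 0:
            break  # Hessian not positive definite: a plain Newton step is unsafe
        # Closed-form 2x2 solve of H * step = g.
        sx = (hyy * gx - hxy * gy) / det
        sy = (hxx * gy - hxy * gx) / det
        p = (p[0] - sx, p[1] - sy)
    return p


predicted = (0.0, 0.0)    # waypoint proposed by the planner (made-up value)
occupied = [(0.5, 0.0)]   # one occupied cell just ahead (made-up value)
refined = newton_refine(predicted, predicted, occupied)
print(refined)
```

In this toy setting the waypoint is pushed away from the occupied cell while staying near the original prediction; a full planner would apply the same idea to every waypoint of the trajectory.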

Joint learning

  • UniAD is trained in two stages
  • Shared matching policy is used for pairing predictions to ground truth set
  • Bipartite matching is used in tracking and online mapping stage
  • Matching results are used in motion and occupancy tasks to improve temporal consistency
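
The shared matching idea can be sketched as follows: predictions are paired with ground-truth objects once, by minimum-cost bipartite matching, and the resulting pairing is reused to pick targets for downstream tasks. The exhaustive search below is chosen only to keep the example dependency-free (practical implementations typically use the Hungarian algorithm), and the cost values and target names are made up for illustration.

```python
from itertools import permutations
from typing import Dict, List


def bipartite_match(cost: List[List[float]]) -> Dict[int, int]:
    """Optimal one-to-one assignment by exhaustive search.

    cost[i][j] is the cost of pairing prediction i with ground truth j.
    Assumes there are at least as many predictions as ground truths.
    Returns a dict mapping prediction index -> ground-truth index.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best = min(
        permutations(range(n_pred), n_gt),
        key=lambda perm: sum(cost[perm[j]][j] for j in range(n_gt)),
    )
    return {pred: gt for gt, pred in enumerate(best)}


# 3 predictions (queries) versus 2 ground-truth objects; made-up costs.
cost = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
]
assignment = bipartite_match(cost)

# Reuse the same pairing to select targets for a downstream task.
gt_trajectories = ["gt_traj_a", "gt_traj_b"]  # placeholder targets
motion_targets = {pred: gt_trajectories[gt] for pred, gt in assignment.items()}
print(assignment, motion_targets)
```

Reusing one assignment means a query keeps the same ground-truth identity across tasks, which is how the matching can support temporal consistency.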

Experiments

  • Conducted experiments on nuScenes benchmark
  • Details of experiments, visualizations, metrics and protocols provided in Supplementary

Main results

  • UniAD yields a significant improvement in multi-object tracking compared to other methods
  • UniAD shows excellent ability to segment road elements
  • UniAD reduces motion forecasting error (minADE) by 38.3% and 65.4% compared to PnPNet-vision and ViP3D, respectively
  • UniAD reduces planning L2 error and collision rate by 51.2% and 56.3%, respectively, compared to ST-P3

Ablation study

  • Goal-oriented design philosophy is validated through extensive ablations
  • Motion forecasting and occupancy prediction modules are both necessary for safe planning
  • Performance of both tasks is improved when they are closely integrated
  • Perception modules contribute to motion forecasting performance
  • Hierarchical design outperforms naive multi-task learning

Conclusion and future work

  • Hierarchical, goal-oriented pipeline proposed for autonomous driving algorithm framework
  • Perception and prediction modules analyzed
  • Query-based design proposed to connect all nodes in UniAD
  • Experiments verify proposed method in all aspects
  • Limitations and future work discussed
  • Detection and tracking discussed
  • Online mapping discussed
  • Motion forecasting discussed
  • Occupancy prediction discussed
  • Planning discussed
  • End-to-end motion planning discussed
  • Detection designs inherited from BEVFormer