Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Learning powerful representations in BEV for perception tasks is gaining attention from industry and academia.
  • Conventional approaches for autonomous driving algorithms use front or perspective view.
  • BEV perception has advantages such as representing the surrounding scene intuitively and being fusion-friendly.
  • Core problems for BEV perception include reconstructing 3D information, acquiring ground truth annotations, formulating pipelines, and adapting algorithms.
  • This survey reviews recent work on BEV perception, provides a practical guidebook, and points out future research directions.

Paper Content

Introduction

  • The perception task in autonomous driving is essentially a reconstruction of the 3D geometry of the physical world.
  • Representing features from different views in a unified perspective is important.
  • Bird’s-eye-view is a natural and straightforward candidate view.
  • Objects represented in BEV do not suffer from the occlusion and scale problems found in perspective view.
  • BEV perception refers to vision algorithms that operate in the BEV representation for autonomous driving.

Big picture at a glance

  • BEV perception research is divided into three parts: BEV camera, BEV LiDAR and BEV fusion
  • BEV perception is a general task built on top of a series of fundamental tasks
  • Different combinations of sensor input, fundamental task and product scenario can indicate a certain BEV perception algorithm

Motivation for BEV perception research

Significance.

  • BEV perception has the potential to make a real and meaningful impact on academia and society
  • Camera-only (vision) solutions still lag behind LiDAR or fusion based solutions by more than 20-30% in performance
  • Academic perspective: understanding view transformation from 2D to 3D
  • Industrial perspective: cheaper yet accurate deployment of software algorithms
  • BEV representation is one of the best candidates for LiDAR based methods

Space.

  • BEV perception requires learning a robust and generalizable feature representation from both camera and LiDAR inputs
  • Depth estimation from raw sensor inputs is difficult, especially for the camera branch
  • Fusing features from multi-modality input is key and leaves space to innovate

Contributions

  • Reviewed BEV perception research in recent years
  • Analyzed BEV perception literature
  • Provided practical cookbook for improving performance in BEV perception tasks

Background in 3D perception

  • Conventional approaches for 3D perception tasks include monocular camera based 3D object detection, LiDAR based 3D object detection and segmentation, and sensor fusion strategies.
  • Predominant datasets in 3D perception include KITTI, nuScenes and Waymo Open Dataset.
  • Monocular camera-based object detection uses an RGB image to predict 3D location and category of objects
  • LiDAR-based methods use a set of points in 3D space to capture geometry information of objects and outperform camera-based methods
  • Sensor fusion combines data from different sensors (camera, LiDAR, Radar) to improve performance of perception system

Datasets and metrics

  • Autonomous driving datasets and evaluation metrics are introduced
  • Datasets consist of various scenes of different lengths
  • 3D bounding box and 3D segmentation annotation are essential
  • HD-Map configuration is a mainstream trend
  • Multiple modes and various annotations are required

Evaluation metrics

  • LET-3D-APL is a metric used in camera-only 3D detection instead of 3D-AP.
  • LET-3D-APL penalizes longitudinal localization errors by scaling precision using localization affinity.
  • mAP follows the AP metric of 2D object detection, but predictions are matched to ground truth by 2D center distance on the BEV plane instead of by IoU.
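
The center-distance matching described above can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function name `match_by_center_distance`, the tuple layouts, and the greedy highest-score-first strategy are assumptions made for the example, and the distance threshold is left to the caller.

```python
import math


def match_by_center_distance(preds, gts, threshold):
    """Greedily match predictions to ground truth on the BEV plane.

    preds: list of (x, y, score) box centers (hypothetical layout).
    gts:   list of (x, y) ground-truth box centers.
    Returns one True/False flag per prediction, in descending score order:
    True for a true positive, False for a false positive.
    """
    order = sorted(range(len(preds)), key=lambda i: -preds[i][2])
    taken = set()  # ground-truth indices that are already matched
    flags = []
    for i in order:
        px, py, _ = preds[i]
        best_j, best_d = None, threshold
        for j, (gx, gy) in enumerate(gts):
            if j in taken:
                continue
            d = math.hypot(px - gx, py - gy)  # 2D center distance in BEV
            if d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            flags.append(False)
        else:
            taken.add(best_j)
            flags.append(True)
    return flags
```

The resulting flags can then be accumulated into a precision-recall curve in the same way as for 2D AP; only the matching criterion differs.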

NDS.

  • NDS (nuScenes detection score) is a combination of several metrics
  • NDS is computed as a weighted sum of the metrics, with mAP having a weight of 5 and each of the remaining metrics having a weight of 1
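
As a worked illustration of the weighting, the sketch below assumes the nuScenes convention in which the remaining metrics are five mean true-positive errors (translation, scale, orientation, velocity, attribute), each converted to a score by `1 - min(1, error)`, and the weighted sum is normalized by the total weight of 10. The function name `nds` is hypothetical.

```python
def nds(mean_ap, tp_errors):
    """Combine mAP and true-positive error terms into a single score.

    mean_ap:   mAP in [0, 1].
    tp_errors: the mean true-positive errors (assumed here to be the five
               nuScenes terms: mATE, mASE, mAOE, mAVE, mAAE).
    """
    # Each error is clipped to 1 and turned into a score where higher is better.
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    # mAP carries weight 5, each TP score weight 1; divide by the total weight.
    return (5.0 * mean_ap + sum(tp_scores)) / (5.0 + len(tp_scores))
```

With five error terms the denominator is 10, so a detector with perfect mAP and zero errors scores 1.0.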

Methodology of BEV perception

  • BEV perception is divided into three settings based on input modality
  • Tab. 2 summarizes the taxonomy of BEV perception literature
  • Tab. 3 depicts the performance gain of 3D object detection and segmentation on popular leaderboards over the years

Method

BEV camera

General pipeline

  • IPM (Inverse Perspective Mapping) maps pixels onto the BEV plane using the intrinsic and extrinsic matrices of the cameras
  • LSS is the first method to predict depth distribution of image features using neural networks
  • Other works develop different methods to conduct view transformation
  • General pipeline for fusing image and point cloud data includes modal-specific feature extractors and temporal and ego-motion information

Methods

Modality

  • Attempts to solve 3D pretraining have been made
  • Recent success in 2D vision Transformer can be transferred to 3D space
  • Future investigation in 3D perception is possible

View transformation

  • Recent research has focused on view transformation module
  • 3D information is constructed from either 2D feature or 3D prior assumption
  • View transformation plays a vital role in camera-only 3D perception
  • View transformation can be divided into two aspects: 2D-3D and 3D-2D
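  • 2D-3D methods predict a depth distribution for each grid cell of the 2D feature map and use it to lift the 2D features into voxel space
  • 3D-2D methods start from locations in 3D (voxel or BEV) space and gather the corresponding 2D features through the 3D-to-2D projection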
  • Stereo methods use strong prior to obtain depth value/distribution
  • 3D-2D method originated from Inverse Perspective Mapping (IPM)
  • Cross-attention mechanism in transformer architecture models 3D-2D projection
  • Grid sampler accelerates 3D-2D view transformation
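
The 2D-3D idea of weighting an image feature by a predicted depth distribution can be sketched for a single pixel as follows. This is a simplified illustration under stated assumptions, not the LSS implementation: the names `softmax` and `lift_pixel` are hypothetical, depth is discretized into bins, and the depth logits would come from a network in practice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lift_pixel(feature, depth_logits):
    """Lift one pixel's feature vector along its camera ray.

    feature:      C-dimensional feature of the pixel.
    depth_logits: D unnormalized scores, one per discrete depth bin.
    Returns a D x C list: the feature scaled by the probability of each bin,
    i.e. the outer product of the depth distribution and the feature.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]
```

Each of the D rows would then be placed at the 3D location given by the pixel's ray and the bin's depth, and the rows falling into the same BEV cell pooled together.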

Discussion on BEV and perspective methods

  • Camera-only 3D perception focuses on predicting 3D object localization from perspective view
  • BEV representation is used to tackle the problem of objects with the same size in 3D space having different sizes on image plane
  • Recent BEV-based methods have been successful due to the nuScenes dataset, help from LiDAR-based methods, and long-term development of monocular methods
  • BEV-based methods and perspective methods are two different ways to reconstruct 3D information from 2D images

BEV LiDAR

Pre-BEV feature extraction

  • Point-based methods process the raw point cloud directly, while voxel-based methods voxelize points into discrete grids
  • 3D convolution or 3D sparse convolution is used to extract point cloud features
  • The feature of each point is calculated by a linear layer, batch normalization, and an activation function
  • The feature of a voxel is the element-wise max-pooling over all points inside it
  • 3D convolution is applied to aggregate local voxel features
  • Feature maps are transformed into BEV and processed by an RPN to generate object proposals
  • SECOND introduces sparse convolution to speed up training and inference
  • CenterPoint is a powerful center-based anchor-free 3D detector
  • PV-RCNN combines point and voxel branches to learn more discriminative point cloud features
  • SA-SSD, Voxel R-CNN, Object DGCNN, VoTr, and SST all use different methods to process voxel features
  • AFDetV2 formulates a single-stage anchor-free network
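
The voxelization and element-wise max-pooling steps listed above can be sketched in a few lines. This is a minimal illustration assuming cubic voxels and per-point features that have already been computed; the name `voxelize_max_pool` and the dictionary output are choices made for the example.

```python
import math
from collections import defaultdict


def voxelize_max_pool(points, features, voxel_size):
    """Group points into voxels and max-pool their features per channel.

    points:     list of (x, y, z) coordinates.
    features:   list of per-point feature lists, all of the same length.
    voxel_size: edge length of a cubic voxel (assumed uniform here).
    Returns {(ix, iy, iz): pooled_feature} for non-empty voxels only.
    """
    groups = defaultdict(list)
    for point, feat in zip(points, features):
        key = tuple(math.floor(c / voxel_size) for c in point)
        groups[key].append(feat)
    # zip(*feats) iterates channel by channel across the points of a voxel.
    return {key: [max(channel) for channel in zip(*feats)]
            for key, feats in groups.items()}
```

Storing only non-empty voxels mirrors the sparsity that sparse convolution exploits.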

Post-BEV feature extraction

  • 3D convolution is inefficient for sparse and irregular voxels
  • Suitable and efficient 3D detection networks are desirable
  • MV3D converts point cloud data into a BEV representation
  • Features of height, intensity, and density are obtained from points in the grid
  • Other works follow similar pattern to represent point cloud using statistics in a BEV grid
  • PointPillars introduces the concept of pillars and utilizes a simplified version of PointNet
  • PointPillars and its variants have high efficiency and are suitable for industrial applications
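
Representing a point cloud by per-cell statistics in a BEV grid can be sketched as follows. The exact statistics are an assumption for illustration: this version keeps the maximum height, the intensity of the highest point, and the raw point count per cell, whereas published methods typically normalize the density and may split height into slices. The name `bev_statistics` is hypothetical.

```python
import math


def bev_statistics(points, cell_size):
    """Summarize a point cloud on a BEV grid.

    points:    list of (x, y, z, intensity).
    cell_size: edge length of a square BEV cell.
    Returns {(ix, iy): (max_height, intensity_of_highest_point, count)}.
    """
    grid = {}
    for x, y, z, intensity in points:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        if key not in grid:
            grid[key] = (z, intensity, 1)
            continue
        height, inten, count = grid[key]
        if z > height:
            # A higher point replaces both the height and intensity entries.
            height, inten = z, intensity
        grid[key] = (height, inten, count + 1)
    return grid
```

Because the result is a 2D grid of fixed-size feature vectors, it can be fed to an ordinary 2D convolutional network.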

Discussion

  • Images and point clouds are in different coordinate systems.
  • Point clouds can be projected onto image coordinates, but the sparse nature of point clouds makes it difficult to extract features.
  • Images in perspective view can be transformed into 3D space, but lack of depth information makes it an ill-posed problem.
  • BEV provides a unified representation for multisensor and temporal fusion.
  • Ego-motion information can be used to compensate for temporal fusion in BEV space.

BEV fusion

LiDAR-camera fusion

  • BEVFusion [5,91] explores fusion in BEV from different directions
  • BEVFusion [5] projects camera features into BEV and fuses them with LiDAR BEV features
  • BEVFusion [91] encodes camera and LiDAR features into the same BEV space while keeping the two streams independent
  • UVTR [121] represents input modalities in modal-specific voxel spaces and conducts cross-modality interaction

Temporal fusion

  • Temporal information is important for recognizing objects and occlusions.
  • BEV provides a connection between scene representations in different timestamps.
  • Works use ego-motion to align previous BEV features to current coordinates.
  • Attention module is used to fuse temporal information from previous BEV feature maps and frames.
  • Ego-motion information is used to correct locations for the attention module.
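
Aligning a previous BEV location to the current ego frame can be sketched as a 2D rigid transform. This is a minimal illustration under assumptions: motion is planar, the current ego pose is given relative to the previous ego frame as a translation and a yaw angle, and the name `prev_to_current` is hypothetical.

```python
import math


def prev_to_current(point, translation, yaw):
    """Express a BEV point from the previous ego frame in the current one.

    point:       (x, y) in the previous ego frame.
    translation: (tx, ty) position of the current ego origin, given in the
                 previous ego frame.
    yaw:         heading of the current ego frame relative to the previous
                 one, in radians (counter-clockwise positive).
    """
    dx = point[0] - translation[0]
    dy = point[1] - translation[1]
    c, s = math.cos(yaw), math.sin(yaw)
    # Apply the inverse rotation R(yaw)^T to the translated point.
    return (c * dx + s * dy, -s * dx + c * dy)
```

Applying this to every cell center of the previous BEV feature map tells a sampler or an attention module where to read the previous features for each current cell.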

Industrial design of BEV perception

  • BEV perception is trending in the industry
  • There are two typical paradigms for sensor fusion in industrial applications
  • Before BEV perception, most autonomous driving companies built their perception systems on perspective view inputs
  • BEV based methods use neural networks for 2D to 3D transformation
  • Fig. 6 summarizes various BEV perception architectures proposed by corporations
  • BEV fusion architectures follow the pipeline in Fig. 5b

Input data

  • BEV based perception algorithms use multiple data modalities, including camera, LiDAR, Radar, IMU and GPS.
  • Camera and LiDAR are the main perception sensors for autonomous driving.
  • Some products use camera only, while others use a combination of camera and LiDAR.
  • IMU and GPS signals are often used for sensor fusion plans.

Feature extractor

  • Feature extractor transforms raw data into feature representations
  • Feature extractor consists of backbone and neck
  • Examples of backbones: ResNet, RegNet
  • Examples of necks: FPN, BiFPN
  • Backbones for point cloud input: pillar based option, voxel based choice

PV to BEV transformation

  • Fixed IPM projects PV features to BEV space, but is sensitive to vehicle jolting and road flatness
  • Adaptive IPM is robust to vehicle pose, but still assumes flat ground
  • Transformer based BEV transformation is data driven and widely adopted
  • ViDAR uses pixel-level depth to project PV feature to BEV space

Fusion module

  • Alignment of the different camera sources is achieved in the BEV transformation module
  • The fusion unit aggregates BEV features from camera and LiDAR
  • Features from different modalities are integrated into one unified form

Temporal & spatial module

  • Features can be stacked temporally and spatially to create a feature queue.
  • Features can be fused into a spatial-temporal BEV feature, which is robust to occlusion.
  • Aggregation module can be 3D convolution, RNN or Transformer.
  • Feature map surrounding ego vehicle can be maintained and updated locally.

Prediction head

  • Multi-head design is widely used in BEV perception
  • BEV feature aggregates information from all sensors
  • 3D detection results are decoded from BEV feature space
  • PV results are decoded from PV features
  • Prediction results can be classified into three categories: low level, entity level, and structure level

Empirical evaluation and recipe

  • Bag of tricks and useful practices can be used to achieve top results on various benchmarks.
  • BEVFormer++ and Voxel-SPVCNN are two examples of this.

Data augmentation

BEV camera (camera-only) detection

  • Common data augmentations used for 2D recognition tasks can be applied to camera based BEV perception.
  • Augmentations can be divided into static (color variation) and spatial (moving pixels).
  • Common augmentations used in recent work include color jitter, flip, multi-scale resize, rotation, crop and grid mask.
  • BEVFormer++ uses color jitter, flip, multi-scale resize and grid mask.
  • Images can be flipped in two ways: flipping image, ground truth and camera parameters, or flipping the whole 3D space symmetrically.
  • Data augmentation is important for improving 3D model performance.
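
The second flipping strategy, mirroring the whole 3D space, can be sketched for the ground-truth boxes as follows. The box layout `(x, y, z, w, l, h, yaw)` and the ego frame convention (x forward, y to the left, yaw measured about the vertical axis) are assumptions made for the example, as is the name `flip_box_y`. A full implementation would apply the same mirroring to the BEV features or camera parameters so that the inputs and labels stay consistent.

```python
def flip_box_y(box):
    """Mirror a 3D box across the x-z plane of the ego frame.

    box: (x, y, z, w, l, h, yaw). Mirroring negates the lateral coordinate
    and the heading angle; the box dimensions are unchanged.
    """
    x, y, z, w, l, h, yaw = box
    return (x, -y, z, w, l, h, -yaw)
```

Negating the yaw follows from mirroring the heading vector `(cos(yaw), sin(yaw))` to `(cos(yaw), -sin(yaw))`.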

LiDAR segmentation

  • Data augmentation can be used in segmentation tasks, including random rotation, scaling, flipping, and point translation
  • Painting can be used to enhance point cloud data with image information
  • Temporal information can be used to improve model performance
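
The geometric augmentations listed above (random rotation, scaling, flipping, and point translation) can be sketched for a raw point cloud as follows. The parameter ranges, the order of operations, and the name `augment_points` are assumptions chosen for illustration rather than settings from the paper. One set of random parameters is drawn per call and applied to every point, so per-point segmentation labels remain valid.

```python
import math
import random


def augment_points(points, rng, max_angle=math.pi,
                   scale_range=(0.95, 1.05), max_shift=0.1):
    """Apply one random flip, rotation, scaling and translation to a cloud.

    points: list of (x, y, z).
    rng:    a random.Random instance, passed in for reproducibility.
    """
    angle = rng.uniform(-max_angle, max_angle)   # rotation about the z axis
    scale = rng.uniform(*scale_range)            # global scaling factor
    flip = rng.random() < 0.5                    # mirror across the x-z plane
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in points:
        if flip:
            y = -y
        x, y = c * x - s * y, s * x + c * y
        out.append((scale * x + shift[0],
                    scale * y + shift[1],
                    scale * z + shift[2]))
    return out
```

A typical call would be `augment_points(cloud, random.Random(0))`, with a fresh seed or a shared generator per training sample.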

BEV encoder

BEV camera: BEVFormer++

  • BEVFormer++ has multiple encoder layers with tailored designs
  • BEV queries are grid-shaped learnable parameters used to query features in BEV space
  • Spatial cross-attention and temporal self-attention are attention layers used to look up and aggregate features
  • Inference involves feeding multi-camera images to the backbone network and preserving BEV features from prior timestamp
  • Encoder layers generate refined BEV features
  • 3D detection head and map segmentation head predict perception results
  • 2D feature extractor, view transformation, and temporal BEV fusion are important for feature quality

BEV LiDAR: Voxel-SPVCNN

  • Existing 3D perception models are not suitable for recognizing small instances.
  • SPVCNN uses Minkowski U-Net in the voxel-based branch and an extra point-based branch without downsampling.
  • Voxel-SPVCNN is more efficient and brings an improvement of 1.1 mIoU.

3D detection head in BEVFormer++

  • BEVFormer++ uses three detection heads to cover three categories of detector design
  • Different types of detector heads are chosen to leverage detection frameworks in different scenarios
  • A DETR decoder is used with Smooth L1 loss; FreeAnchor and CenterPoint heads are also used
  • Ablation study shows different heads perform differently under various settings

Test-time augmentation (TTA)

BEV camera-only detection

  • BEV detection removes the burden of multi-camera object level fusion.
  • Duplicate features are likely to be sampled on different BEV locations along a light ray to camera center.
  • Leveraging 2D detection results for duplicate removal on 3D detection results can improve 3D detection performance.

LiDAR segmentation

  • Most misclassification occurs within similar classes
  • Post-processing techniques can improve mIoU
  • Existing segmentation methods do not enforce consistent predictions within a single object
  • Object-level refinement is conducted to improve object-level integrity
  • The object-level classification is verified by a lightweight classification network
  • Time consistency of prediction is refined by tracking

Loss

LiDAR segmentation

  • Geo loss is used to train models and has a strong response to voxels with rich details.
  • Lovász loss is used to mitigate class imbalance and improves model performance by 0.6 mIoU.

Ensemble

Post-processing

Conclusion

  • The survey reviews BEV perception research from recent years
  • Grand challenges and future endeavors include: more accurate depth estimator, better feature representation fusion, parameter-free network, and incorporating successful knowledge
  • Monocular camera-based 3D object detection uses RGB image and attempts to predict 3D location and category
  • LiDAR-based 3D object detection and segmentation uses point clouds to capture geometry information
  • Sensor-fusion-based 3D object detection combines data from multiple sensors
  • Point-based methods process raw point cloud data for feature extraction
  • LiDAR-based methods outperform camera-based methods due to depth prior
  • Outdoor segmentation models are designed for more imbalanced point distributions
  • Autonomous vehicles use cameras, LiDAR, and Radar, each with advantages and disadvantages
  • Sensor fusion pushes the performance upper bound of the perception system
  • Fusion methods include early fusion, middle fusion, and late fusion