Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Learning powerful representations in BEV for perception tasks is gaining attention from industry and academia.
- Conventional approaches for autonomous driving algorithms use front or perspective view.
- BEV perception has advantages such as representing scenes intuitively and fusion-friendly.
- Core problems for BEV perception include reconstructing 3D information, acquiring ground truth annotations, formulating pipelines, and adapting algorithms.
- This survey reviews recent work on BEV perception, provides a practical guidebook, and points out future research directions.
Paper Content
Introduction
- Perception recognition task in autonomous driving is a 3D geometry reconstruction.
- Representing features from different views in a unified perspective is important.
- Bird’s-eye-view is a natural and straightforward candidate view.
- BEV representation has no occlusion or scale problem.
- BEV Perception is a vision algorithm in the BEV view representation for autonomous driving.
Big picture at a glance
- BEV perception research is divided into three parts: BEV camera, BEV LiDAR and BEV fusion
- BEV perception is a general task built on top of a series of fundamental tasks
- Different combinations of sensor input, fundamental task and product scenario can indicate a certain BEV perception algorithm
Motivation to bev perception research
Significance.
- BEV perception has potential to have real and meaningful impact on academia and society
- Performance gap between camera/vision and LiDAR/fusion based solutions is over 20-30%
- Academic perspective: understanding view transformation from 2D to 3D
- Industrial perspective: cheaper and accurate deployment of software algorithms
- BEV representation is one of the best candidates for LiDAR based methods
Space.
- BEV perception requires learning a robust and generalizable feature representation from both camera and LiDAR inputs
- Depth estimation from raw sensor inputs is difficult, especially for the camera branch
- Fusing features from multi-modality input is key and leaves space to innovate
Contributions
- Reviewed BEV perception research in recent years
- Analyzed BEV perception literature
- Provided practical cookbook for improving performance in BEV perception tasks
Background in 3d perception
- Conventional approaches for 3D perception tasks include monocular camera based 3D object detection, LiDAR based 3D object detection and segmentation, and sensor fusion strategies.
- Predominant datasets in 3D perception include KITTI, nuScenes and Waymo Open Dataset.
Task definition and related work
- Monocular camera-based object detection uses an RGB image to predict 3D location and category of objects
- LiDAR-based methods use a set of points in 3D space to capture geometry information of objects and outperform camera-based methods
- Sensor fusion combines data from different sensors (camera, LiDAR, Radar) to improve performance of perception system
Datasets and metrics
- Autonomous driving datasets and evaluation metrics are introduced
- Datasets consist of various scenes of different lengths
- 3D bounding box and 3D segmentation annotation are essential
- HD-Map configuration is a mainstream trend
- Multiple modes and various annotations are required
Evaluation metrics
- LET-3D-APL is a metric used in camera-only 3D detection instead of 3D-AP.
- LET-3D-APL penalizes longitudinal localization errors by scaling precision using localization affinity.
- mAP is similar to AP metric in 2D object detection, but matching strategy is replaced with 2D center distance on BEV plane.
Nds.
- NDS is a combination of several metrics
- NDS is computed by using the weightsum of the metrics, with mAP having a weight of 5 and the rest having a weight of 1
Methodology of bev perception
- BEV perception is divided into three settings based on input modality
- Tab. 2 summarizes the taxonomy of BEV perception literature
- Tab. 3 depicts the performance gain of 3D object detection and segmentation on popular leaderboards over the years
Method
Bev camera
General pipeline
- IPM maps pixels onto the BEV plane using the intrinsic and extrinsic matrix of cameras
- LSS is the first method to predict depth distribution of image features using neural networks
- Other works develop different methods to conduct view transformation
- General pipeline for fusing image and point cloud data includes modal-specific feature extractors and temporal and ego-motion information
Methods
Modality
- Attempts to solve 3D pretraining have been made
- Recent success in 2D vision Transformer can be transferred to 3D space
- Future investigation in 3D perception is possible
View transformation
- Recent research has focused on view transformation module
- 3D information is constructed from either 2D feature or 3D prior assumption
- View transformation plays a vital role in camera-only 3D perception
- View transformation can be divided into two aspects: 2D-3D and 3D-2D
- 2D-3D method predicts depth distribution per grid on 2D feature
- 3D-2D method projects 2D feature to voxel space
- Stereo methods use strong prior to obtain depth value/distribution
- 3D-2D method originated from Inverse Perspective Mapping (IPM)
- Cross-attention mechanism in transformer architecture models 3D-2D projection
- Grid sampler accelerates 3D-2D view transformation
Discussion on bev and perspective methods
- Camera-only 3D perception focuses on predicting 3D object localization from perspective view
- BEV representation is used to tackle the problem of objects with the same size in 3D space having different sizes on image plane
- Recent BEV-based methods have been successful due to the nuScenes dataset, help from LiDAR-based methods, and long-term development of monocular methods
- BEV-based methods and perspective methods are two different ways to reconstruct 3D information from 2D images
Bev lidar
Pre-bev feature extraction
- Point-based methods process raw point cloud, voxel-based methods voxelize points into grids
- 3D convolution or 3D sparse convolution used to extract point cloud features
- Features of each point calculated by linear layer, batch normalization, and activation function
- Feature of voxel is element-wise max-pooling of all points
- 3D convolution applied to aggregate local voxel features
- Feature maps transformed into BEV and processed by RPN to generate object proposals
- SECOND introduces sparse convolution to reduce training and inference speed
- CenterPoint is a powerful center-based anchor-free 3D detector
- PV-RCNN combines point and voxel branches to learn more discriminative point cloud features
- SA-SSD, Voxel R-CNN, Object DGCNN, VoTr, and SST all use different methods to process voxel features
- AFDetV2 formulates a single-stage anchor-free network
Post-bev feature extraction
- 3D convolution is inefficient for sparse and irregular voxels
- Suitable and efficient 3D detection networks are desirable
- MV3D converts point cloud data into a BEV representation
- Features of height, intensity, and density are obtained from points in the grid
- Other works follow similar pattern to represent point cloud using statistics in a BEV grid
- PointPillars introduces the concept of pillars and utilizes a simplified version of PointNet
- PointPillars and its variants have high efficiency and are suitable for industrial applications
Discussion
- Images and point clouds are in different coordinate systems.
- Point clouds can be projected onto image coordinates, but the sparse nature of point clouds makes it difficult to extract features.
- Images in perspective view can be transformed into 3D space, but lack of depth information makes it an ill-posed problem.
- BEV provides a unified representation for multisensor and temporal fusion.
- Ego-motion information can be used to compensate for temporal fusion in BEV space.
Bev fusion
Lidar-camera fusion
- BEVFusion [5,91] explores fusion in BEV from different directions
- BEVFusion [5] projects camera features into BEV and fuses with lidar BEV features
- BEV-Fusion [91] encodes camera and lidar features into same BEV to ensure independence
- UVTR [121] represents input modalities in modal-specific voxel spaces and conducts cross-modality interaction
Temporal fusion
- Temporal information is important for recognizing objects and occlusions.
- BEV provides a connection between scene representations in different timestamps.
- Works use ego-motion to align previous BEV features to current coordinates.
- Attention module is used to fuse temporal information from previous BEV feature maps and frames.
- Ego-motion information is used to correct locations for the attention module.
Industrial design of bev perception
- BEV perception is trending in the industry
- Two typical paradigms for sensor fusion in industrial applications
- Most autonomous driving companies used perspective view inputs
- BEV based methods use neural networks for 2D to 3D transformation
- Fig. 6 summarizes various BEV perception architectures proposed by corporations
- BEV fusion architectures follow the pipeline in Fig. 5b
Input data
- BEV based perception algorithms use multiple data modalities, including camera, LiDAR, Radar, IMU and GPS.
- Camera and LiDAR are the main perception sensors for autonomous driving.
- Some products use camera only, while others use a combination of camera and LiDAR.
- IMU and GPS signals are often used for sensor fusion plans.
Feature extractor
- Feature extractor transforms raw data into feature representations
- Feature extractor consists of backbone and neck
- Examples of backbones: ResNet, RegNet
- Examples of necks: FPN, BiFPN
- Backbones for point cloud input: pillar based option, voxel based choice
Pv to bev transformation
- Fixed IPM projects PV features to BEV space, but is sensitive to vehicle jolting and road flatness
- Adaptive IPM is robust to vehicle pose, but still assumes flat ground
- Transformer based BEV transformation is data driven and widely adopted
- ViDAR uses pixel-level depth to project PV feature to BEV space
Fusion module
- Alignment of camera sources achieved in BEV transformation module
- Fusion unit aggregates BEV features from camera and LiDAR
- Features from different modalities integrated into one unified form
Temporal & spatial module
- Features can be stacked temporally and spatially to create a feature queue.
- Features can be fused into a spatial-temporal BEV feature, which is robust to occlusion.
- Aggregation module can be 3D convolution, RNN or Transformer.
- Feature map surrounding ego vehicle can be maintained and updated locally.
Prediction head
- Multi-head design is widely used in BEV perception
- BEV feature aggregates information from all sensors
- 3D detection results are decoded from BEV feature space
- PV results are decoded from PV features
- Prediction results can be classified into three categories: low level, entity level, and structure level
Empirical evaluation and recipe
- Bag of tricks and useful practices can be used to achieve top results on various benchmarks.
- BEVFormer++ and Voxel-SPVCNN are two examples of this.
Data augmentation
Bev camera (camera-only) detection
- Common data augmentations used for 2D recognition tasks can be applied to camera based BEV perception.
- Augmentations can be divided into static (color variation) and spatial (moving pixels).
- Common augmentations used in recent work include color jitter, flip, multi-scale resize, rotation, crop and grid mask.
- BEVFormer++ uses color jitter, flip, multi-scale resize and grid mask.
- Images can be flipped in two ways: flipping image, ground truth and camera parameters, or flipping the whole 3D space symmetrically.
- Data augmentation is important for improving 3D model performance.
Lidar segmentaion
- Data augmentation can be used in segmentation tasks, including random rotation, scaling, flipping, and point translation
- Painting can be used to enhance point cloud data with image information
- Temporal information can be used to improve model performance
Bev encoder
Bev camera: bevformer++
- BEVFormer++ has multiple encoder layers with tailored designs
- BEV queries are grid-shaped learnable parameters used to query features in BEV space
- Spatial cross-attention and temporal self-attention are attention layers used to lookup and aggregate features
- Inference involves feeding multi-camera images to the backbone network and preserving BEV features from prior timestamp
- Encoder layers generate refined BEV features
- 3D detection head and map segmentation head predict perception results
- 2D feature extractor, view transformation, and temporal BEV fusion are important for feature quality
Bev lidar: voxel-spvcnn
- Existing 3D perception models are not suitable for recognizing small instances.
- SPVCNN uses Minkowski U-Net in the voxel-based branch and an extra point-based branch without downsampling.
- Voxel-SPVCNN is more efficient and brings an improvement of 1.1 mIoU.
3d detection head in bevformer++
- BEVFormer++ uses three detection heads to cover three categories of detector design
- Different types of detector heads are chosen to leverage detection frameworks in different scenarios
- DETR decoder is used with Smooth L1 loss, FreeAnchor and Center-Point are also used
- Ablation study shows different heads perform differently under various settings
Test-time augmentation (tta)
Bev camera-only detection
- BEV detection removes the burden of multi-camera object level fusion.
- Duplicate features are likely to be sampled on different BEV locations along a light ray to camera center.
- Leveraging 2D detection results for duplicate removal on 3D detection results can improve 3D detection performance.
Lidar segmentation
- Most misclassification occurs within similar classes
- Post-processing techniques can improve mIoU
- Existing segmentation methods do not consider consistency of single object
- Object-level refinement is conducted to improve object-level integrity
- Justification of object-level classification is performed by a lightweight classification network
- Time consistency of prediction is refined by tracking
Loss
Lidar segmentation
- Geo loss is used to train models and has a strong response to voxels with rich details.
- Lovász loss is used to mitigate class imbalance and improves model performance by 0.6 mIoU.
Ensemble
Post-processing
Conlusion
- BEV perception has been reviewed in recent years
- Grand challenges and future endeavors include: more accurate depth estimator, better feature representation fusion, parameter-free network, and incorporating successful knowledge
- Monocular camera-based 3D object detection uses RGB image and attempts to predict 3D location and category
- LiDAR-based 3D object detection and segmentation uses point clouds to capture geometry information
- Sensor-fusion-based 3D object detection combines data from multiple sensors
- Point-based methods process raw point cloud data for feature extraction
- LiDAR-based methods outperform camera-based methods due to depth prior
- Outdoor segmentation models are designed for more imbalance point distribution
- Autonomous vehicles use cameras, LiDAR, and Radar, each with advantages and disadvantages
- Sensor fusion pushes the performance upper bound of the perception system
- Fusion methods include early fusion, middle fusion, and late fusion