Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Learning powerful representations in BEV for perception tasks is gaining attention from industry and academia.
Conventional approaches for autonomous driving algorithms use front or perspective view.
BEV perception has advantages such as representing scenes intuitively and fusion-friendly.
Core problems for BEV perception include reconstructing 3D information, acquiring ground truth annotations, formulating pipelines, and adapting algorithms.
This survey reviews recent work on BEV perception, provides a practical guidebook, and points out future research directions.

Paper Content

Introduction

Perception recognition task in autonomous driving is a 3D geometry reconstruction.
Representing features from different views in a unified perspective is important.
Bird’s-eye-view is a natural and straightforward candidate view.
BEV representation has no occlusion or scale problem.
BEV Perception is a vision algorithm in the BEV view representation for autonomous driving.

Big picture at a glance

BEV perception research is divided into three parts: BEV camera, BEV LiDAR and BEV fusion
BEV perception is a general task built on top of a series of fundamental tasks
Different combinations of sensor input, fundamental task and product scenario can indicate a certain BEV perception algorithm

Motivation to bev perception research

Significance.

BEV perception has potential to have real and meaningful impact on academia and society
Performance gap between camera/vision and LiDAR/fusion based solutions is over 20-30%
Academic perspective: understanding view transformation from 2D to 3D
Industrial perspective: cheaper and accurate deployment of software algorithms
BEV representation is one of the best candidates for LiDAR based methods

Space.

BEV perception requires learning a robust and generalizable feature representation from both camera and LiDAR inputs
Depth estimation from raw sensor inputs is difficult, especially for the camera branch
Fusing features from multi-modality input is key and leaves space to innovate

Contributions

Reviewed BEV perception research in recent years
Analyzed BEV perception literature
Provided practical cookbook for improving performance in BEV perception tasks

Background in 3d perception

Conventional approaches for 3D perception tasks include monocular camera based 3D object detection, LiDAR based 3D object detection and segmentation, and sensor fusion strategies.
Predominant datasets in 3D perception include KITTI, nuScenes and Waymo Open Dataset.

Monocular camera-based object detection uses an RGB image to predict 3D location and category of objects
LiDAR-based methods use a set of points in 3D space to capture geometry information of objects and outperform camera-based methods
Sensor fusion combines data from different sensors (camera, LiDAR, Radar) to improve performance of perception system

Datasets and metrics

Autonomous driving datasets and evaluation metrics are introduced
Datasets consist of various scenes of different lengths
3D bounding box and 3D segmentation annotation are essential
HD-Map configuration is a mainstream trend
Multiple modes and various annotations are required

Evaluation metrics

LET-3D-APL is a metric used in camera-only 3D detection instead of 3D-AP.
LET-3D-APL penalizes longitudinal localization errors by scaling precision using localization affinity.
mAP is similar to AP metric in 2D object detection, but matching strategy is replaced with 2D center distance on BEV plane.

Nds.

NDS is a combination of several metrics
NDS is computed by using the weightsum of the metrics, with mAP having a weight of 5 and the rest having a weight of 1

Methodology of bev perception

BEV perception is divided into three settings based on input modality
Tab. 2 summarizes the taxonomy of BEV perception literature
Tab. 3 depicts the performance gain of 3D object detection and segmentation on popular leaderboards over the years

Method

Bev camera

General pipeline

IPM maps pixels onto the BEV plane using the intrinsic and extrinsic matrix of cameras
LSS is the first method to predict depth distribution of image features using neural networks
Other works develop different methods to conduct view transformation
General pipeline for fusing image and point cloud data includes modal-specific feature extractors and temporal and ego-motion information

Methods

Modality

Attempts to solve 3D pretraining have been made
Recent success in 2D vision Transformer can be transferred to 3D space
Future investigation in 3D perception is possible

View transformation

Recent research has focused on view transformation module
3D information is constructed from either 2D feature or 3D prior assumption
View transformation plays a vital role in camera-only 3D perception
View transformation can be divided into two aspects: 2D-3D and 3D-2D
2D-3D method predicts depth distribution per grid on 2D feature
3D-2D method projects 2D feature to voxel space
Stereo methods use strong prior to obtain depth value/distribution
3D-2D method originated from Inverse Perspective Mapping (IPM)
Cross-attention mechanism in transformer architecture models 3D-2D projection
Grid sampler accelerates 3D-2D view transformation

Discussion on bev and perspective methods

Camera-only 3D perception focuses on predicting 3D object localization from perspective view
BEV representation is used to tackle the problem of objects with the same size in 3D space having different sizes on image plane
Recent BEV-based methods have been successful due to the nuScenes dataset, help from LiDAR-based methods, and long-term development of monocular methods
BEV-based methods and perspective methods are two different ways to reconstruct 3D information from 2D images

Bev lidar

Pre-bev feature extraction

Point-based methods process raw point cloud, voxel-based methods voxelize points into grids
3D convolution or 3D sparse convolution used to extract point cloud features
Features of each point calculated by linear layer, batch normalization, and activation function
Feature of voxel is element-wise max-pooling of all points
3D convolution applied to aggregate local voxel features
Feature maps transformed into BEV and processed by RPN to generate object proposals
SECOND introduces sparse convolution to reduce training and inference speed
CenterPoint is a powerful center-based anchor-free 3D detector
PV-RCNN combines point and voxel branches to learn more discriminative point cloud features
SA-SSD, Voxel R-CNN, Object DGCNN, VoTr, and SST all use different methods to process voxel features
AFDetV2 formulates a single-stage anchor-free network

Post-bev feature extraction

3D convolution is inefficient for sparse and irregular voxels
Suitable and efficient 3D detection networks are desirable
MV3D converts point cloud data into a BEV representation
Features of height, intensity, and density are obtained from points in the grid
Other works follow similar pattern to represent point cloud using statistics in a BEV grid
PointPillars introduces the concept of pillars and utilizes a simplified version of PointNet
PointPillars and its variants have high efficiency and are suitable for industrial applications

Discussion

Images and point clouds are in different coordinate systems.
Point clouds can be projected onto image coordinates, but the sparse nature of point clouds makes it difficult to extract features.
Images in perspective view can be transformed into 3D space, but lack of depth information makes it an ill-posed problem.
BEV provides a unified representation for multisensor and temporal fusion.
Ego-motion information can be used to compensate for temporal fusion in BEV space.

Bev fusion

Lidar-camera fusion

BEVFusion [5,91] explores fusion in BEV from different directions
BEVFusion [5] projects camera features into BEV and fuses with lidar BEV features
BEV-Fusion [91] encodes camera and lidar features into same BEV to ensure independence
UVTR [121] represents input modalities in modal-specific voxel spaces and conducts cross-modality interaction

Temporal fusion

Temporal information is important for recognizing objects and occlusions.
BEV provides a connection between scene representations in different timestamps.
Works use ego-motion to align previous BEV features to current coordinates.
Attention module is used to fuse temporal information from previous BEV feature maps and frames.
Ego-motion information is used to correct locations for the attention module.

Industrial design of bev perception

BEV perception is trending in the industry
Two typical paradigms for sensor fusion in industrial applications
Most autonomous driving companies used perspective view inputs
BEV based methods use neural networks for 2D to 3D transformation
Fig. 6 summarizes various BEV perception architectures proposed by corporations
BEV fusion architectures follow the pipeline in Fig. 5b

Input data

BEV based perception algorithms use multiple data modalities, including camera, LiDAR, Radar, IMU and GPS.
Camera and LiDAR are the main perception sensors for autonomous driving.
Some products use camera only, while others use a combination of camera and LiDAR.
IMU and GPS signals are often used for sensor fusion plans.

Feature extractor

Feature extractor transforms raw data into feature representations
Feature extractor consists of backbone and neck
Examples of backbones: ResNet, RegNet
Examples of necks: FPN, BiFPN
Backbones for point cloud input: pillar based option, voxel based choice

Pv to bev transformation

Fixed IPM projects PV features to BEV space, but is sensitive to vehicle jolting and road flatness
Adaptive IPM is robust to vehicle pose, but still assumes flat ground
Transformer based BEV transformation is data driven and widely adopted
ViDAR uses pixel-level depth to project PV feature to BEV space

Fusion module

Alignment of camera sources achieved in BEV transformation module
Fusion unit aggregates BEV features from camera and LiDAR
Features from different modalities integrated into one unified form

Temporal & spatial module

Features can be stacked temporally and spatially to create a feature queue.
Features can be fused into a spatial-temporal BEV feature, which is robust to occlusion.
Aggregation module can be 3D convolution, RNN or Transformer.
Feature map surrounding ego vehicle can be maintained and updated locally.

Prediction head

Multi-head design is widely used in BEV perception
BEV feature aggregates information from all sensors
3D detection results are decoded from BEV feature space
PV results are decoded from PV features
Prediction results can be classified into three categories: low level, entity level, and structure level

Empirical evaluation and recipe

Bag of tricks and useful practices can be used to achieve top results on various benchmarks.
BEVFormer++ and Voxel-SPVCNN are two examples of this.

Data augmentation

Bev camera (camera-only) detection

Common data augmentations used for 2D recognition tasks can be applied to camera based BEV perception.
Augmentations can be divided into static (color variation) and spatial (moving pixels).
Common augmentations used in recent work include color jitter, flip, multi-scale resize, rotation, crop and grid mask.
BEVFormer++ uses color jitter, flip, multi-scale resize and grid mask.
Images can be flipped in two ways: flipping image, ground truth and camera parameters, or flipping the whole 3D space symmetrically.
Data augmentation is important for improving 3D model performance.

Lidar segmentaion

Data augmentation can be used in segmentation tasks, including random rotation, scaling, flipping, and point translation
Painting can be used to enhance point cloud data with image information
Temporal information can be used to improve model performance

Bev encoder

Bev camera: bevformer++

BEVFormer++ has multiple encoder layers with tailored designs
BEV queries are grid-shaped learnable parameters used to query features in BEV space
Spatial cross-attention and temporal self-attention are attention layers used to lookup and aggregate features
Inference involves feeding multi-camera images to the backbone network and preserving BEV features from prior timestamp
Encoder layers generate refined BEV features
3D detection head and map segmentation head predict perception results
2D feature extractor, view transformation, and temporal BEV fusion are important for feature quality

Bev lidar: voxel-spvcnn

Existing 3D perception models are not suitable for recognizing small instances.
SPVCNN uses Minkowski U-Net in the voxel-based branch and an extra point-based branch without downsampling.
Voxel-SPVCNN is more efficient and brings an improvement of 1.1 mIoU.

3d detection head in bevformer++

BEVFormer++ uses three detection heads to cover three categories of detector design
Different types of detector heads are chosen to leverage detection frameworks in different scenarios
DETR decoder is used with Smooth L1 loss, FreeAnchor and Center-Point are also used
Ablation study shows different heads perform differently under various settings

Test-time augmentation (tta)

Bev camera-only detection

BEV detection removes the burden of multi-camera object level fusion.
Duplicate features are likely to be sampled on different BEV locations along a light ray to camera center.
Leveraging 2D detection results for duplicate removal on 3D detection results can improve 3D detection performance.

Lidar segmentation

Most misclassification occurs within similar classes
Post-processing techniques can improve mIoU
Existing segmentation methods do not consider consistency of single object
Object-level refinement is conducted to improve object-level integrity
Justification of object-level classification is performed by a lightweight classification network
Time consistency of prediction is refined by tracking

Loss

Lidar segmentation

Geo loss is used to train models and has a strong response to voxels with rich details.
Lovász loss is used to mitigate class imbalance and improves model performance by 0.6 mIoU.

Ensemble

Post-processing

Conlusion

BEV perception has been reviewed in recent years
Grand challenges and future endeavors include: more accurate depth estimator, better feature representation fusion, parameter-free network, and incorporating successful knowledge
Monocular camera-based 3D object detection uses RGB image and attempts to predict 3D location and category
LiDAR-based 3D object detection and segmentation uses point clouds to capture geometry information
Sensor-fusion-based 3D object detection combines data from multiple sensors
Point-based methods process raw point cloud data for feature extraction
LiDAR-based methods outperform camera-based methods due to depth prior
Outdoor segmentation models are designed for more imbalance point distribution
Autonomous vehicles use cameras, LiDAR, and Radar, each with advantages and disadvantages
Sensor fusion pushes the performance upper bound of the perception system
Fusion methods include early fusion, middle fusion, and late fusion

Link to paper#

Abstract#

Paper Content#

Introduction#

Big picture at a glance#

Motivation to bev perception research#

Significance.#

Space.#

Contributions#

Background in 3d perception#

Task definition and related work#

Datasets and metrics#

Evaluation metrics#

Nds.#

Methodology of bev perception#

Method#

Bev camera#

General pipeline#

Methods#

Modality#

View transformation#

Discussion on bev and perspective methods#

Bev lidar#

Pre-bev feature extraction#

Post-bev feature extraction#

Discussion#

Bev fusion#

Lidar-camera fusion#

Temporal fusion#

Industrial design of bev perception#

Input data#

Feature extractor#

Pv to bev transformation#

Fusion module#

Temporal & spatial module#

Prediction head#

Empirical evaluation and recipe#

Data augmentation#

Bev camera (camera-only) detection#

Lidar segmentaion#

Bev encoder#

Bev camera: bevformer++#

Bev lidar: voxel-spvcnn#

3d detection head in bevformer++#

Test-time augmentation (tta)#

Bev camera-only detection#

Lidar segmentation#

Loss#

Lidar segmentation#

Ensemble#

Post-processing#

Conlusion#

Link to paper

Abstract

Paper Content

Introduction

Big picture at a glance

Motivation to bev perception research

Significance.

Space.

Contributions

Background in 3d perception

Task definition and related work

Datasets and metrics

Evaluation metrics

Nds.

Methodology of bev perception

Method

Bev camera

General pipeline

Methods

Modality

View transformation

Discussion on bev and perspective methods

Bev lidar

Pre-bev feature extraction

Post-bev feature extraction

Discussion

Bev fusion

Lidar-camera fusion

Temporal fusion

Industrial design of bev perception

Input data

Feature extractor

Pv to bev transformation

Fusion module

Temporal & spatial module

Prediction head

Empirical evaluation and recipe

Data augmentation

Bev camera (camera-only) detection

Lidar segmentaion

Bev encoder

Bev camera: bevformer++

Bev lidar: voxel-spvcnn

3d detection head in bevformer++

Test-time augmentation (tta)

Bev camera-only detection

Lidar segmentation

Loss

Lidar segmentation

Ensemble

Post-processing

Conlusion