Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- 3D object detectors usually rely on hand-crafted proxies
- Propose VoxelNext for fully sparse 3D object detection
- Predicts objects directly based on sparse voxel features
- Elegant and efficient framework, no need for sparse-to-dense conversion or NMS post-processing
- Better speed-accuracy trade-off than other mainframe detectors on nuScenes dataset
Paper Content
Introduction
- 3D perception is a fundamental component in autonomous driving systems
- 3D detection networks take sparse point clouds or voxels as input
- Recent 3D object detectors use sparse convolutional networks for feature extraction
- Anchors and centers are used for prediction
- Mainstream detectors convert 3D sparse features to 2D dense features
- VoxelNeXt is a simple, efficient, and post-processing-free 3D object detector
- VoxelNeXt predicts 3D objects from voxel features with a fully sparse convolutional network
- VoxelNeXt is evaluated on three large-scale benchmarks and achieves leading performance with high efficiency
Related work
- 3D detectors work similarly to 2D counterparts
- Many approaches still use 2D dense convolutional heads
- VoxelNet uses PointNet for voxel feature encoding
- SECOND improves Voxel-Net with dense anchor-based head
- Other state-of-the-art methods use sparse-to-dense scheme
- CenterPoint predicts dense heatmap of center locations
- Sparse Detectors avoid dense detection heads
- Sparse CNNs are used for 3D deep learning
- Sparse CNNs have limited representation ability
- 3D object tracking models tracklets of multiple objects
Fully sparse voxel-based network
- Point clouds or voxels are scattered on the surface of 3D objects.
- Aim to predict 3D boxes directly from voxels instead of hand-crafted anchors or centers.
- Backbone adaptation, sparse head design, and 3D object tracking are introduced.
Sparse cnn backbone adaptation
- Feature representation with sufficient receptive fields is necessary for correct prediction on sparse voxel features.
- To enhance the plain sparse CNN backbone network, additional down-sampling layers are used.
- Features with strides {16, 32} are obtained for {F 5 , F 6 }.
- Receptive fields are enlarged and prediction is more accurate.
- Height compression is done in a fully sparse way.
Sparse prediction head
- Voxel Selection predicts scores of voxels for K classes
- During training, voxel nearest to annotated bounding box center is assigned as positive sample
- During inference, sparse max pooling is used to select voxels with spatially local maximums
- Bounding boxes are directly regressed from positive or selected sparse voxel features
- Predictions are supervised under L1 loss function during training
Experiments
- Ablated the effect of down-sampling layers in VoxelNeXt
- Extended it to variants Ds, where s denotes the number of down-sampling
- Without dense head, D3 suffers from performance drop
- Performance gradually increases from D3 to D5
- Added one more variant with 5x5x5 kernel size
- VoxelNeXt gradually drops redundant voxels according to feature magnitude
- Drop ratio set to 0.5 as default
- Ablated stages of voxel pruning
- Combined 3D backbone and 2D sparse prediction head for better efficiency
- Ablated effect of fully-connected layers or submanifold sparse convolutions to predict boxes in the sparse head
- Most boxes are predicted from voxels inside, not near centers
- Large gaps between ratios of different classes
- Compared VoxelNeXt to CenterPoint
- VoxelNeXt achieves 0.9% mAP and 1.0% NDS improvement
- VoxelNeXt has 4.9% less orientation error than CenterPoint
- Counted efficiency-related statistics of sparse CNN backbone network
- Ablated effect of sparse max pooling and NMS
- Ablated 3D tracking on nuScenes validation
- Evaluated detection models on nuScenes and Waymo test split
- Compared VoxelNeXt’s tracking performance with other methods on nuScenes and Argoverse2
- VoxelNeXt achieves leading performance among methods with high efficiency
- VoxelNeXt ranks 1st on nuScenes 3D tracking LIDAR benchmark
- Gap between theoretical FLOPs and actual inference speed
- VoxelNeXt has much smaller 38.7G FLOPs compared to 186.6G of CenterPoint