Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • 3D object detectors usually rely on hand-crafted proxies
  • Proposes VoxelNeXt for fully sparse 3D object detection
  • Predicts objects directly based on sparse voxel features
  • Elegant and efficient framework, no need for sparse-to-dense conversion or NMS post-processing
  • Better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset

Paper Content

Introduction

  • 3D perception is a fundamental component in autonomous driving systems
  • 3D detection networks take sparse point clouds or voxels as input
  • Recent 3D object detectors use sparse convolutional networks for feature extraction
  • Anchors and centers are used for prediction
  • Mainstream detectors convert 3D sparse features to 2D dense features
  • VoxelNeXt is a simple, efficient, and post-processing-free 3D object detector
  • VoxelNeXt predicts 3D objects from voxel features with a fully sparse convolutional network
  • VoxelNeXt is evaluated on three large-scale benchmarks and achieves leading performance with high efficiency
  • 3D detectors work similarly to 2D counterparts
  • Many approaches still use 2D dense convolutional heads
  • VoxelNet uses PointNet for voxel feature encoding
  • SECOND improves VoxelNet with a dense anchor-based head
  • Other state-of-the-art methods use sparse-to-dense scheme
  • CenterPoint predicts dense heatmap of center locations
  • Sparse detectors avoid dense detection heads
  • Sparse CNNs are used for 3D deep learning
  • Sparse CNNs have limited representation ability
  • 3D object tracking models tracklets of multiple objects

Fully sparse voxel-based network

  • Point clouds or voxels are scattered on the surface of 3D objects.
  • Aim to predict 3D boxes directly from voxels instead of hand-crafted anchors or centers.
  • Backbone adaptation, sparse head design, and 3D object tracking are introduced.

Sparse CNN backbone adaptation

  • Feature representation with sufficient receptive fields is necessary for correct prediction on sparse voxel features.
  • To enhance the plain sparse CNN backbone network, additional down-sampling layers are used.
  • Features with strides {16, 32} are obtained for {F5, F6}.
  • Receptive fields are enlarged and prediction is more accurate.
  • Height compression is done in a fully sparse way.
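The fully sparse height compression mentioned above can be illustrated with a minimal sketch. It assumes that voxels sharing the same (x, y) position are merged by summing their features after the z index is dropped; the summary does not state the merge rule, so the summation, the function name `sparse_height_compression`, and the list-based data layout are all illustrative assumptions rather than the authors' implementation.

```python
def sparse_height_compression(coords, feats):
    """Collapse the z axis of a sparse voxel set without building a dense grid.

    coords: list of (x, y, z) integer voxel indices (occupied voxels only)
    feats:  list of feature vectors (lists of floats), one per voxel
    Returns (coords_2d, feats_2d) with one entry per occupied (x, y) cell.
    Assumption: features that land on the same (x, y) cell are summed.
    """
    summed = {}
    for (x, y, _z), f in zip(coords, feats):
        key = (x, y)
        if key in summed:
            # Another voxel already projected onto this cell: accumulate.
            summed[key] = [a + b for a, b in zip(summed[key], f)]
        else:
            summed[key] = list(f)
    coords_2d = sorted(summed)  # deterministic output order
    feats_2d = [summed[c] for c in coords_2d]
    return coords_2d, feats_2d


# Two voxels stacked at (0, 0) and one isolated voxel at (2, 1).
coords = [(0, 0, 0), (0, 0, 3), (2, 1, 1)]
feats = [[1.0, 2.0], [0.5, 0.5], [3.0, 4.0]]
coords_2d, feats_2d = sparse_height_compression(coords, feats)
print(coords_2d, feats_2d)
```

Only occupied cells are ever visited, so the cost scales with the number of voxels rather than with the size of the full grid, which is the point of avoiding a sparse-to-dense conversion.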

Sparse prediction head

  • Voxel Selection predicts scores of voxels for K classes
  • During training, voxel nearest to annotated bounding box center is assigned as positive sample
  • During inference, sparse max pooling is used to select voxels with spatially local maximums
  • Bounding boxes are directly regressed from positive or selected sparse voxel features
  • Predictions are supervised under L1 loss function during training
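The two voxel-selection rules above (nearest voxel as the positive sample during training, local score maxima during inference) can be sketched as follows. This is a simplified, single-class illustration on 2D voxel indices; the function names, the squared Euclidean distance, and the window `radius` parameter are assumptions made for the example and are not taken from the paper.

```python
def nearest_voxel_index(voxel_xy, box_center_xy):
    """Training-time assignment: index of the occupied voxel closest to a box center.

    Assumption: closeness is measured by squared Euclidean distance in the x-y plane.
    """
    cx, cy = box_center_xy
    return min(
        range(len(voxel_xy)),
        key=lambda i: (voxel_xy[i][0] - cx) ** 2 + (voxel_xy[i][1] - cy) ** 2,
    )


def sparse_max_pool_select(voxel_xy, scores, radius=1):
    """Inference-time selection: keep voxels that are local score maxima.

    Only occupied voxels are inspected; empty cells in the
    (2 * radius + 1) x (2 * radius + 1) window are skipped.
    Ties are kept (a voxel is dropped only if a neighbor scores strictly higher).
    """
    score_at = dict(zip(voxel_xy, scores))
    keep = []
    for i, ((x, y), s) in enumerate(zip(voxel_xy, scores)):
        is_peak = True
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                neighbor = score_at.get((x + dx, y + dy))
                if neighbor is not None and neighbor > s:
                    is_peak = False
        if is_peak:
            keep.append(i)
    return keep


voxel_xy = [(0, 0), (1, 0), (5, 5), (6, 5)]
scores = [0.9, 0.4, 0.3, 0.8]
print(nearest_voxel_index(voxel_xy, (5.2, 4.9)))  # positive sample for this box
print(sparse_max_pool_select(voxel_xy, scores))   # indices of selected voxels
```

Because the selected voxels are already local maxima, each one can be used to regress a box directly, which is how the summary describes the head working without NMS post-processing.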

Experiments

  • Ablated the effect of down-sampling layers in VoxelNeXt
  • Extended it to variants D_s, where s denotes the number of down-sampling layers
  • Without a dense head, D3 suffers a performance drop
  • Performance gradually increases from D3 to D5
  • Added one more variant with 5x5x5 kernel size
  • VoxelNeXt gradually drops redundant voxels according to feature magnitude
  • Drop ratio set to 0.5 as default
  • Ablated stages of voxel pruning
  • Combined 3D backbone and 2D sparse prediction head for better efficiency
  • Ablated effect of fully-connected layers or submanifold sparse convolutions to predict boxes in the sparse head
  • Most boxes are predicted from voxels inside the box, not necessarily from voxels near its center
  • Large gaps between ratios of different classes
  • Compared VoxelNeXt to CenterPoint
  • VoxelNeXt achieves 0.9% mAP and 1.0% NDS improvement
  • VoxelNeXt has 4.9% less orientation error than CenterPoint
  • Counted efficiency-related statistics of sparse CNN backbone network
  • Ablated effect of sparse max pooling and NMS
  • Ablated 3D tracking on nuScenes validation
  • Evaluated detection models on nuScenes and Waymo test split
  • Compared VoxelNeXt’s tracking performance with other methods on nuScenes and Argoverse2
  • VoxelNeXt achieves leading performance among methods with high efficiency
  • VoxelNeXt ranks 1st on the nuScenes 3D tracking LiDAR benchmark
  • Gap between theoretical FLOPs and actual inference speed
  • VoxelNeXt needs far fewer FLOPs than CenterPoint: 38.7G versus 186.6G, roughly 4.8x fewer
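The magnitude-based voxel pruning listed in the experiments (drop ratio 0.5 by default) can be sketched as below. The summary does not say how feature magnitude is computed or how pruned voxels are handled downstream, so this example assumes the mean absolute channel value as the importance score and simply discards the lowest-scoring voxels; `prune_voxels_by_magnitude` is a hypothetical helper, not the authors' code.

```python
def prune_voxels_by_magnitude(coords, feats, drop_ratio=0.5):
    """Keep the voxels with the largest feature magnitude.

    coords: list of voxel indices (any hashable tuples)
    feats:  list of feature vectors (lists of floats), one per voxel
    drop_ratio: fraction of voxels to discard (0.5 mirrors the default above)
    Assumption: importance = mean absolute value over channels.
    """
    importance = [sum(abs(v) for v in f) / len(f) for f in feats]
    n_keep = len(coords) - int(len(coords) * drop_ratio)
    # Rank voxels from most to least important, then restore input order.
    order = sorted(range(len(coords)), key=lambda i: importance[i], reverse=True)
    kept = sorted(order[:n_keep])
    return [coords[i] for i in kept], [feats[i] for i in kept]


coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
feats = [[0.1, -0.1], [2.0, 1.0], [-3.0, 0.5], [0.2, 0.0]]
kept_coords, kept_feats = prune_voxels_by_magnitude(coords, feats)
print(kept_coords)
```

With four input voxels and a drop ratio of 0.5, the two voxels with the weakest features are removed, halving the work of every later layer that operates on this set.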