Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • 3D Semantic Scene Completion (SSC) can provide dense geometric and semantic scene representations for autonomous driving and robotic systems.
  • Depth information is necessary for 3D geometry restoration.
  • A stereo SSC method named OccDepth is proposed to exploit implicit depth information from stereo images or RGBD images.
  • A reformed TartanAir benchmark, named SemanticTartanAir, is provided for testing OccDepth on the SSC task.
  • OccDepth achieves superior performance compared to the state-of-the-art RGB-inferred SSC method.

Paper Content

Introduction

  • Humans can use their eyes to construct 3D scenes
  • Dense representations of 3D scenes are useful for agents to perform tasks
  • Restoring 3D structures from 2D images is a challenging problem
  • 3D Semantic Scene Completion (SSC) task attempts to reconstruct geometric and semantic structures
  • Most 3D SSC techniques depend on 3D signals with depth information
  • The aim is to reduce the gap between vision-only and 3D-input solutions
  • OccDepth is the first stereo 3D SSC method with vision-only input
  • A Stereo Soft Feature Assignment (Stereo-SFA) module is proposed to better fuse 3D depth-aware features
  • An Occupancy Aware Depth (OAD) module is proposed to produce depth-aware 3D features
  • The SemanticTartanAir dataset is introduced to validate stereo-input scenarios
  • OccDepth outperforms all vision-only baseline methods and is close to 3D input solutions

Methods

Stereo soft feature assignment module

  • A good 2D-3D lifting function is needed for semantic occupancy prediction.
  • Stereo images provide implicit depth information that solves the scale ambiguity.
  • Feature mapping between 2D image features and voxel features is needed.
  • The correlation between left and right 2D features is used to calculate weights for 3D features.
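
The weighting idea in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `stereo_soft_fusion`, the use of cosine similarity as the correlation measure, and the mapping of that similarity to a weight in [0, 1] are all hypothetical choices. It assumes each voxel has already been projected into both images and a feature vector has been sampled from each view.

```python
import numpy as np

def stereo_soft_fusion(v_left, v_right, eps=1e-8):
    """Fuse per-voxel features sampled from the left and right images.

    v_left, v_right: arrays of shape (num_voxels, channels).
    Returns an array of shape (num_voxels, channels).
    """
    # Cosine similarity between the left and right feature of each voxel.
    dot = np.sum(v_left * v_right, axis=1)
    norms = np.linalg.norm(v_left, axis=1) * np.linalg.norm(v_right, axis=1)
    cos = dot / (norms + eps)
    # Assumption: map similarity from [-1, 1] to a soft weight in [0, 1].
    weight = 0.5 * (cos + 1.0)
    # Voxels whose two views disagree receive a smaller fused feature.
    return weight[:, None] * 0.5 * (v_left + v_right)
```

Voxels whose left and right features agree keep their averaged feature almost unchanged, while voxels with inconsistent views are suppressed, which is one way a correlation can act as a soft indicator of where surfaces are.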

Occupancy-aware depth module

  • OAD module uses predicted depth information to improve spatial information of voxel features
  • Depth distribution network Net_D predicts depth feature F_D
  • A softmax operator transforms F_D into the frustum depth distribution G_D
  • Linear-increasing discretization (LID) is used for depth discretization
  • The occupancy-aware voxel feature V_occ is obtained
  • Depth distribution network Net_D is trained with implicit and explicit supervision
  • Stereo depth distillation is used to help the depth distribution network
  • Ground-truth depth D_GT is generated by the stereo depth network LEAStereo
  • Binary cross entropy used to calculate depth loss
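
The depth steps above (discretization, softmax, binary cross entropy) can be sketched roughly as below. This is an illustrative sketch, not the paper's code: the function names, the tensor layout `(H, W, D)`, and the one-hot encoding of the target depth are assumptions, and the bin edges use the commonly cited linear-increasing formula, in which the width of bin i grows in proportion to i + 1.

```python
import numpy as np

def lid_bin_edges(d_min, d_max, num_bins):
    """Edges of linear-increasing depth bins (num_bins + 1 values)."""
    i = np.arange(num_bins + 1)
    return d_min + (d_max - d_min) * i * (i + 1) / (num_bins * (num_bins + 1))

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def depth_bce_loss(depth_logits, depth_target, edges, eps=1e-7):
    """Binary cross entropy between a predicted depth distribution and a
    one-hot target built from a (pseudo) ground-truth depth map.

    depth_logits: (H, W, D) raw depth features, one channel per bin.
    depth_target: (H, W) metric depth, e.g. from a stereo depth network.
    edges:        (D + 1,) bin edges from lid_bin_edges.
    """
    num_bins = depth_logits.shape[-1]
    # Softmax turns the depth feature into a per-pixel distribution over bins.
    prob = np.clip(softmax(depth_logits, axis=-1), eps, 1.0 - eps)
    # Index of the bin that contains each target depth (clamped to valid bins).
    idx = np.searchsorted(edges, depth_target, side="right") - 1
    idx = np.clip(idx, 0, num_bins - 1)
    one_hot = np.eye(num_bins)[idx]
    bce = -(one_hot * np.log(prob) + (1.0 - one_hot) * np.log(1.0 - prob))
    return float(bce.mean())
```

For example, `lid_bin_edges(2.0, 42.0, 4)` gives bins of width 4, 8, 12 and 16, so nearby depths are resolved more finely than distant ones.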

Losses

  • The task can be decomposed into occupancy prediction and semantic prediction
  • V_3D, generated from a 3D U-Net structure, is used to predict occupancy
  • F_3D^2class and F_3D^Nclass are used to generate the occupancy and semantic losses
  • The L_mono loss is used to optimize OccDepth
  • SC IoU and SSC mIoU are reported for modified baselines and OccDepth
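
The decomposition into an occupancy term and a semantic term can be illustrated with plain cross entropy on a 2-class head and an N-class head. This is only a sketch under assumptions: the function names, the convention that label 0 means free space, and the weights `w_occ` and `w_sem` are hypothetical, and the L_mono term mentioned above is not reproduced here.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross entropy.

    logits: (num_voxels, num_classes); labels: (num_voxels,) integer classes.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())

def occupancy_and_semantic_loss(occ_logits, sem_logits, sem_labels,
                                free_label=0, w_occ=1.0, w_sem=1.0):
    """Weighted sum of a 2-class occupancy loss and an N-class semantic loss."""
    # Assumption: a voxel is occupied when its semantic label is not free space.
    occ_labels = (sem_labels != free_label).astype(int)
    loss_occ = cross_entropy(occ_logits, occ_labels)
    loss_sem = cross_entropy(sem_logits, sem_labels)
    return w_occ * loss_occ + w_sem * loss_sem
```

Keeping the two terms separate makes it easy to change their relative weights, which is one way a loss weight adjustment could be applied.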

Tricks for mitigating over-fitting

  • 2D Pre-training to enhance semantic information
  • Data Augmentation to mitigate lack of training data
  • Loss Weight Adjustment to reduce over-fitting losses

Experiments

Datasets and metric

  • SemanticKITTI dataset provides 256x256x32 voxel grids labeled with 21 classes
  • NYUv2 dataset provides 240x144x240 voxel grids labeled with 13 classes
  • SemanticTartanAir dataset provides 120x48x120 voxel grids labeled with 14 classes
  • The evaluation metrics are intersection over union (IoU) of occupied voxels for scene completion and mean IoU (mIoU) over semantic classes for semantic scene completion
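
The two metrics can be computed from integer label grids as in the sketch below. The function names and the convention that label 0 denotes free space are assumptions; benchmark evaluation scripts typically also mask out unknown voxels, which is omitted here for brevity.

```python
import numpy as np

def sc_iou(pred, target, free_label=0):
    """Scene completion IoU: overlap of occupied voxels, ignoring semantics."""
    pred_occ = pred != free_label
    target_occ = target != free_label
    union = np.logical_or(pred_occ, target_occ).sum()
    if union == 0:
        return float("nan")
    return float(np.logical_and(pred_occ, target_occ).sum() / union)

def ssc_miou(pred, target, num_classes, free_label=0):
    """Semantic scene completion mIoU: mean per-class IoU over non-free classes."""
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        union = np.sum((pred == c) | (target == c))
        if union == 0:
            continue  # class absent from both prediction and target
        inter = np.sum((pred == c) & (target == c))
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")
```

A prediction can score a perfect SC IoU while still having a low SSC mIoU if every occupied voxel is found but some are assigned the wrong class.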

Comparison to the state of the art

  • OccDepth is compared to state-of-the-art SSC methods, which can be classified as RGB-inferred and 2.5D/3D-input methods
  • OccDepth outperforms other methods on SemanticKITTI and SemanticTartanAir
  • OccDepth achieves better results than MonoScene, a 2D baseline method
  • OccDepth has a close mIoU result compared to 2.5D/3D-input methods without using GT depth
  • OccDepth has higher IoU and mIoU than RGB-inferred methods

Qualitative results and ablation study

  • OccDepth restores object edges better in indoor scenes
  • OccDepth accurately identifies road signs, trunks, vehicles, and roads in outdoor scenes
  • All architectural components contribute to the best results
  • Stereo-SFA module brings significant improvements on both IoU and mIoU
  • OAD module brings significant improvement on mIoU with almost no increase in computation
  • Stereo-SFA is better than “Mean” and “Concat” fusion methods
  • LID depth discretization method achieves highest accuracy
  • Depth distillation with lidar depth data achieves higher accuracy than with stereo depth data

Conclusion

  • OccDepth is a depth-aware method for 3D SSC
  • OccDepth is trained on public datasets including SemanticKITTI and NYUv2
  • OccDepth outperforms current RGB-inferred baselines on all classes
  • OccDepth is the first stereo RGB-inferred method that is comparable to 2.5D/3D-input SSC methods
  • OccDepth captures better scene layout on both datasets
  • OccDepth is trained for 30 epochs with an AdamW optimizer