Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- 3D Semantic Scene Completion (SSC) can provide dense geometric and semantic scene representations for autonomous driving and robotic systems.
- Depth information is necessary for 3D geometry restoration.
- A stereo SSC method named OccDepth is proposed to exploit implicit depth information from stereo images or RGBD images.
- A reformed TartanAir benchmark, named SemanticTartanAir, is provided for testing OccDepth on the SSC task.
- OccDepth achieves superior performance compared to the state-of-the-art RGB-inferred SSC method.
Paper Content
Introduction
- Humans can use their eyes to construct 3D scenes
- Dense representations of 3D scenes are useful for agents to perform tasks
- Restoring 3D structures from 2D images is a challenging problem
- 3D Semantic Scene Completion (SSC) task attempts to reconstruct geometric and semantic structures
- Most 3D SSC techniques depend on 3D signals with depth information
- The paper aims to narrow the gap between vision-only and 3D-input solutions
- OccDepth is the first stereo 3D SSC method with vision-only input
- Stereo Soft Feature Assignment (Stereo-SFA) module to better fuse 3D depth-aware features
- Occupancy Aware Depth (OAD) module to produce depth-aware 3D features
- SemanticTartanAir dataset to validate the method on stereo-input scenes
- OccDepth outperforms all vision-only baseline methods and is close to 3D input solutions
Methods
Stereo soft feature assignment module
- A good 2D-to-3D lifting function is needed for semantic occupancy prediction.
- Stereo images provide implicit depth information that resolves scale ambiguity.
- Feature mapping between 2D image features and voxel features is needed.
- The correlation between left and right 2D features is used to calculate weights for 3D features.
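The weighting idea in the last bullet can be sketched for a single voxel as follows. This is a minimal illustration, not the paper's implementation: it assumes the correlation is a cosine similarity and that the fused feature is the left/right mean scaled by that weight. The function names `cosine_similarity` and `soft_assign` are hypothetical.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def soft_assign(left_feat, right_feat):
    """Fuse the left/right 2D features sampled at one voxel's projections.

    Assumption: the weight is the left/right cosine similarity, so features
    that agree across views are kept and features that disagree are damped.
    """
    w = cosine_similarity(left_feat, right_feat)
    return [w * 0.5 * (l + r) for l, r in zip(left_feat, right_feat)]

# Matching features (voxel likely on a surface seen by both cameras).
consistent = soft_assign([1.0, 2.0], [1.0, 2.0])
# Orthogonal features (the two projections see different content).
conflicting = soft_assign([1.0, 0.0], [0.0, 1.0])
```

With this weighting, a voxel whose two projections look alike keeps roughly its averaged feature, while a voxel whose projections disagree contributes almost nothing. That is one way stereo correlation can act as an implicit depth cue.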
Occupancy-aware depth module
- OAD module uses predicted depth information to improve spatial information of voxel features
- Depth distribution network $\mathrm{Net}_D$ predicts depth feature $F_D$
- Softmax operator transforms $F_D$ into frustum depth distribution $G_D$
- Linear-increasing discretization (LID) is used for depth discretization
- Occupancy-aware voxel feature $V_{occ}$ is obtained
- Depth distribution network $\mathrm{Net}_D$ is trained with implicit and explicit supervision
- Stereo depth distillation used to help depth distribution network
- Ground-truth depth $D_{GT}$ is generated by the stereo depth network LEAStereo
- Binary cross entropy used to calculate depth loss
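The depth-related steps above (LID discretization, softmax over depth bins, binary cross-entropy against a depth target) can be sketched for a single pixel as follows. This is a rough illustration under stated assumptions: the LID edge formula is the commonly used linear-increasing form, the depth range and bin count are made-up values, and the one-hot target built from the ground-truth depth is an assumption about how the loss is formed. All function names are hypothetical.

```python
import math

def lid_bin_edges(d_min, d_max, num_bins):
    """Bin edges whose widths grow linearly with the bin index (LID)."""
    scale = (d_max - d_min) / (num_bins * (num_bins + 1))
    return [d_min + scale * i * (i + 1) for i in range(num_bins + 1)]

def softmax(logits):
    """Turn per-bin depth logits into a distribution over depth bins."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def depth_to_bin(depth, edges):
    """Index of the bin containing `depth`, or None if out of range."""
    for i in range(len(edges) - 1):
        if edges[i] <= depth < edges[i + 1]:
            return i
    return None

def bce_depth_loss(probs, target_bin, eps=1e-7):
    """Mean binary cross-entropy against a one-hot depth-bin target."""
    loss = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        target = 1.0 if i == target_bin else 0.0
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss / len(probs)

# Illustrative values only: 4 bins covering 2 m to 50 m.
edges = lid_bin_edges(d_min=2.0, d_max=50.0, num_bins=4)
probs = softmax([0.1, 2.0, 0.3, -1.0])   # stand-in for one pixel of F_D
target = depth_to_bin(10.0, edges)       # stand-in for one pixel of D_GT
loss = bce_depth_loss(probs, target)
```

LID gives nearby depths finer bins than distant ones, which matches the fact that stereo depth error grows with distance.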
Losses
- Task can be decomposed into occupancy prediction and semantic prediction
- $V_{3D}$ generated from the 3D U-Net structure is used to predict occupancy
- $F_{3D}^{2class}$ and $F_{3D}^{Nclass}$ are used to generate occupancy and semantic losses
- $L_{mono}$ loss is used to optimize OccDepth
- SC IoU and SSC mIoU are reported for modified baselines and OccDepth
Tricks for mitigating over-fitting
- 2D Pre-training to enhance semantic information
- Data Augmentation to mitigate lack of training data
- Loss Weight Adjustment to reduce over-fitting losses
Experiments
Datasets and metric
- SemanticKITTI dataset provides 256x256x32 voxel grids labeled with 21 classes
- NYUv2 dataset provides 240x144x240 voxel grids labeled with 13 classes
- SemanticTartanAir dataset provides 120x48x120 voxel grids labeled with 14 classes
- Evaluation metric is intersection over union (IoU) of occupied voxels and mean IoU (mIoU) of semantic classes
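The two metrics in the last bullet can be sketched on flattened voxel labels as follows. This is a minimal illustration: the label conventions (0 for free space, 255 for voxels to ignore) and the function names are assumptions made for the example, not details confirmed by the paper.

```python
def sc_iou(pred, gt, free_label=0, ignore_label=255):
    """Scene-completion IoU: occupied vs. free, regardless of class."""
    inter = union = 0
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue  # skip voxels without a valid label
        p_occ, g_occ = p != free_label, g != free_label
        inter += p_occ and g_occ
        union += p_occ or g_occ
    return inter / union if union else 0.0

def ssc_miou(pred, gt, class_ids, ignore_label=255):
    """Mean of per-class IoU over the semantic classes in `class_ids`."""
    ious = []
    for c in class_ids:
        inter = union = 0
        for p, g in zip(pred, gt):
            if g == ignore_label:
                continue
            inter += (p == c) and (g == c)
            union += (p == c) or (g == c)
        if union:  # classes absent from both pred and gt are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six voxels, two semantic classes (1 and 2); the last voxel is ignored.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 255]
iou = sc_iou(pred, gt)                     # 3 / 4
miou = ssc_miou(pred, gt, class_ids=[1, 2])  # (1/3 + 1/2) / 2
```

IoU scores only the geometry (is a voxel occupied at all), while mIoU also requires the right class, which is why a method can score well on one and poorly on the other.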
Comparison to the state of the art
- OccDepth is compared to state-of-the-art SSC methods, which can be classified as RGB-inferred and 2.5D/3D-input methods
- OccDepth outperforms other methods on SemanticKITTI and SemanticTartanAir
- OccDepth achieves better results than MonoScene, a 2D baseline method
- OccDepth achieves an mIoU close to that of 2.5D/3D-input methods without using ground-truth depth
- OccDepth has higher IoU and mIoU than RGB-inferred methods
Qualitative results and ablation studies
- OccDepth restores object edges better in indoor scenes
- OccDepth accurately identifies road signs, trunks, vehicles, and roads in outdoor scenes
- All architectural components contribute to the best results
- Stereo-SFA module brings significant improvements on both IoU and mIoU
- OAD module brings significant improvement on mIoU with almost no increase in computation
- Stereo-SFA is better than “Mean” and “Concat” fusion methods
- LID depth discretization method achieves highest accuracy
- Depth distillation with lidar depth data achieves higher accuracy than with stereo depth data
Conclusion
- OccDepth is a depth-aware method for 3D SSC
- OccDepth is trained on public datasets including SemanticKITTI and NYUv2
- OccDepth outperforms current RGB-inferred baselines on all classes
- OccDepth is the first stereo RGB-inferred method that is comparable to 2.5D/3D-input SSC methods
- OccDepth captures better scene layout on both datasets
- OccDepth is trained for 30 epochs with an AdamW optimizer