Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

3D Semantic Scene Completion (SSC) can provide dense geometric and semantic scene representations for autonomous driving and robotic systems.
Depth information is necessary for 3D geometry restoration.
A stereo SSC method named OccDepth is proposed to exploit implicit depth information from stereo images or RGBD images.
A reformed TartanAir benchmark, named SemanticTartanAir, is provided for testing OccDepth on the SSC task.
OccDepth achieves superior performance compared to the state-of-the-art RGB-inferred SSC method.

Paper Content

Introduction

Humans can use their eyes to construct 3D scenes
Dense representations of 3D scenes are useful for agents to perform tasks
Restoring 3D structures from 2D images is a challenging problem
3D Semantic Scene Completion (SSC) task attempts to reconstruct geometric and semantic structures
Most 3D SSC techniques depend on 3D signals with depth information
OccDepth aims to reduce the gap between vision-only and 3D-input solutions
OccDepth is the first stereo 3D SSC method with vision-only input
Stereo Soft Feature Assignment (Stereo-SFA) module is proposed to better fuse 3D depth-aware features
Occupancy Aware Depth (OAD) module is proposed to produce depth-aware 3D features
SemanticTartanAir dataset is introduced to validate the method on stereo-input scenes
OccDepth outperforms all vision-only baseline methods and is close to 3D input solutions

Methods

Stereo soft feature assignment module

A good 2D-3D lifting function is needed for semantic occupancy prediction.
Stereo images provide implicit depth information that solves the scale ambiguity.
Feature mapping between 2D image features and voxel features is needed.
The correlation between left and right 2D features is used to calculate weights for 3D features.
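
The weighting idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the source says a left/right correlation yields weights for the 3D features, but the use of cosine similarity as that correlation, the averaging of the two views, and the function names `cosine_similarity` and `fuse_stereo_features` are choices made here for the example, not details confirmed by the summary.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors (assumed correlation measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def fuse_stereo_features(left_feat, right_feat):
    """Fuse the features one voxel samples from the left and right images.

    Assumption: weight = left/right correlation, fused = weight * mean(left, right).
    A voxel whose two views agree keeps its feature; a voxel whose views
    disagree (likely a wrong depth hypothesis) is suppressed.
    """
    weight = cosine_similarity(left_feat, right_feat)
    fused = [weight * 0.5 * (l + r) for l, r in zip(left_feat, right_feat)]
    return fused, weight

# Views agree: weight is close to 1 and the feature is preserved.
agree, w_agree = fuse_stereo_features([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])
# Views are orthogonal: weight is 0 and the voxel feature is zeroed out.
conflict, w_conflict = fuse_stereo_features([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

In a real network the two feature vectors would come from projecting each voxel centre into both image feature maps; that sampling step is omitted here.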

Occupancy-aware depth module

OAD module uses predicted depth information to improve spatial information of voxel features
Depth distribution network Net_D predicts depth feature F_D
Softmax operator transforms F_D into frustum depth distribution G_D
Linear-increasing discretization (LID) is used for depth discretization
Occupancy-aware voxel feature V_occ is obtained
Depth distribution network Net_D is trained with implicit and explicit supervision
Stereo depth distillation is used to help the depth distribution network
Ground-truth depth D_GT is generated by the stereo depth network LEAStereo
Binary cross entropy used to calculate depth loss
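
The steps listed above (LID bins, softmax over depth bins, occupancy-aware features, binary cross-entropy depth loss) can be sketched as follows. This is a minimal illustration, not the authors' code: the LID formula is the commonly used linear-increasing form, the depth range and bin count are made up for readability, and scaling a voxel feature by the probability of its own depth bin is an assumption about how V_occ is formed. All function names are hypothetical.

```python
import math

def lid_bin_edges(d_min, d_max, num_bins):
    """Bin edges whose widths grow linearly with depth (LID)."""
    bin_size = 2.0 * (d_max - d_min) / (num_bins * (num_bins + 1))
    return [d_min + bin_size * i * (i + 1) / 2.0 for i in range(num_bins + 1)]

def lid_bin_index(depth, d_min, d_max, num_bins):
    """Index of the LID bin that contains `depth` (clamped to the valid range)."""
    depth = min(max(depth, d_min), d_max)
    bin_size = 2.0 * (d_max - d_min) / (num_bins * (num_bins + 1))
    idx = -0.5 + 0.5 * math.sqrt(1.0 + 8.0 * (depth - d_min) / bin_size)
    return min(max(int(math.floor(idx)), 0), num_bins - 1)

def softmax(logits):
    """Turn depth logits (F_D) into a depth distribution (G_D)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def occupancy_aware_feature(voxel_feat, voxel_depth, depth_logits, d_min, d_max):
    """Assumed form of V_occ: voxel feature scaled by the probability of its depth bin."""
    probs = softmax(depth_logits)
    k = lid_bin_index(voxel_depth, d_min, d_max, len(probs))
    return [probs[k] * f for f in voxel_feat]

def depth_bce_loss(probs, gt_bin, eps=1e-7):
    """Binary cross entropy between a depth distribution and a one-hot GT bin."""
    loss = 0.0
    for k, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)
        t = 1.0 if k == gt_bin else 0.0
        loss -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return loss / len(probs)

# Illustrative range: 4 bins over 0..10 m give edges 0, 1, 3, 6, 10.
edges = lid_bin_edges(0.0, 10.0, 4)
v_occ = occupancy_aware_feature([2.0, 4.0], 4.0, [0.0, 0.0, 0.0, 0.0], 0.0, 10.0)
```

With uniform logits every bin has probability 0.25, so the example voxel feature is scaled by 0.25. The loss is lower when the predicted distribution concentrates on the ground-truth bin.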

Losses

Task can be decomposed into occupancy prediction and semantic prediction
V_3D generated from the 3D U-Net structure is used to predict occupancy
F_3D^{2class} and F_3D^{Nclass} are used to generate occupancy and semantic losses
L_mono loss is used to optimize OccDepth
SC IoU and SSC mIoU are reported for modified baselines and OccDepth

Tricks for mitigating over-fitting

2D Pre-training to enhance semantic information
Data Augmentation to mitigate lack of training data
Loss Weight Adjustment to reduce over-fitting losses

Experiments

Datasets and metric

SemanticKITTI dataset provides 256x256x32 voxel grids labeled with 21 classes
NYUv2 dataset provides 240x144x240 voxel grids labeled with 13 classes
SemanticTartanAir dataset provides 120x48x120 voxel grids labeled with 14 classes
Evaluation metric is intersection over union (IoU) of occupied voxels and mean IoU (mIoU) of semantic classes
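
To make the two metrics concrete, here is a small sketch over flattened voxel label lists. It rests on assumptions not stated in the summary: label 0 means free space, label 255 marks voxels to ignore, and classes that appear in neither prediction nor ground truth are left out of the mean. The function names are hypothetical.

```python
def sc_iou(pred, target, free_label=0, ignore_label=255):
    """Scene-completion IoU: occupied vs. free, regardless of semantic class."""
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        p_occ, t_occ = p != free_label, t != free_label
        if p_occ and t_occ:
            tp += 1
        elif p_occ:
            fp += 1
        elif t_occ:
            fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def ssc_miou(pred, target, num_classes, free_label=0, ignore_label=255):
    """Mean of per-class IoU over the semantic (non-free) classes."""
    ious = []
    for c in range(num_classes):
        if c == free_label:
            continue
        tp = fp = fn = 0
        for p, t in zip(pred, target):
            if t == ignore_label:
                continue
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1
            elif t == c:
                fn += 1
        denom = tp + fp + fn
        if denom:  # assumption: skip classes absent from both pred and target
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0

# Six voxels, classes {0: free, 1, 2}; the fifth voxel is ignored.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 255, 1]
iou = sc_iou(pred, target)          # 3 / 4 = 0.75
miou = ssc_miou(pred, target, 3)    # (1/3 + 1/2) / 2 = 5/12
```

The example shows why the two numbers differ: a voxel predicted as the wrong class still counts as correctly occupied for IoU, but it hurts the per-class scores that make up mIoU.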

Comparison to the state of the art

OccDepth is compared to state-of-the-art SSC methods, which can be classified as RGB-inferred and 2.5D/3D-input methods
OccDepth outperforms other methods on SemanticKITTI and SemanticTartanAir
OccDepth achieves better results than MonoScene, a 2D baseline method
OccDepth achieves mIoU close to that of 2.5D/3D-input methods without using ground-truth depth
OccDepth has higher IoU and mIoU than RGB-inferred methods

Qualitative results and ablation studies

OccDepth restores object edges better in indoor scenes
OccDepth accurately identifies road signs, trunks, vehicles, and roads in outdoor scenes
All architectural components contribute to the best results
Stereo-SFA module brings significant improvements on both IoU and mIoU
OAD module brings significant improvement on mIoU with almost no increase in computation
Stereo-SFA is better than “Mean” and “Concat” fusion methods
LID depth discretization method achieves highest accuracy
Depth distillation with lidar depth data achieves higher accuracy than with stereo depth data

Conclusion

OccDepth is a depth-aware method for 3D SSC
OccDepth is trained on public datasets including SemanticKITTI and NYUv2
OccDepth outperforms current RGB-inferred baselines on all classes
OccDepth is the first stereo RGB-inferred method that is comparable to 2.5D/3D-input SSC methods
OccDepth captures better scene layout on both datasets
OccDepth is trained for 30 epochs with an AdamW optimizer
