Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Task of open-world class-agnostic object detection is addressed
RGB-based models suffer from overfitting and fail to detect novel-looking objects
Geometric cues such as depth and normals are incorporated to train an object proposal network
GOOD significantly improves detection recall for novel object categories and performs well with few training classes
Using a single “person” class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100

Task of open-world class-agnostic object detection
RGB-based models suffer from overfitting and fail at detecting novel-looking objects
Incorporating geometric cues such as depth and normals to train an object proposal network
Geometry-guided Open-world Object Detector (GOOD) improves detection recall for novel object categories

The standard object detection task is to detect objects from a predefined class list.
In the open-world setup, object detectors are required to detect all the objects in the scene even though they have only been trained on objects from a limited number of classes.
Current state-of-the-art object detectors typically struggle in the open-world setup.
Estimating geometric cues such as depth and normals from a single RGB image has been an active research area for a long time.
Geometric cues possess built-in invariance to many changes and are more class-agnostic than RGB signals.
Monocular estimators for mid-level representations have significantly advanced in terms of prediction quality and generalization to novel scenes.
We propose to use a pseudo-labeling method for incorporating geometric cues into open-world object detector training.
Incorporating the geometry cues can significantly improve the detection recall of unseen objects.
Geometry cues help discover novel-looking objects that RGB-based detectors cannot detect.
RGB-based detectors can make use of more annotations with their strong representation learning ability to generalize to novel, unseen categories.

Models trained on RGB images tend to over-rely on appearance cues for object detection.
Involving novel objects into training can mitigate the appearance bias.
Geometric cues such as depth and normals are used to detect unannotated novel objects.

Train object proposal network on depth or normal input
Filter out detected bounding boxes that overlap with base class annotations
Add remaining top-k boxes to ground truth annotations
Train class-agnostic object detector using extended bounding box annotation pool
Use strong data augmentation during Phase-II training

Target two major challenges of open-world class-agnostic object detection: cross-category and cross-dataset generalization
Split 80 classes in COCO dataset into two parts: one for training, one for testing
First benchmark: single “person” class and 79 non-person classes
Second benchmark: 20 PASCAL-VOC classes for training, 60 for testing
Cross-dataset evaluation: ADE20K dataset for testing

Geometric cues are less sensitive to appearance shifts across classes.
Geometric cues can achieve higher per-novel-class AR@5 than RGB in many categories.
Geometric cues are better than edges and pairwise affinities.
Geometric cues are better than adding shape bias and multi-scale augmentation.
Depth and normals outperform 2D edge and PA on detecting novel objects.
Data augmentation is helpful to counteract the noise in pseudo labels.
More base classes in training allow object detectors to better detect novel objects.
Ensembling pseudo labels is slightly better than using stacked inputs for Phase-I training.
Stacking RGB with geometric cues leads to inferior performance.
Adding pseudo boxes from RGB leads to no performance gains or even worsens the performance.
Object proposal networks trained on RGB favor smaller detection boxes.