
  • The paper addresses the task of open-world class-agnostic object detection
  • RGB-based models suffer from overfitting and fail to detect novel-looking objects
  • Geometric cues such as depth and normals are incorporated to train an object proposal network
  • GOOD significantly improves detection recall for novel object categories and performs well with few training classes
  • Using a single “person” class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100

Paper Content


  • Task of open-world class-agnostic object detection
  • RGB-based models suffer from overfitting and fail at detecting novel-looking objects
  • Incorporating geometric cues such as depth and normals to train an object proposal network
  • Geometry-guided Open-world Object Detector (GOOD) improves detection recall for novel object categories


  • The standard object detection task is to detect objects from a predefined class list.
  • In the open-world setup, object detectors are required to detect all the objects in the scene even though they have only been trained on objects from a limited number of classes.
  • Current state-of-the-art object detectors typically struggle in the open-world setup.
  • Estimating geometric cues such as depth and normals from a single RGB image has been an active research area for a long time.
  • Geometric cues possess built-in invariance to many changes and are more class-agnostic than RGB signals.
  • Monocular estimators for mid-level representations have significantly advanced in terms of prediction quality and generalization to novel scenes.
  • We propose to use a pseudo-labeling method for incorporating geometric cues into open-world object detector training.
  • Incorporating geometric cues can significantly improve the detection recall of unseen objects.
  • Geometric cues help discover novel-looking objects that RGB-based detectors cannot detect.
  • Given more annotations, RGB-based detectors can use their strong representation learning ability to generalize to novel, unseen categories.

Exploiting geometric cues

  • Models trained on RGB images tend to over-rely on appearance cues for object detection.
  • Including novel objects in training can mitigate the appearance bias.
  • Geometric cues such as depth and normals are used to detect unannotated novel objects.

Pseudo labeling method

  • Phase I: train an object proposal network on depth or normal input
  • Filter out detected bounding boxes that overlap with base-class annotations
  • Add the remaining top-k boxes to the ground-truth annotations as pseudo labels
  • Phase II: train a class-agnostic object detector on the extended bounding-box annotation pool
  • Use strong data augmentation during Phase-II training
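
The filtering and top-k steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the IoU threshold of 0.5, and the default k=3 are assumptions chosen for the example, and boxes are assumed to be in [x1, y1, x2, y2] format.

```python
import numpy as np


def pairwise_iou(a, b):
    """IoU between every box in a (N, 4) and every box in b (M, 4)."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    top_left = np.maximum(a[:, None, :2], b[None, :, :2])
    bottom_right = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)


def select_pseudo_boxes(proposals, scores, base_boxes, k=3, iou_thresh=0.5):
    """Drop proposals overlapping base-class boxes, then keep the k best.

    k and iou_thresh are illustrative values, not taken from the paper.
    """
    proposals = np.asarray(proposals, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float).reshape(-1)
    base_boxes = np.asarray(base_boxes, dtype=float).reshape(-1, 4)
    keep = np.ones(len(proposals), dtype=bool)
    if len(proposals) and len(base_boxes):
        # A proposal is discarded if it overlaps any annotated base box.
        keep = pairwise_iou(proposals, base_boxes).max(axis=1) < iou_thresh
    proposals, scores = proposals[keep], scores[keep]
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return proposals[top]


def extend_annotations(base_boxes, pseudo_boxes):
    """Annotation pool for the second phase: base boxes plus pseudo boxes."""
    base_boxes = np.asarray(base_boxes, dtype=float).reshape(-1, 4)
    pseudo_boxes = np.asarray(pseudo_boxes, dtype=float).reshape(-1, 4)
    return np.concatenate([base_boxes, pseudo_boxes], axis=0)


base = [[0, 0, 10, 10]]
props = [[1, 1, 10, 10], [20, 20, 30, 30], [40, 40, 50, 50], [60, 60, 70, 70]]
scores = [0.9, 0.8, 0.3, 0.6]
pseudo = select_pseudo_boxes(props, scores, base, k=2)
pool = extend_annotations(base, pseudo)
```

In this toy case the highest-scoring proposal is discarded because it overlaps the annotated base box, so the pool ends up with the base box plus the two best non-overlapping proposals.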


  • Target two major challenges of open-world class-agnostic object detection: cross-category and cross-dataset generalization
  • Split 80 classes in COCO dataset into two parts: one for training, one for testing
  • First benchmark: single “person” class and 79 non-person classes
  • Second benchmark: 20 PASCAL-VOC classes for training, 60 for testing
  • Cross-dataset evaluation: ADE20K dataset for testing
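
To make the cross-category protocol concrete, the sketch below scores a detector by average recall at k proposals (AR@k) on novel-class boxes only. It is a simplified illustration under stated assumptions: the per-image dictionary layout and function names are hypothetical, proposals are assumed to be sorted by descending score, and recall is averaged over IoU thresholds 0.5 to 0.95 using the best-overlapping proposal per ground-truth box rather than the one-to-one matching of the official COCO tooling.

```python
import numpy as np


def pairwise_iou(a, b):
    """IoU between every box in a (N, 4) and every box in b (M, 4)."""
    a = np.asarray(a, dtype=float).reshape(-1, 4)
    b = np.asarray(b, dtype=float).reshape(-1, 4)
    top_left = np.maximum(a[:, None, :2], b[None, :, :2])
    bottom_right = np.minimum(a[:, None, 2:], b[None, :, 2:])
    wh = np.clip(bottom_right - top_left, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / np.maximum(union, 1e-12)


def novel_average_recall(images, base_classes, k=100, iou_thresholds=None):
    """AR@k over ground-truth boxes whose label is not a base (training) class.

    images: list of dicts with keys "proposals", "gt_boxes", "gt_labels"
    (a hypothetical layout used only for this example).
    """
    if iou_thresholds is None:
        iou_thresholds = np.linspace(0.5, 0.95, 10)
    best = []
    for img in images:
        gt = np.asarray(img["gt_boxes"], dtype=float).reshape(-1, 4)
        is_novel = np.array(
            [label not in base_classes for label in img["gt_labels"]], dtype=bool
        )
        gt = gt[is_novel]
        if len(gt) == 0:
            continue
        top = np.asarray(img["proposals"], dtype=float).reshape(-1, 4)[:k]
        if len(top) == 0:
            best.append(np.zeros(len(gt)))
        else:
            # Best IoU achieved by any of the top-k proposals for each novel box.
            best.append(pairwise_iou(gt, top).max(axis=1))
    if not best:
        return 0.0
    best = np.concatenate(best)
    return float(np.mean([(best >= t).mean() for t in iou_thresholds]))


images = [{
    "proposals": [[50, 50, 60, 60], [0, 0, 10, 10]],
    "gt_boxes": [[0, 0, 10, 10], [50, 50, 60, 60]],
    "gt_labels": ["dog", "person"],
}]
ar_at_1 = novel_average_recall(images, base_classes={"person"}, k=1)
ar_at_2 = novel_average_recall(images, base_classes={"person"}, k=2)
```

With "person" as the only base class, the "dog" box is the sole novel object. The top-1 proposal covers the person, so AR@1 is 0, while the second proposal matches the dog exactly and AR@2 is 1.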

Advantages of geometric cues

  • Geometric cues are less sensitive to appearance shifts across classes.
  • Geometric cues can achieve higher per-novel-class AR@5 than RGB in many categories.
  • Geometric cues are better than edges and pairwise affinities.
  • Geometric cues are better than adding shape bias and multi-scale augmentation.
  • Depth and normals outperform 2D edges and pairwise affinity (PA) at detecting novel objects.
  • Data augmentation is helpful to counteract the noise in pseudo labels.
  • More base classes in training allow object detectors to better detect novel objects.
  • Ensembling pseudo labels is slightly better than using stacked inputs for Phase-I training.
  • Stacking RGB with geometric cues leads to inferior performance.
  • Adding pseudo boxes from RGB yields no performance gain and can even hurt performance.
  • Object proposal networks trained on RGB favor smaller detection boxes.
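
The difference between stacking inputs and ensembling pseudo labels can be illustrated as follows. This is a sketch under assumptions, not the paper's procedure: stacking is shown as channel concatenation of a depth map and a normal map into one network input, while ensembling is shown as merging the pseudo boxes produced by two separately trained proposal networks. The duplicate-removal rule (skip a box whose IoU with an already kept box reaches 0.5) and all function names are hypothetical.

```python
import numpy as np


def stack_inputs(depth, normals):
    """Stack a depth map (H, W) and a normal map (H, W, 3) into (H, W, 4)."""
    depth = np.asarray(depth, dtype=float)
    normals = np.asarray(normals, dtype=float)
    return np.concatenate([depth[..., None], normals], axis=-1)


def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ensemble_pseudo_boxes(depth_boxes, normal_boxes, iou_thresh=0.5):
    """Merge pseudo boxes from two models, skipping near-duplicates.

    The duplicate rule and threshold are assumptions made for this example.
    """
    merged = [list(box) for box in depth_boxes]
    for box in normal_boxes:
        if all(box_iou(box, kept) < iou_thresh for kept in merged):
            merged.append(list(box))
    return merged


stacked = stack_inputs(np.zeros((4, 5)), np.zeros((4, 5, 3)))
merged = ensemble_pseudo_boxes(
    [[0, 0, 10, 10]],
    [[1, 1, 10, 10], [20, 20, 30, 30]],
)
```

Here the stacked input has four channels, and the merged pseudo-label set keeps the depth box plus the one normal-derived box that does not duplicate it.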