Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Goal of paper is to detect objects by exploiting their interrelationships.
  • Infer graph prior from object co-occurrence statistics.
  • Model object relations as a function of initial class predictions and co-occurrence priors.
  • Learn object-relation joint distribution via energy based modeling.
  • Experiments show method is detector agnostic, end-to-end trainable, and beneficial for rare object classes.
  • Establish consistent improvement over object detectors and state-of-the-art methods.

Paper Content

Introduction

  • Object detection is a classical task in computer vision
  • Deep convolutional networks and self-attention based transformer networks are used
  • These architectures focus on the image space for feature representation
  • Jiang et al. hypothesizes that incorporating prior knowledge in the training of a detector should improve detection
  • Xu et al. propose to calculate object relations solely from object features and their spatial orientation
  • Xu et al. also introduce a different way of leveraging the prior information
  • We propose a novel way of representing object-relations in an image
  • We propose a way to jointly model object-relation distributions
  • We create an end-to-end framework where relation edges are created based on initial class predictions
  • We demonstrate the potential of our graph priors for object detection
  • We present an energy-based framework for learning the object-relation distribution in an image
  • Our method is detector agnostic, end-to-end trainable, and beneficial for rare object classes
  • We establish a consistent improvement over object detectors and state-of-the-art methods
  • Object detection models can be divided into convolution and transformer networks
  • ObjectBox treats objects as center-points in a shape and size agnostic way
  • Object detectors use spatial context information in the image space
  • Early works used object relations as a post-processing step
  • Recent works model objects and relations together in a graph structure
  • Works exploit priors like cooccurrence or attributes of object classes to obtain a graph representation
  • Energy-based models bring simplicity and generality in likelihood modeling

Method

Graph priors for object detection

  • Visual features are extracted from an input image
  • Prior knowledge is used to create an edge connectivity matrix
  • Visual Genome dataset is used as a source of prior knowledge
  • Graphical representation of the input image is obtained based on feature vectors and edge values
  • Edge connectivity matrix is calculated based on class predictions and prior matrix
  • Joint task loss function is used to optimize for final task of classification

Potential of graph priors

  • Experiment performed to test optimal condition of proposed graph formulation
  • Experiment replaces set of edges with ground truth relation set
  • Ground truth graph created by performing IoU-matching and looking up prior knowledge matrix
  • Graph priors improve detection results of base network by large margin
  • Edge matrix is a function of class probability matrix generated from feature vector set

Energy-based graph refinement

  • Energy-based models represent joint probability distribution of two variables x and y
  • Computing the normalization constant (Z(ฮธ)) is intractable
  • Maximizing log-likelihood is equivalent to minimizing KL-divergence
  • Stochastic gradient Langevin Dynamics (SGLD) used to approximate model distribution
  • Represent image as graph G=(N, E)
  • Learn conditional distribution p(z|N, E)
  • Model joint distribution with classification model and energy function
  • Initialize graph G0 with base detector predictions
  • Update graph iteratively with energy function
  • Optimize model parameters by minimizing joint loss function
  • Inference: pass image through base network, refine graph t times, feed through message passing and classification model

Experiments

Datasets, evaluation and implementation

  • Experiments conducted on Visual Genome and MS-COCO 2017 datasets
  • Task is to localize and classify objects into preset categories
  • Visual Genome dataset split into 87.9K for training and 5K for testing
  • Evaluated on 80 object categories of MS-COCO 2017
  • Base detector is Faster-RCNN and DETR
  • Message passing and energy model used
  • Results and source code made publicly available

State-of-the-art comparison

  • Comparing our method to state-of-the-art on VG1000 and VG3000
  • Achieved state-of-the-art results on almost all metrics on VG1000 and VG3000
  • Good improvement in average recall numbers on VG1000 and VG3000
  • Results on MS-COCO comparable to state-of-the-art
  • Graph prior ablation study on VG1000
  • Performance of our method for different proposal numbers on VG1000
  • Exploiting object context information for object detection via graph priors
  • Using graph priors during training and refinement at test time
  • New state-of-the-art on Visual Genome data partitions for 1,000 and 3,000 object categories
  • Enhanced features obtained by aggregating node and edge information
  • Positive effects of exploiting co-occurrence priors
  • Failure cases due to overconfident wrong results