Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Goal of paper is to detect objects by exploiting their interrelationships.
- Infer graph prior from object co-occurrence statistics.
- Model object relations as a function of initial class predictions and co-occurrence priors.
- Learn object-relation joint distribution via energy based modeling.
- Experiments show method is detector agnostic, end-to-end trainable, and beneficial for rare object classes.
- Establish consistent improvement over object detectors and state-of-the-art methods.
Paper Content
Introduction
- Object detection is a classical task in computer vision
- Deep convolutional networks and self-attention based transformer networks are used
- These architectures focus on the image space for feature representation
- Jiang et al. hypothesizes that incorporating prior knowledge in the training of a detector should improve detection
- Xu et al. propose to calculate object relations solely from object features and their spatial orientation
- Xu et al. also introduce a different way of leveraging the prior information
- We propose a novel way of representing object-relations in an image
- We propose a way to jointly model object-relation distributions
- We create an end-to-end framework where relation edges are created based on initial class predictions
- We demonstrate the potential of our graph priors for object detection
- We present an energy-based framework for learning the object-relation distribution in an image
- Our method is detector agnostic, end-to-end trainable, and beneficial for rare object classes
- We establish a consistent improvement over object detectors and state-of-the-art methods
Related works
- Object detection models can be divided into convolution and transformer networks
- ObjectBox treats objects as center-points in a shape and size agnostic way
- Object detectors use spatial context information in the image space
- Early works used object relations as a post-processing step
- Recent works model objects and relations together in a graph structure
- Works exploit priors like cooccurrence or attributes of object classes to obtain a graph representation
- Energy-based models bring simplicity and generality in likelihood modeling
Method
Graph priors for object detection
- Visual features are extracted from an input image
- Prior knowledge is used to create an edge connectivity matrix
- Visual Genome dataset is used as a source of prior knowledge
- Graphical representation of the input image is obtained based on feature vectors and edge values
- Edge connectivity matrix is calculated based on class predictions and prior matrix
- Joint task loss function is used to optimize for final task of classification
Potential of graph priors
- Experiment performed to test optimal condition of proposed graph formulation
- Experiment replaces set of edges with ground truth relation set
- Ground truth graph created by performing IoU-matching and looking up prior knowledge matrix
- Graph priors improve detection results of base network by large margin
- Edge matrix is a function of class probability matrix generated from feature vector set
Energy-based graph refinement
- Energy-based models represent joint probability distribution of two variables x and y
- Computing the normalization constant (Z(ฮธ)) is intractable
- Maximizing log-likelihood is equivalent to minimizing KL-divergence
- Stochastic gradient Langevin Dynamics (SGLD) used to approximate model distribution
- Represent image as graph G=(N, E)
- Learn conditional distribution p(z|N, E)
- Model joint distribution with classification model and energy function
- Initialize graph G0 with base detector predictions
- Update graph iteratively with energy function
- Optimize model parameters by minimizing joint loss function
- Inference: pass image through base network, refine graph t times, feed through message passing and classification model
Experiments
Datasets, evaluation and implementation
- Experiments conducted on Visual Genome and MS-COCO 2017 datasets
- Task is to localize and classify objects into preset categories
- Visual Genome dataset split into 87.9K for training and 5K for testing
- Evaluated on 80 object categories of MS-COCO 2017
- Base detector is Faster-RCNN and DETR
- Message passing and energy model used
- Results and source code made publicly available
State-of-the-art comparison
- Comparing our method to state-of-the-art on VG1000 and VG3000
- Achieved state-of-the-art results on almost all metrics on VG1000 and VG3000
- Good improvement in average recall numbers on VG1000 and VG3000
- Results on MS-COCO comparable to state-of-the-art
- Graph prior ablation study on VG1000
- Performance of our method for different proposal numbers on VG1000
- Exploiting object context information for object detection via graph priors
- Using graph priors during training and refinement at test time
- New state-of-the-art on Visual Genome data partitions for 1,000 and 3,000 object categories
- Enhanced features obtained by aggregating node and edge information
- Positive effects of exploiting co-occurrence priors
- Failure cases due to overconfident wrong results