Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Object models are moving from predicting category labels to providing detailed descriptions of object instances.
  • PACO is a dataset that provides part masks, attributes, and object categories across image and video datasets.
  • PACO contains 641K part masks, 260K object boxes, and 55 attributes.
  • Evaluation metrics and benchmark results are provided for 3 tasks on the dataset.
  • Dataset, models, and code are open-sourced.

Paper Content

Introduction

  • Tasks requiring fine-grained understanding of objects are gaining importance
  • Representing objects through category labels is no longer sufficient
  • No large benchmark datasets for common objects with joint annotation of part masks, object attributes and part attributes
  • Datasets with part masks for common objects are limited
  • Large-scale datasets with object-level attributes exist, but none with part-level attribute annotations
  • PACO dataset provides 641K part masks annotated in 77K images for 260K object instances
  • PACO provides annotations for 55 different attributes for both objects and parts
  • PACO dataset and benchmarks open-sourced
  • Availability of large-scale datasets has accelerated object understanding
  • Popular benchmark datasets for object detection and segmentation include COCO, LVIS, Object365, Open Images and Pascal
  • Domain-specific datasets for fashion, medical images and OCR
  • LVIS introduced federated annotations to scale to larger vocabularies
  • Part datasets provide pixel-level part annotations for common objects
  • Attributes have long been used to describe objects
  • Visual Attributes in the Wild (VAW) dataset for attribute classification
  • Fashionpedia dataset for fashion provides part and attribute annotations
  • PACO aims to generalize this to common object categories
  • Attributes used for zero-shot object recognition
  • PACO benchmarks part and attribute based queries

Dataset construction

Image sources

  • PACO is constructed from two sources: LVIS in the image domain and Ego4D in the video domain.
  • LVIS was chosen due to its large object vocabulary and federated dataset construction.
  • Ego4D was chosen due to its temporally aligned narrations.

Object vocabulary selection

  • Mined object categories from narrations accompanying Ego4D
  • Intersected with common and frequent categories in LVIS
  • Chose categories with at least 20 instances in Ego4D
  • Resulted in 75 categories commonly found in both LVIS and Ego4D

Parts vocabulary selection

  • No exhaustive ontology of parts for common objects
  • Part names mined from web-images
  • Part names manually curated to retain visible and distinguishable parts
  • 200 part classes shared across 75 objects, 456 object-part classes when expanded

Attribute vocabulary selection

  • Attributes can be used to distinguish different objects
  • User study conducted to identify sufficient set of attributes
  • Final vocabulary of 29 colors, 10 patterns and markings, 13 materials and 3 levels of reflectance

Annotation pipeline

  • Object bounding box and mask annotation for Ego4D
  • Part mask annotation for both LVIS and Ego4D
  • Object and part attributes annotation for both LVIS and Ego4D
  • Instance IDs annotation for Ego4D
  • Negative images annotated for each object class
  • 10-50 instances of each object annotated by expert annotators
  • 90% of object classes have mIoU โ‰ฅ 0.75 with gold-annotated masks
  • Accuracy of 85% on gold annotations provided by expert annotators

Dataset statistics

  • PACO-LVIS has 641K part mask annotations
  • PACO-LVIS has a long-tail distribution of part masks
  • PACO has 10x more object instances with parts than PartsImageNet
  • PACO has 297K part masks with attribute annotations
  • VAW has 261K object masks with attributes, but PACO has 421K
  • PACO has a larger density of attributes per image than VAW

Tasks and evaluation benchmark

  • Evaluate quality of parts segmentation and attributes prediction
  • Leverage parts and attributes for zero-shot object instance detection

Dataset splits

  • PACO-LVIS and PACO-EGO4D datasets are split into train, val and test sets.
  • Test split of PACO-LVIS contains 9443 images.
  • Train and val splits of PACO-LVIS contain 45790 and 2410 images respectively.
  • Ego4D is split into 15667 train, 825 val and 9892 test images.
  • Object instance IDs in Ego4D train and test sets are disjoint.

Federated dataset for object categories

  • Federated dataset from LVIS [15] has three types of images: negative, exhaustive positive, and non-exhaustive positive
  • Negative images do not contain any instance of the object
  • Exhaustive positive images have all instances of the object annotated
  • Non-exhaustive positive images have at least one instance of the object annotated, but not all instances

Part segmentation

  • Algorithm detects and segments part-masks of different object instances in an unseen image
  • Assigns an (object, part) label with a confidence score to the part-mask
  • (Object, part) pairs are from a fixed known set
  • Evaluates task for (object, part) labels instead of only part labels
  • Uses mask and box Average Precision (AP) metrics defined in COCO
  • Computes AP for (c, a) at different IoU thresholds and reports the average

Instance-level attributes prediction

  • Trained a simple extension of Mask R-CNN and ViTdet models with an additional attribute head
  • Reported box AP values for models trained on PACO-LVIS
  • Attribute prediction is harder than object detection
  • Larger models fare better for this task
  • Analyzed sensitivity of AP obj attr to attribute prediction
  • Compared AR@1 on a subset of queries for zero-shot instance detection

Zero-shot instance detection

  • Generated benchmark numbers for task by leveraging models trained in section
  • Used scores corresponding to object, part, object attributes, and part attributes to rank object bounding boxes
  • Combined scores using geometric mean to get one final score for each box
  • Results show L1 > L3 > L2 due to trade-off between two opposing factors
  • Ablation studies measure importance of different object and part attributes
  • Trained two mask R-CNN and two ViT-det models with 531 classes
  • Evaluated publicly available models from Detic and MDETR without further fine-tuning
  • Results show limited performance for evaluated models
  • PACO-EGO4D can serve as useful dataset for few-shot instance detection
  • Benchmarked naive 2-stage model for k ranging from 1-5 and compared to zero-shot model
  • 20+ point gap between best zero-shot model and one-shot model, gap widens as k increases

Conclusion

  • Introduced PACO, a dataset designed to enable research towards joint detection of objects, parts and attributes of common objects
  • 75 common object categories spanning both image and video datasets
  • 3 benchmark tasks which showcase unique challenges in the dataset
  • Manually defined parts with sample reference images
  • 5 attribute types: color, shape, reflectance, materials and patterns & marking
  • 98% of object instance pairs could be distinguished only using the 55 attributes included in PACO
  • Color is the biggest discriminative attribute type
  • Instances annotated with unique instance IDs
  • Distribution of instances across 75 object categories in PACO-LVIS and PACO-EGO4D
  • Models used to train the joint segmentation and attribute prediction models
  • Evaluated models on the test splits for both the datasets
  • Explored the effect of joint training on multiple tasks on object segmentation results
  • Calculated lower and upper bounds on AP obj att
  • Part association process to associate part boxes to objects