Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Object models are moving from predicting category labels to providing detailed descriptions of object instances.
PACO is a dataset that provides part masks, attributes, and object categories across image and video datasets.
PACO contains 641K part masks, 260K object boxes, and 55 attributes.
Evaluation metrics and benchmark results are provided for 3 tasks on the dataset.
Dataset, models, and code are open-sourced.

Paper Content

Introduction

Tasks requiring fine-grained understanding of objects are gaining importance
Representing objects through category labels is no longer sufficient
No large benchmark datasets for common objects with joint annotation of part masks, object attributes and part attributes
Datasets with part masks for common objects are limited
Large-scale datasets with object-level attributes exist, but none with part-level attribute annotations
PACO dataset provides 641K part masks annotated in 77K images for 260K object instances
PACO provides annotations for 55 different attributes for both objects and parts
PACO dataset and benchmarks open-sourced

Related work

Availability of large-scale datasets has accelerated object understanding
Popular benchmark datasets for object detection and segmentation include COCO, LVIS, Objects365, Open Images and Pascal
Domain-specific datasets for fashion, medical images and OCR
LVIS introduced federated annotations to scale to larger vocabularies
Part datasets provide pixel-level part annotations for common objects
Attributes have long been used to describe objects
Visual Attributes in the Wild (VAW) dataset for attribute classification
Fashionpedia dataset for fashion provides part and attribute annotations
PACO aims to generalize this to common object categories
Attributes used for zero-shot object recognition
PACO benchmarks part and attribute based queries

Dataset construction

Image sources

PACO is constructed from two sources: LVIS in the image domain and Ego4D in the video domain.
LVIS was chosen due to its large object vocabulary and federated dataset construction.
Ego4D was chosen due to its temporally aligned narrations.

Object vocabulary selection

Mined object categories from narrations accompanying Ego4D
Intersected with common and frequent categories in LVIS
Chose categories with at least 20 instances in Ego4D
Resulted in 75 categories commonly found in both LVIS and Ego4D
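
The selection steps above amount to a set intersection followed by a count threshold. A minimal sketch, assuming the mined narration mentions arrive as a flat list of category names and that mention counts stand in for Ego4D instance counts (the function name, inputs, and toy data are hypothetical, not taken from the paper's code):

```python
from collections import Counter


def select_object_vocabulary(narration_mentions, lvis_categories, min_instances=20):
    """Keep categories that are in the LVIS subset and occur often enough in Ego4D.

    narration_mentions: iterable of category names mined from narrations (hypothetical input).
    lvis_categories: set of common and frequent LVIS category names.
    """
    counts = Counter(narration_mentions)
    return sorted(
        name
        for name, n in counts.items()
        if name in lvis_categories and n >= min_instances
    )


# Toy data: "drone" is frequent but not in the LVIS subset, "vase" is too rare.
mentions = ["mug"] * 25 + ["knife"] * 40 + ["drone"] * 30 + ["vase"] * 5
lvis_subset = {"mug", "knife", "vase"}
selected = select_object_vocabulary(mentions, lvis_subset)
```

With the toy data only "knife" and "mug" survive both filters.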

Parts vocabulary selection

No exhaustive ontology of parts for common objects
Part names mined from web-images
Part names manually curated to retain visible and distinguishable parts
200 part classes shared across 75 objects, 456 object-part classes when expanded
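
The difference between shared part classes and expanded object-part classes can be illustrated with a toy vocabulary. This is a sketch only: the three objects and their parts below are made up for illustration and are not the PACO vocabulary.

```python
def expand_object_parts(parts_by_object):
    """Return the sorted (object, part) classes and the sorted shared part names."""
    object_part_classes = sorted(
        {(obj, part) for obj, parts in parts_by_object.items() for part in parts}
    )
    part_classes = sorted({part for _, part in object_part_classes})
    return object_part_classes, part_classes


# Hypothetical toy vocabulary: "handle" is one part class shared by three objects.
toy_vocabulary = {
    "mug": ["handle", "body", "rim"],
    "bucket": ["handle", "body", "rim", "base"],
    "knife": ["handle", "blade"],
}
object_parts, parts = expand_object_parts(toy_vocabulary)
```

Here 5 shared part names expand to 9 object-part classes, in the same way that 200 part classes expand to 456 object-part classes in PACO.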

Attribute vocabulary selection

Attributes can be used to distinguish different objects
User study conducted to identify sufficient set of attributes
Final vocabulary of 29 colors, 10 patterns and markings, 13 materials and 3 levels of reflectance
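
The four attribute types add up to the 55 attributes quoted elsewhere in this summary. A small sketch of that arithmetic, assuming (as an illustration only) that the attributes are laid out as one flat label vector with a contiguous index range per type; the key names and the layout are hypothetical:

```python
# Sizes per attribute type, taken from the vocabulary described above.
ATTRIBUTE_TYPE_SIZES = {
    "color": 29,
    "pattern_marking": 10,
    "material": 13,
    "reflectance": 3,
}


def type_offsets(sizes):
    """Map each attribute type to its [start, end) range in a flat label vector."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = (start, start + size)
        start += size
    return offsets, start


offsets, total_attributes = type_offsets(ATTRIBUTE_TYPE_SIZES)
```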

Annotation pipeline

Object bounding box and mask annotation for Ego4D
Part mask annotation for both LVIS and Ego4D
Object and part attributes annotation for both LVIS and Ego4D
Instance IDs annotation for Ego4D
Negative images annotated for each object class
10-50 instances of each object annotated by expert annotators
90% of object classes have mIoU ≥ 0.75 with gold-annotated masks
Accuracy of 85% on gold annotations provided by expert annotators
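
The mask quality check above can be sketched as a per-class mean IoU against gold masks, followed by the fraction of classes that reach the 0.75 threshold. This is a minimal illustration, assuming masks are represented as sets of pixel coordinates; the helper names and toy masks are hypothetical and this is not the authors' quality-control code.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (row, col) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def fraction_of_classes_passing(pairs_by_class, threshold=0.75):
    """pairs_by_class maps a class name to a list of (annotated_mask, gold_mask)."""
    passing = 0
    for pairs in pairs_by_class.values():
        miou = sum(mask_iou(a, g) for a, g in pairs) / len(pairs)
        passing += miou >= threshold
    return passing / len(pairs_by_class)


def rect(r0, c0, r1, c1):
    """Helper for toy data: a filled rectangle as a set of pixels."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


gold = rect(0, 0, 4, 4)
toy_annotations = {
    "mug": [(rect(0, 0, 4, 4), gold)],    # identical to the gold mask
    "knife": [(rect(0, 0, 4, 2), gold)],  # covers only half of the gold mask
}
fraction = fraction_of_classes_passing(toy_annotations)
```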

Dataset statistics

PACO-LVIS has 641K part mask annotations
PACO-LVIS has a long-tail distribution of part masks
PACO has 10x more object instances with parts than PartImageNet
PACO has 297K part masks with attribute annotations
VAW has 261K object masks with attributes, but PACO has 421K
PACO has a larger density of attributes per image than VAW

Tasks and evaluation benchmark

Evaluate quality of parts segmentation and attributes prediction
Leverage parts and attributes for zero-shot object instance detection

Dataset splits

PACO-LVIS and PACO-EGO4D datasets are split into train, val and test sets.
Test split of PACO-LVIS contains 9443 images.
Train and val splits of PACO-LVIS contain 45790 and 2410 images respectively.
PACO-EGO4D is split into 15667 train, 825 val and 9892 test images.
Object instance IDs in Ego4D train and test sets are disjoint.

Federated dataset for object categories

Federated dataset from LVIS [15] has three types of images: negative, exhaustive positive, and non-exhaustive positive
Negative images do not contain any instance of the object
Exhaustive positive images have all instances of the object annotated
Non-exhaustive positive images have at least one instance of the object annotated, but not all instances
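
The three image types matter at evaluation time because they decide how an unmatched detection is treated. A minimal sketch of that logic, following the LVIS-style federated protocol as commonly described: unmatched detections count as false positives only where the category is known to be absent or exhaustively annotated. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryImageSets:
    """Per-category image ID sets in a federated dataset."""
    negative: set = field(default_factory=set)
    exhaustive_positive: set = field(default_factory=set)
    non_exhaustive_positive: set = field(default_factory=set)


def classify_unmatched_detection(image_id, sets):
    """Decide how to treat a detection of this category that matched no ground truth."""
    if image_id in sets.negative or image_id in sets.exhaustive_positive:
        # The category is verified absent, or all its instances are annotated.
        return "false_positive"
    if image_id in sets.non_exhaustive_positive:
        # The detection may be a real but unannotated instance.
        return "ignore"
    # The image is outside this category's evaluation set.
    return "skip"


mug_sets = CategoryImageSets(
    negative={1}, exhaustive_positive={2}, non_exhaustive_positive={3}
)
outcomes = [classify_unmatched_detection(i, mug_sets) for i in (1, 2, 3, 4)]
```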

Part segmentation

Algorithm detects and segments part-masks of different object instances in an unseen image
Assigns an (object, part) label with a confidence score to the part-mask
(Object, part) pairs are from a fixed known set
Evaluates task for (object, part) labels instead of only part labels
Uses mask and box Average Precision (AP) metrics defined in COCO
Computes AP for each (object, part) label at different IoU thresholds and reports the average
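
Averaging AP over IoU thresholds can be sketched for a single (object, part) class as follows. This is a simplified illustration, not the COCO evaluation code: it assumes each detection already comes with the IoU of its one matched ground-truth mask (0 if unmatched), whereas the real protocol redoes the matching at every threshold. The thresholds 0.50 to 0.95 in steps of 0.05 and the 101-point interpolation follow the usual COCO convention; the function names are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """AP from (score, is_true_positive) pairs using 101-point interpolation."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda x: -x[0])
    precisions, recalls, tp = [], [], 0
    for rank, (_, hit) in enumerate(ranked, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Interpolate: precision at a recall level is the best precision to its right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    total = 0.0
    for k in range(101):
        r = k / 100
        # Precision at the first operating point whose recall reaches r, else 0.
        total += next((p for p, rec in zip(precisions, recalls) if rec >= r), 0.0)
    return total / 101


def ap_over_iou_thresholds(detections, num_gt, thresholds=None):
    """detections: (score, iou_with_matched_gt) pairs for one (object, part) class."""
    if thresholds is None:
        thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    aps = [
        average_precision([(s, iou >= t) for s, iou in detections], num_gt)
        for t in thresholds
    ]
    return sum(aps) / len(aps)


# Toy case: the second detection is a hit at loose thresholds and a miss at strict ones.
toy_ap = ap_over_iou_thresholds([(0.9, 1.0), (0.8, 0.72)], num_gt=2)
```

In the toy case the averaged AP falls between the loose-threshold and strict-threshold values, which is why the averaged metric rewards well-localized masks.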

Instance-level attributes prediction

Trained a simple extension of Mask R-CNN and ViTDet models with an additional attribute head
Reported box AP values for models trained on PACO-LVIS
Attribute prediction is harder than object detection
Larger models fare better for this task
Analyzed sensitivity of $AP^{obj}_{att}$ to attribute prediction
Compared AR@1 on a subset of queries for zero-shot instance detection

Zero-shot instance detection

Generated benchmark numbers for this task by leveraging the jointly trained models described earlier
Used scores corresponding to object, part, object attributes, and part attributes to rank object bounding boxes
Combined scores using geometric mean to get one final score for each box
Results show performance ordered L1 > L3 > L2 across query levels, due to a trade-off between two opposing factors
Ablation studies measure importance of different object and part attributes
Trained two Mask R-CNN and two ViTDet models with 531 classes
Evaluated publicly available models from Detic and MDETR without further fine-tuning
Results show limited performance for evaluated models
PACO-EGO4D can serve as useful dataset for few-shot instance detection
Benchmarked naive 2-stage model for k ranging from 1-5 and compared to zero-shot model
20+ point gap between best zero-shot model and one-shot model, gap widens as k increases
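
The score combination used for ranking boxes can be sketched as a geometric mean over the per-term scores of a query. This is a minimal illustration, assuming each candidate box carries one score in [0, 1] per query term (object, part, object attribute, part attribute); the function names and toy scores are hypothetical.

```python
import math


def geometric_mean(scores):
    """Geometric mean of scores in [0, 1], computed in log space."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def rank_boxes(boxes):
    """boxes: list of (box_id, [score for each term of the query]), best first."""
    return sorted(boxes, key=lambda b: geometric_mean(b[1]), reverse=True)


candidates = [
    ("box_a", [0.9, 0.9, 0.1]),  # strong object and part scores, weak attribute score
    ("box_b", [0.6, 0.6, 0.6]),  # moderate on every term
]
ranking = [box_id for box_id, _ in rank_boxes(candidates)]
```

A geometric mean penalizes a box that fails any single term of the query, so box_b ranks above box_a even though box_a has the higher arithmetic mean.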

Conclusion

Introduced PACO, a dataset designed to enable research towards joint detection of objects, parts and attributes of common objects
75 common object categories spanning both image and video datasets
3 benchmark tasks which showcase unique challenges in the dataset
Manually defined parts with sample reference images
5 attribute types: color, shape, reflectance, materials and patterns & markings
98% of object instance pairs could be distinguished only using the 55 attributes included in PACO
Color is the biggest discriminative attribute type
Instances annotated with unique instance IDs
Distribution of instances across 75 object categories in PACO-LVIS and PACO-EGO4D
Models used to train the joint segmentation and attribute prediction models
Evaluated models on the test splits for both the datasets
Explored the effect of joint training on multiple tasks on object segmentation results
Calculated lower and upper bounds on $AP^{obj}_{att}$
Part association process to associate part boxes to objects
