Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Object models are moving from predicting category labels to providing detailed descriptions of object instances.
- PACO is a dataset that provides part masks, attributes, and object categories across image and video datasets.
- PACO contains 641K part masks, 260K object boxes, and 55 attributes.
- Evaluation metrics and benchmark results are provided for 3 tasks on the dataset.
- Dataset, models, and code are open-sourced.
Paper Content
Introduction
- Tasks requiring fine-grained understanding of objects are gaining importance
- Representing objects through category labels is no longer sufficient
- No large benchmark datasets for common objects with joint annotation of part masks, object attributes and part attributes
- Datasets with part masks for common objects are limited
- Large-scale datasets with object-level attributes exist, but none with part-level attribute annotations
- PACO dataset provides 641K part masks annotated in 77K images for 260K object instances
- PACO provides annotations for 55 different attributes for both objects and parts
- PACO dataset and benchmarks open-sourced
Related work
- Availability of large-scale datasets has accelerated object understanding
- Popular benchmark datasets for object detection and segmentation include COCO, LVIS, Object365, Open Images and Pascal
- Domain-specific datasets for fashion, medical images and OCR
- LVIS introduced federated annotations to scale to larger vocabularies
- Part datasets provide pixel-level part annotations for common objects
- Attributes have long been used to describe objects
- Visual Attributes in the Wild (VAW) dataset for attribute classification
- Fashionpedia dataset for fashion provides part and attribute annotations
- PACO aims to generalize this to common object categories
- Attributes used for zero-shot object recognition
- PACO benchmarks part and attribute based queries
Dataset construction
Image sources
- PACO is constructed from two sources: LVIS in the image domain and Ego4D in the video domain.
- LVIS was chosen due to its large object vocabulary and federated dataset construction.
- Ego4D was chosen due to its temporally aligned narrations.
Object vocabulary selection
- Mined object categories from narrations accompanying Ego4D
- Intersected with common and frequent categories in LVIS
- Chose categories with at least 20 instances in Ego4D
- Resulted in 75 categories commonly found in both LVIS and Ego4D
Parts vocabulary selection
- No exhaustive ontology of parts for common objects
- Part names mined from web-images
- Part names manually curated to retain visible and distinguishable parts
- 200 part classes shared across 75 objects, 456 object-part classes when expanded
Attribute vocabulary selection
- Attributes can be used to distinguish different objects
- User study conducted to identify sufficient set of attributes
- Final vocabulary of 29 colors, 10 patterns and markings, 13 materials and 3 levels of reflectance
Annotation pipeline
- Object bounding box and mask annotation for Ego4D
- Part mask annotation for both LVIS and Ego4D
- Object and part attributes annotation for both LVIS and Ego4D
- Instance IDs annotation for Ego4D
- Negative images annotated for each object class
- 10-50 instances of each object annotated by expert annotators
- 90% of object classes have mIoU โฅ 0.75 with gold-annotated masks
- Accuracy of 85% on gold annotations provided by expert annotators
Dataset statistics
- PACO-LVIS has 641K part mask annotations
- PACO-LVIS has a long-tail distribution of part masks
- PACO has 10x more object instances with parts than PartsImageNet
- PACO has 297K part masks with attribute annotations
- VAW has 261K object masks with attributes, but PACO has 421K
- PACO has a larger density of attributes per image than VAW
Tasks and evaluation benchmark
- Evaluate quality of parts segmentation and attributes prediction
- Leverage parts and attributes for zero-shot object instance detection
Dataset splits
- PACO-LVIS and PACO-EGO4D datasets are split into train, val and test sets.
- Test split of PACO-LVIS contains 9443 images.
- Train and val splits of PACO-LVIS contain 45790 and 2410 images respectively.
- Ego4D is split into 15667 train, 825 val and 9892 test images.
- Object instance IDs in Ego4D train and test sets are disjoint.
Federated dataset for object categories
- Federated dataset from LVIS [15] has three types of images: negative, exhaustive positive, and non-exhaustive positive
- Negative images do not contain any instance of the object
- Exhaustive positive images have all instances of the object annotated
- Non-exhaustive positive images have at least one instance of the object annotated, but not all instances
Part segmentation
- Algorithm detects and segments part-masks of different object instances in an unseen image
- Assigns an (object, part) label with a confidence score to the part-mask
- (Object, part) pairs are from a fixed known set
- Evaluates task for (object, part) labels instead of only part labels
- Uses mask and box Average Precision (AP) metrics defined in COCO
- Computes AP for (c, a) at different IoU thresholds and reports the average
Instance-level attributes prediction
- Trained a simple extension of Mask R-CNN and ViTdet models with an additional attribute head
- Reported box AP values for models trained on PACO-LVIS
- Attribute prediction is harder than object detection
- Larger models fare better for this task
- Analyzed sensitivity of AP obj attr to attribute prediction
- Compared AR@1 on a subset of queries for zero-shot instance detection
Zero-shot instance detection
- Generated benchmark numbers for task by leveraging models trained in section
- Used scores corresponding to object, part, object attributes, and part attributes to rank object bounding boxes
- Combined scores using geometric mean to get one final score for each box
- Results show L1 > L3 > L2 due to trade-off between two opposing factors
- Ablation studies measure importance of different object and part attributes
- Trained two mask R-CNN and two ViT-det models with 531 classes
- Evaluated publicly available models from Detic and MDETR without further fine-tuning
- Results show limited performance for evaluated models
- PACO-EGO4D can serve as useful dataset for few-shot instance detection
- Benchmarked naive 2-stage model for k ranging from 1-5 and compared to zero-shot model
- 20+ point gap between best zero-shot model and one-shot model, gap widens as k increases
Conclusion
- Introduced PACO, a dataset designed to enable research towards joint detection of objects, parts and attributes of common objects
- 75 common object categories spanning both image and video datasets
- 3 benchmark tasks which showcase unique challenges in the dataset
- Manually defined parts with sample reference images
- 5 attribute types: color, shape, reflectance, materials and patterns & marking
- 98% of object instance pairs could be distinguished only using the 55 attributes included in PACO
- Color is the biggest discriminative attribute type
- Instances annotated with unique instance IDs
- Distribution of instances across 75 object categories in PACO-LVIS and PACO-EGO4D
- Models used to train the joint segmentation and attribute prediction models
- Evaluated models on the test splits for both the datasets
- Explored the effect of joint training on multiple tasks on object segmentation results
- Calculated lower and upper bounds on AP obj att
- Part association process to associate part boxes to objects