Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Cut-and-LEaRn (CutLER) is a computer science approach for training unsupervised object detection and segmentation models.
- CutLER uses a MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks.
- CutLER is simpler, compatible with different detection architectures, and detects multiple objects.
- CutLER improves detection performance AP50 by over 2.7 times on 11 benchmarks.
- With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.
Paper Content
Introduction
- Object localization is a critical task in computer vision
- Training models for localization require special annotations
- Annotating images in the COCO dataset with masks took 28K human hours
- Cut-and-LEaRn (CutLER) is an unsupervised object detection and instance segmentation model
- CutLER is trained exclusively on unlabeled ImageNet data
- CutLER consists of three simple, architecture-and data-agnostic mechanisms
- CutLER can be directly employed to perform complex segmentation and detection tasks
- CutLER is simple to train and agnostic to the choice of detection and backbone architectures
- CutLER trained solely on ImageNet shows strong zero-shot performance on 11 different benchmarks
- CutLER exhibits strong robustness against domain shifts
Related work
Method
Preliminaries
- NCut is a graph partitioning task used for image segmentation
- A graph is constructed with nodes representing images and edges connecting nodes with weights measuring similarity
- NCut minimizes the cost of partitioning the graph into two sub-graphs
- DINO and TokenCut use the similarity of image patches in the DINO feature space as the similarity weight for NCut
- TokenCut only computes a single binary mask for an image and thus only finds one object per image
Maskcut for discovering multiple objects
- Vanilla NCut is limited to discovering a single object in an image
- Mask-Cut extends NCut to discover multiple objects per image
- To determine which group corresponds to the foreground, two criteria are used: 1) the foreground patches should be more prominent than background patches and 2) the foreground set should contain less than two of the four corners
- MaskCut creates a patch-wise similarity matrix for the image using a self-supervised DINO model’s features
- MaskCut applies Normalized Cuts to the matrix and obtains a single foreground object mask of the image
- MaskCut masks out the affinity matrix values using the foreground mask and repeats the process to discover multiple object masks in a single image
Droploss for exploring image regions
- Standard detection loss penalizes predicted regions that do not overlap with the ‘ground-truth’.
- Standard loss does not enable the detector to discover new instances not labeled in the ‘ground-truth’.
- During training, loss is dropped for predicted regions with a maximum overlap of τ IoU with any of the ‘ground-truth’ instances.
- Low threshold τ IoU = 0.01 is used.
Multi-round self-training
- MaskCut produces coarse masks which are used to train detection models
- Detection models refine mask quality and DropLoss encourages them to discover new object masks
- Self-training is used to improve detector performance, using predicted masks and proposals with a confidence score over 0.75−0.5t
- Ground-truth masks with an IoU > 0.5 with the predicted masks are filtered out
- Three rounds of self-training are sufficient to obtain good performance
Implementation details
- Training data is from ImageNet dataset
- MaskCut uses ViT-B/8 and DINO model
- CRF used to post-process masks and compute bounding boxes
- Detector is Mask R-CNN or Cascade Mask R-CNN
- Copy-paste augmentation used during training
- Self-training used to initialize detection model
- CutLER outperforms previous SOTA method
Experiments
- CutLER is evaluated on various detection and segmentation benchmarks
- CutLER can discover objects without any supervision on unseen images
- CutLER outperforms prior methods that use in-domain training data
- Finetuning CutLER further improves detection performance, outperforming prior work
Unsupervised zero-shot evaluations
- CutLER is a universal unsupervised object detection and segmentation method
- CutLER is trained solely using images from ImageNet
- CutLER is evaluated in a zero-shot manner on all downstream datasets
- CutLER is evaluated using class-agnostic detection evaluation
- CutLER significantly outperforms prior work on 11 benchmarks
- CutLER improves performance by over 4x and 2x on different domains
Label-efficient and fully-supervised learning
- CutLER is a pretraining method for training object detection and instance segmentation models
- CutLER can discover objects without any supervision
- Finetuning CutLER on a target dataset aligns the model output to the same set of objects labeled in the dataset
- We use CutLER to initialize a standard Cascade Mask R-CNN with a ResNet50
- We train the detector on the COCO dataset using the bounding box and instance mask labels
- We subsample the training set to create subsets with varying proportions of labeled images
- We train the detector, initialized with CutLER, on each of these subsets
- We use MoCo-v2 as a baseline
- CutLER outperforms FreeSOLO and DETReg on this benchmark
- We analyze the design decisions in CutLER
- We use similar settings to train CutLER only on ImageNet
- We evaluate our model primarily on the COCO and UVO unsupervised detection benchmarks
- We study the impact of each component of CutLER
- We compare CutLER to TokenCut
- We understand the impact of the number of masks per image generated by MaskCut
- We vary the IOU threshold used for DropLoss
- We analyze the impact of self-training
- We use different detector architectures for training CutLER and measure their performance
- We study the impact of the dataset used for pretraining the self-supervised DINO model and training the CutLER model
- We evaluate CutLER on 11 benchmarks across various domains