Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Cut-and-LEaRn (CutLER) is a computer science approach for training unsupervised object detection and segmentation models.
CutLER uses a MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks.
CutLER is simpler, compatible with different detection architectures, and detects multiple objects.
CutLER improves detection performance AP50 by over 2.7 times on 11 benchmarks.
With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.

Paper Content

Introduction

Object localization is a critical task in computer vision
Training models for localization require special annotations
Annotating images in the COCO dataset with masks took 28K human hours
Cut-and-LEaRn (CutLER) is an unsupervised object detection and instance segmentation model
CutLER is trained exclusively on unlabeled ImageNet data
CutLER consists of three simple, architecture-and data-agnostic mechanisms
CutLER can be directly employed to perform complex segmentation and detection tasks
CutLER is simple to train and agnostic to the choice of detection and backbone architectures
CutLER trained solely on ImageNet shows strong zero-shot performance on 11 different benchmarks
CutLER exhibits strong robustness against domain shifts

Method

Preliminaries

NCut is a graph partitioning task used for image segmentation
A graph is constructed with nodes representing images and edges connecting nodes with weights measuring similarity
NCut minimizes the cost of partitioning the graph into two sub-graphs
DINO and TokenCut use the similarity of image patches in the DINO feature space as the similarity weight for NCut
TokenCut only computes a single binary mask for an image and thus only finds one object per image

Maskcut for discovering multiple objects

Vanilla NCut is limited to discovering a single object in an image
Mask-Cut extends NCut to discover multiple objects per image
To determine which group corresponds to the foreground, two criteria are used: 1) the foreground patches should be more prominent than background patches and 2) the foreground set should contain less than two of the four corners
MaskCut creates a patch-wise similarity matrix for the image using a self-supervised DINO model’s features
MaskCut applies Normalized Cuts to the matrix and obtains a single foreground object mask of the image
MaskCut masks out the affinity matrix values using the foreground mask and repeats the process to discover multiple object masks in a single image

Droploss for exploring image regions

Standard detection loss penalizes predicted regions that do not overlap with the ‘ground-truth’.
Standard loss does not enable the detector to discover new instances not labeled in the ‘ground-truth’.
During training, loss is dropped for predicted regions with a maximum overlap of τ IoU with any of the ‘ground-truth’ instances.
Low threshold τ IoU = 0.01 is used.

Multi-round self-training

MaskCut produces coarse masks which are used to train detection models
Detection models refine mask quality and DropLoss encourages them to discover new object masks
Self-training is used to improve detector performance, using predicted masks and proposals with a confidence score over 0.75−0.5t
Ground-truth masks with an IoU > 0.5 with the predicted masks are filtered out
Three rounds of self-training are sufficient to obtain good performance

Implementation details

Training data is from ImageNet dataset
MaskCut uses ViT-B/8 and DINO model
CRF used to post-process masks and compute bounding boxes
Detector is Mask R-CNN or Cascade Mask R-CNN
Copy-paste augmentation used during training
Self-training used to initialize detection model
CutLER outperforms previous SOTA method

Experiments

CutLER is evaluated on various detection and segmentation benchmarks
CutLER can discover objects without any supervision on unseen images
CutLER outperforms prior methods that use in-domain training data
Finetuning CutLER further improves detection performance, outperforming prior work

Unsupervised zero-shot evaluations

CutLER is a universal unsupervised object detection and segmentation method
CutLER is trained solely using images from ImageNet
CutLER is evaluated in a zero-shot manner on all downstream datasets
CutLER is evaluated using class-agnostic detection evaluation
CutLER significantly outperforms prior work on 11 benchmarks
CutLER improves performance by over 4x and 2x on different domains

Label-efficient and fully-supervised learning

CutLER is a pretraining method for training object detection and instance segmentation models
CutLER can discover objects without any supervision
Finetuning CutLER on a target dataset aligns the model output to the same set of objects labeled in the dataset
We use CutLER to initialize a standard Cascade Mask R-CNN with a ResNet50
We train the detector on the COCO dataset using the bounding box and instance mask labels
We subsample the training set to create subsets with varying proportions of labeled images
We train the detector, initialized with CutLER, on each of these subsets
We use MoCo-v2 as a baseline
CutLER outperforms FreeSOLO and DETReg on this benchmark
We analyze the design decisions in CutLER
We use similar settings to train CutLER only on ImageNet
We evaluate our model primarily on the COCO and UVO unsupervised detection benchmarks
We study the impact of each component of CutLER
We compare CutLER to TokenCut
We understand the impact of the number of masks per image generated by MaskCut
We vary the IOU threshold used for DropLoss
We analyze the impact of self-training
We use different detector architectures for training CutLER and measure their performance
We study the impact of the dataset used for pretraining the self-supervised DINO model and training the CutLER model
We evaluate CutLER on 11 benchmarks across various domains

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Method#

Preliminaries#

Maskcut for discovering multiple objects#

Droploss for exploring image regions#

Multi-round self-training#

Implementation details#

Experiments#

Unsupervised zero-shot evaluations#

Label-efficient and fully-supervised learning#

Link to paper

Abstract

Paper Content

Introduction

Related work

Method

Preliminaries

Maskcut for discovering multiple objects

Droploss for exploring image regions

Multi-round self-training

Implementation details

Experiments

Unsupervised zero-shot evaluations

Label-efficient and fully-supervised learning