Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Image segmentation is usually addressed by training a model for a fixed set of object classes
Proposed system can generate image segmentations based on arbitrary prompts at test time
Prompts can be either text or an image
Unified model trained once for three common segmentation tasks
System generates binary segmentation map for an image based on a free-text prompt or an additional image

Paper Content

Introduction

Ability to generalize to unseen data is important for AI applications
Humans excel at this task, but it is challenging for computer vision systems
Image segmentation requires predicting not only what can be seen in an image but also where it can be found
Classical models are limited to segmenting seen categories
Different approaches have emerged to extend this setting: referring expression segmentation, zero-shot segmentation and one-shot segmentation
CLIPSeg model can address all three tasks and process prompts in text or image form

Related work

Modern vision systems are commonly pretrained on a large-scale dataset
CLIP is a foundation model that has demonstrated excellent performance on image classification tasks
Transformers have been used for segmentation in TransUNet, SETR, Segformer, and Segmenter
CLIPSeg extends CLIP with a transformer-based decoder

Referring expression segmentation

Evaluated referring expression segmentation performance on PhraseCut dataset
Compared to scores reported by Wu et al. and MDETR
Trained CLIPSeg on PhraseCut dataset with text labels and visual samples
Outperformed two-stage HULANet approach by Wu et al.
Performance worse than MDETR
ViTSeg baseline performs worse than CLIPSeg
Zero-shot segmentation goal is to segment objects of unseen categories
Several methods address bias favoring seen classes
One-shot semantic segmentation task introduced in 2017
Several approaches focus on modeling prototypes
Multiple derivative works of CLIP across different sub-fields

CLIPSeg method

We use the visual transformer-based CLIP model (ViT-B/16) as a backbone for segmentation.
We extend the model with a small, parameter-efficient transformer decoder.
We use skip connections to the CLIP encoder to keep the decoder compact.
We modulate the decoder’s input activation with a conditional vector.
We interpolate between image and text embeddings as a data augmentation strategy.
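
The last two points can be illustrated with a small sketch. It is not the authors' code: it assumes the conditional vector is a linear blend of a text embedding and an image embedding, and that the modulation is FiLM-style (a per-feature scale and shift computed from the conditional vector). The function names, the blend weight `a` and the toy weights are hypothetical.

```python
def interpolate_condition(text_emb, image_emb, a):
    """Blend two embeddings: a=0 gives the text embedding, a=1 the image one."""
    if len(text_emb) != len(image_emb):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - a) * t + a * v for t, v in zip(text_emb, image_emb)]


def linear(weights, bias, x):
    """Matrix-vector product plus bias (stand-in for a learned linear layer)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(activation, cond, w_scale, b_scale, w_shift, b_shift):
    """FiLM-style modulation (assumed): activation * scale(cond) + shift(cond)."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [s * h + t for s, h, t in zip(scale, activation, shift)]


# Toy usage: 2-dimensional embeddings, identity scale layer, constant shift.
cond = interpolate_condition([1.0, 0.0], [0.0, 1.0], a=0.25)
modulated = film(
    activation=[2.0, 4.0],
    cond=cond,
    w_scale=[[1.0, 0.0], [0.0, 1.0]], b_scale=[0.0, 0.0],
    w_shift=[[0.0, 0.0], [0.0, 0.0]], b_shift=[1.0, 1.0],
)
```

In a real model the scale and shift layers would be learned and the embeddings would come from the CLIP text and image encoders; blending them during training is what lets one decoder accept either kind of prompt.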

PhraseCut + visual prompts (PC+)

PhraseCut dataset contains 340,000 phrases with corresponding image segmentations
Extended dataset (PhraseCut+) includes visual support samples and negative samples
Negative samples replace the phrase with a different phrase with probability q_neg
Phrases augmented randomly using fixed prefixes
Images randomly cropped, taking object locations into account
PhraseCut+ supports training using image-text interpolation
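
The text side of this sampling scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the dataset's actual code: the prefix list, the function name `make_sample` and the idea that a negative sample is paired with an empty target mask are all hypothetical.

```python
import random

# Hypothetical fixed prefixes; the summary does not list the real ones.
PREFIXES = ["a photo of a ", "an image of a ", ""]


def make_sample(phrase, all_phrases, q_neg, rng):
    """Return (prompt, is_negative) for one training example.

    With probability q_neg the phrase is swapped for a different one.
    For such a negative sample the target mask would presumably be empty
    (an assumption of this sketch).
    """
    is_negative = rng.random() < q_neg
    if is_negative:
        candidates = [p for p in all_phrases if p != phrase]
        if candidates:
            phrase = rng.choice(candidates)
        else:
            is_negative = False  # nothing to swap in, keep the original phrase
    prompt = rng.choice(PREFIXES) + phrase
    return prompt, is_negative


phrases = ["red car", "tall tree", "wooden bench"]
rng = random.Random(0)  # seeded so runs are repeatable
prompt, is_negative = make_sample("red car", phrases, q_neg=0.2, rng=rng)
```

Passing the random generator explicitly keeps the augmentation reproducible, which is convenient when comparing training runs.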

Visual prompt engineering

Conventional CNN-based one-shot semantic segmentation uses masked pooling to compute a prototype vector.
Transformer-based architectures cannot use this method directly.
A simple experiment was conducted to learn how target information can be incorporated into CLIP.
The experiment compared the cosine distance between visual and text-based embeddings.
Three image operations were identified to improve the alignment between the object text prompts and the images: decreasing background brightness, blurring the background, and cropping to the object.
The combination of all three performed best.
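
The three image operations can be sketched on a tiny grayscale image stored as nested lists. This is only an illustration: the dimming factor, the 3x3 box blur, the crop margin and the order in which the operations are applied are assumptions of this sketch, not values taken from the paper.

```python
def dim_background(image, mask, factor=0.1):
    """Scale down the brightness of every pixel outside the mask."""
    return [[px if m else px * factor for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def box_blur(image):
    """3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [image[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def blur_background(image, mask):
    """Keep masked pixels sharp and replace the rest with blurred values."""
    blurred = box_blur(image)
    return [[px if m else b for px, b, m in zip(row, brow, mrow)]
            for row, brow, mrow in zip(image, blurred, mask)]


def crop_to_object(image, mask, margin=0):
    """Crop to the bounding box of the mask, padded by `margin` pixels."""
    ys = [y for y, mrow in enumerate(mask) if any(mrow)]
    xs = [x for mrow in mask for x, m in enumerate(mrow) if m]
    if not ys:
        return image  # empty mask: nothing to crop to
    y0, y1 = max(0, min(ys) - margin), min(len(image), max(ys) + 1 + margin)
    x0, x1 = max(0, min(xs) - margin), min(len(image[0]), max(xs) + 1 + margin)
    return [row[x0:x1] for row in image[y0:y1]]


def visual_prompt(image, mask, factor=0.1, margin=1):
    """Apply all three operations: blur, then dim, then crop (assumed order)."""
    out = blur_background(image, mask)
    out = dim_background(out, mask, factor)
    return crop_to_object(out, mask, margin)
```

The resulting image would then be passed through the CLIP image encoder to obtain the visual conditional vector; the mask never enters the network directly, which is what makes this usable with a transformer backbone.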

Experiments

Evaluated model on three established segmentation benchmarks
Model can be conditioned on either text or image prompts
Model trained to generate binary predictions to indicate where objects matching the query are located
Metrics used: IoU, mean IoU, binary IoU, AP
Two baselines: CLIP-Deconv and ViTSeg
PyTorch used for training
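
The metrics listed above all build on intersection over union (IoU) between a predicted and a ground-truth mask. The sketch below shows the basic computation for a single mask pair plus a plain average; the 0.5 threshold, the convention for two empty masks and the function names are assumptions, and the exact way the paper aggregates its IoU variants is not reproduced here.

```python
def mask_iou(pred, target, threshold=0.5):
    """IoU between a thresholded prediction and a binary ground-truth mask."""
    inter = 0
    union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            p_on = p > threshold
            t_on = bool(t)
            inter += p_on and t_on
            union += p_on or t_on
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def average_iou(ious):
    """Arithmetic mean of a non-empty list of IoU scores."""
    return sum(ious) / len(ious)


pred = [[0.9, 0.2],
        [0.8, 0.1]]
target = [[1, 0],
          [0, 0]]
score = mask_iou(pred, target)  # one shared pixel out of two active pixels
```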

Generalized zero-shot segmentation

Generalized zero-shot segmentation involves categories that have never been seen before.
Evaluated using Pascal-VOC benchmark with 2-10 unseen classes.
Model trained on foreground/background segmentation, adapted to multi-label setting.
Pascal classes removed from dataset by assigning to WordNet synsets and removing prompts with invalid words.
Results indicate a gap between seen and unseen classes; the model performs better on unseen classes.

One-shot semantic segmentation

Single example image and mask presented to network
Regions of class highlighted in example image must be found in query image
Cannot rely on text label, must understand support image

Ablation study

Conducted an ablation study on PhraseCut
Evaluated text-based and visual prompt-based performance separately
Performance drops when random weights instead of CLIP weights are used
Performance decreases substantially when number of parameters is reduced to 16

Conclusion

CLIPSeg image segmentation approach can be adapted to new tasks with text or image prompts
Demonstrated competitive performance on referring expression, zero-shot and one-shot image segmentation tasks
Model generalizes to novel prompts involving affordances and properties
Tackling multiple tasks is a promising direction for future research
Experiments limited to small number of benchmarks
Model depends on a large-scale dataset for pre-training
Model focuses on images, not video
Image size may vary within certain limits
Comparison of different text prompts, object sizes and object classes
Model performs better on larger objects
Performance over different classes is fairly balanced
Model enables adaptation to new tasks without energy-intensive training
