Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Image segmentation is usually addressed by training a model for a fixed set of object classes
  • System proposed can generate image segmentations based on arbitrary prompts at test time
  • Prompts can be either text or an image
  • Unified model trained once for three common segmentation tasks: referring expression, zero-shot, and one-shot segmentation
  • System generates binary segmentation map for an image based on a free-text prompt or an additional image

Paper Content

Introduction

  • Ability to generalize to unseen data is important for AI applications
  • Humans excel at this task, but it is challenging for computer vision systems
  • Image segmentation requires predicting not only what is in an image but also where it can be found
  • Classical models are limited to segmenting seen categories
  • Different approaches have emerged to extend this setting
  • CLIPSeg model can address all three tasks and process prompts in text or image form
  • Modern vision systems are commonly pretrained on a large-scale dataset
  • CLIP is a foundation model that has demonstrated excellent performance on image classification tasks
  • Transformers have been used for segmentation in TransUNet, SETR, Segformer, and Segmenter
  • CLIPSeg extends CLIP with a transformer-based decoder

Referring expression segmentation

  • Evaluated referring expression segmentation performance on PhraseCut dataset
  • Compared to scores reported by Wu et al. and MDETR
  • Trained CLIPSeg on PhraseCut dataset with text labels and visual samples
  • Outperformed two-stage HULANet approach by Wu et al.
  • Performance worse than MDETR
  • ViTSeg baseline performs worse than CLIPSeg
  • Zero-shot segmentation goal is to segment objects of unseen categories
  • Several methods address bias favoring seen classes
  • One-shot semantic segmentation task introduced in 2017
  • Several approaches focus on modeling prototypes
  • Multiple derivative works of CLIP across different sub-fields

CLIPSeg method

  • We use the vision transformer-based CLIP model (ViT-B/16) as a backbone for segmentation.
  • We extend the model with a small, parameter-efficient transformer decoder.
  • We use skip connections to the CLIP encoder to keep the decoder compact.
  • We modulate the decoder’s input activation with a conditional vector.
  • We interpolate between image and text embeddings as a data augmentation strategy.
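
The last two points can be illustrated with a small sketch. It assumes the conditioning is a feature-wise scale and shift (FiLM-style) computed from the conditional vector, and that the interpolation is a plain linear blend. The function names, the weight matrices `w_scale` and `w_shift`, and the toy 2-dimensional embeddings are hypothetical and are not taken from the paper's code.

```python
def interpolate_condition(text_emb, image_emb, a):
    """Linearly blend a text embedding and an image embedding.

    a = 0.0 returns the text embedding, a = 1.0 the image embedding.
    """
    if len(text_emb) != len(image_emb):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - a) * t + a * v for t, v in zip(text_emb, image_emb)]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def film_modulate(tokens, cond, w_scale, w_shift):
    """Assumed FiLM-style modulation: token * scale(cond) + shift(cond).

    tokens: list of token activation vectors fed to the decoder.
    cond:   conditional vector (text, image, or interpolated embedding).
    """
    scale = matvec(w_scale, cond)
    shift = matvec(w_shift, cond)
    return [[s * x + b for x, s, b in zip(tok, scale, shift)] for tok in tokens]


# Toy example with 2-dimensional embeddings (hypothetical values).
text_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]
cond = interpolate_condition(text_emb, image_emb, 0.25)

identity = [[1.0, 0.0], [0.0, 1.0]]   # scale projection
zeros = [[0.0, 0.0], [0.0, 0.0]]      # shift projection
tokens = [[2.0, 4.0], [1.0, 1.0]]
modulated = film_modulate(tokens, cond, identity, zeros)
```

Because the same conditional vector drives the modulation regardless of where it came from, a text embedding, an image embedding, or a blend of the two can be used interchangeably at this point in the sketch.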

PhraseCut + visual prompts (PC+)

  • PhraseCut dataset contains 340,000 phrases with corresponding image segmentations
  • Extended dataset (PhraseCut+) includes visual support samples and negative samples
  • Negative samples replace the phrase with a different phrase with probability q_neg
  • Phrases augmented randomly using fixed prefixes
  • Images randomly cropped, taking object locations into account
  • PhraseCut+ supports training using image-text interpolation
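
The phrase-side augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' data loader: the prefix strings are hypothetical (the summary only says "fixed prefixes"), and the empty target mask for a negative sample is an assumption about what "negative" means here.

```python
import random

# Hypothetical prefixes; the source only states that fixed prefixes are used.
PREFIXES = ["a photo of a ", "an image of a ", ""]


def make_sample(phrase, mask, all_phrases, q_neg, rng):
    """Return (prompt, target_mask) for one training example.

    With probability q_neg the phrase is swapped for a different one,
    and the target becomes an all-zero mask (assumed convention).
    """
    others = [p for p in all_phrases if p != phrase]
    if others and rng.random() < q_neg:
        phrase = rng.choice(others)
        mask = [[0] * len(row) for row in mask]
    prompt = rng.choice(PREFIXES) + phrase
    return prompt, mask


phrases = ["wooden table", "red car", "green grass"]
mask = [[0, 1], [1, 1]]
rng = random.Random(0)
prompt, target = make_sample("red car", mask, phrases, q_neg=0.2, rng=rng)
```

Passing the random generator in explicitly keeps the sampling reproducible, which is convenient when comparing runs with different values of `q_neg`.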

Visual prompt engineering

  • Conventional CNN-based one-shot semantic segmentation uses masked pooling to compute a prototype vector.
  • Transformer-based architectures cannot use this method directly.
  • A simple experiment was conducted to learn how target information can be incorporated into CLIP.
  • The experiment compared the cosine distance between visual and text-based embeddings.
  • Three image operations were identified to improve the alignment between the object text prompts and the images: decreasing background brightness, blurring the background, and cropping to the object.
  • The combination of all three performed best.
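
The three image operations can be sketched on a toy grayscale image stored as nested lists. This is a simplified illustration under several assumptions: a 3x3 mean filter stands in for whatever blur the authors used, the darkening factor is arbitrary, the crop is a tight bounding box, and the order in which the operations are combined is chosen here for convenience. All function names are hypothetical.

```python
def darken_background(image, mask, factor=0.1):
    """Scale background pixels (mask == 0) by `factor`; keep the object."""
    return [[px if m else px * factor for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def box_blur(image):
    """3x3 mean filter; border pixels average over in-bounds neighbours only."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def blur_background(image, mask):
    """Replace background pixels with their blurred values."""
    blurred = box_blur(image)
    return [[px if m else b for px, b, m in zip(img_row, blur_row, mask_row)]
            for img_row, blur_row, mask_row in zip(image, blurred, mask)]


def crop_to_object(image, mask):
    """Crop to the tight bounding box of the mask."""
    rows = [y for y, r in enumerate(mask) if any(r)]
    cols = [x for x in range(len(mask[0])) if any(r[x] for r in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def visual_prompt(image, mask, factor=0.1):
    """Combine all three operations (order is an assumption)."""
    out = blur_background(image, mask)
    out = darken_background(out, mask, factor)
    return crop_to_object(out, mask)


image = [[100, 100, 100, 100],
         [100, 200, 200, 100],
         [100, 200, 200, 100],
         [100, 100, 100, 100]]
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
prompt_image = visual_prompt(image, mask)
```

In practice the crop would likely keep some margin around the object so that the darkened, blurred context remains visible; the tight crop here only keeps the example short.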

Experiments

  • Evaluated model on three established segmentation benchmarks
  • Model can be conditioned on either text or image prompts
  • Model trained to generate binary predictions to indicate where objects matching the query are located
  • Metrics used: IoU, mean IoU, binary IoU, AP
  • Two baselines: CLIP-Deconv and ViTSeg
  • PyTorch used for training

Generalized zero-shot segmentation

  • Generalized zero-shot segmentation involves categories that have never been seen before.
  • Evaluated using Pascal-VOC benchmark with 2-10 unseen classes.
  • Model trained on foreground/background segmentation, adapted to multi-label setting.
  • Pascal classes removed from dataset by assigning to WordNet synsets and removing prompts with invalid words.
  • Results indicate a gap between seen and unseen classes; relative to the compared methods, the model fares better on unseen classes than on seen ones.
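
One way a binary foreground/background model could be adapted to a multi-label setting is sketched below. This is an assumption for illustration only, since the summary does not say how the adaptation is done: the model is queried once per class name, and each pixel takes the class with the highest score, or a background label if no score reaches a threshold. The function name, the threshold, and the toy probabilities are hypothetical.

```python
def combine_binary_maps(prob_maps, threshold=0.5, background="background"):
    """Merge per-class probability maps into one label map.

    prob_maps: dict mapping class name -> 2D list of per-pixel probabilities,
               one map per text prompt.
    """
    if not prob_maps:
        raise ValueError("need at least one class map")
    names = list(prob_maps)
    first = prob_maps[names[0]]
    height, width = len(first), len(first[0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(names, key=lambda n: prob_maps[n][y][x])
            if prob_maps[best][y][x] >= threshold:
                row.append(best)
            else:
                row.append(background)
        labels.append(row)
    return labels


prob_maps = {
    "cat": [[0.9, 0.2, 0.1]],
    "dog": [[0.3, 0.6, 0.4]],
}
label_map = combine_binary_maps(prob_maps)
```

With this scheme, adding an unseen class only requires one more prompt and one more probability map; no retraining step appears in the sketch.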

One-shot semantic segmentation

  • Single example image and mask presented to network
  • Regions of class highlighted in example image must be found in query image
  • Cannot rely on text label, must understand support image

Ablation study

  • Conducted an ablation study on PhraseCut
  • Evaluated text-based and visual prompt-based performance separately
  • Performance drops when random weights instead of CLIP weights are used
  • Performance decreases substantially when the decoder is shrunk to a projection dimension of 16 (D = 16)

Conclusion

  • CLIPSeg image segmentation approach can be adapted to new tasks with text or image prompts
  • Demonstrated competitive performance on referring expression, zero-shot and one-shot image segmentation tasks
  • Model generalizes to novel prompts involving affordances and properties
  • Tackling multiple tasks is a promising direction for future research
  • Experiments are limited to a small number of benchmarks
  • Model depends on a large-scale dataset for pre-training
  • Model focuses on images, not video
  • Image size may vary within certain limits
  • Further analysis compares performance across different text prompts, object sizes, and object classes
  • Model performs better on larger objects
  • Performance over different classes is fairly balanced
  • Model enables adaptation to new tasks without energy-intensive training