Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Image segmentation is usually addressed by training a model for a fixed set of object classes
- Proposed system can generate image segmentations based on arbitrary prompts at test time
- Prompts can be either text or an image
- Unified model trained once for three common segmentation tasks
- System generates binary segmentation map for an image based on a free-text prompt or an additional image
Paper Content
Introduction
- Ability to generalize to unseen data is important for AI applications
- Humans excel at this task, but it is challenging for computer vision systems
- Image segmentation requires predicting not only what can be seen in an image but also where it can be found
- Classical models are limited to segmenting seen categories
- Different approaches have emerged to extend this setting
- CLIPSeg model can address all three tasks (referring expression, zero-shot and one-shot segmentation) and process prompts in text or image form
Related work
- Modern vision systems are commonly pretrained on a large-scale dataset
- CLIP is a foundation model that has demonstrated excellent performance on image classification tasks
- Transformers have been used for segmentation in TransUNet, SETR, SegFormer, and Segmenter
- CLIPSeg extends CLIP with a transformer-based decoder
Referring expression segmentation
- Evaluated referring expression segmentation performance on PhraseCut dataset
- Compared to scores reported by Wu et al. and MDETR
- Trained CLIPSeg on PhraseCut dataset with text labels and visual samples
- Outperformed two-stage HULANet approach by Wu et al.
- Performance worse than MDETR
- ViTSeg baseline performs worse than CLIPSeg
- Zero-shot segmentation goal is to segment objects of unseen categories
- Several methods address bias favoring seen classes
- One-shot semantic segmentation task introduced in 2017
- Several approaches focus on modeling prototypes
- Multiple derivative works of CLIP across different sub-fields
CLIPSeg method
- We use the visual transformer-based CLIP model (ViT-B/16) as a backbone for segmentation.
- We extend the model with a small, parameter-efficient transformer decoder.
- We use skip connections to the CLIP encoder to keep the decoder compact.
- We modulate the decoder’s input activation with a conditional vector.
- We interpolate between image and text embeddings as a data augmentation strategy.
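The last two points can be illustrated with a minimal sketch in plain Python. It assumes the modulation is a FiLM-style scale and shift, where the conditional vector is mapped to a per-channel scale and offset, and that the interpolation is a convex combination of a text embedding and a visual-prompt embedding. All function names, the toy weights, and the tiny dimensions are hypothetical and chosen for readability; this is not the authors' implementation.

```python
def interpolate_prompts(text_emb, image_emb, alpha):
    """Blend a text embedding and a visual-prompt embedding of equal length.

    alpha = 1.0 yields the pure text embedding, alpha = 0.0 the pure image one.
    """
    if len(text_emb) != len(image_emb):
        raise ValueError("embeddings must have the same dimension")
    return [alpha * t + (1.0 - alpha) * v for t, v in zip(text_emb, image_emb)]


def linear(weights, bias, x):
    """Matrix-vector product plus bias; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(activation, cond, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM-style modulation (assumed form): scale and shift each channel
    of `activation` using values computed from the conditional vector."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [g * a + b for g, a, b in zip(gamma, activation, beta)]


# Toy usage: a 2-dimensional conditional vector modulating 2 channels.
cond = interpolate_prompts([1.0, 0.0], [0.0, 1.0], alpha=0.25)
modulated = film(
    activation=[2.0, 4.0],
    cond=cond,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[1.0, 1.0],
)
```

During training, alpha would be drawn at random for each sample, so the decoder sees conditional vectors anywhere between the text and the image embedding.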
PhraseCut + visual prompts (PC+)
- PhraseCut dataset contains 340,000 phrases with corresponding image segmentations
- Extended dataset (PhraseCut+) includes visual support samples and negative samples
- Negative samples replace the phrase with a different phrase with probability q_neg
- Phrases augmented randomly using fixed prefixes
- Images randomly cropped under consideration of object locations
- PhraseCut+ supports training using image-text interpolation
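The negative-sample and prefix augmentations described above could look roughly like the following sketch. The prefix templates, the function name `make_sample`, and the convention that a negative sample is paired with an empty target mask are assumptions made for illustration; the summary only states that fixed prefixes are used and that phrases are swapped with probability q_neg.

```python
import random

# Hypothetical prefix templates; the source does not list the actual ones.
PREFIXES = ["a photo of a {}.", "an image of a {}.", "{}."]


def make_sample(phrases, index, q_neg, rng):
    """Return (prompt, is_negative) for the training sample at `index`.

    With probability q_neg the phrase is replaced by a different phrase.
    Assumption: the caller then uses an empty target mask for that sample.
    """
    phrase = phrases[index]
    others = [p for p in phrases if p != phrase]
    is_negative = bool(others) and rng.random() < q_neg
    if is_negative:
        phrase = rng.choice(others)
    # Prefix augmentation: wrap the phrase in a randomly chosen template.
    prompt = rng.choice(PREFIXES).format(phrase)
    return prompt, is_negative


rng = random.Random(0)
prompt, is_negative = make_sample(["red car", "wooden table"], 0, q_neg=0.2, rng=rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when debugging a data pipeline.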
Visual prompt engineering
- Conventional CNN-based one-shot semantic segmentation uses masked pooling to compute a prototype vector.
- Transformer-based architectures cannot use this method directly.
- A simple experiment was conducted to learn how target information can be incorporated into CLIP.
- The experiment compared the cosine distance between visual and text-based embeddings.
- Three image operations were identified to improve the alignment between the object text prompts and the images: decreasing background brightness, blurring the background, and cropping to the object.
- The combination of all three performed best.
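Two of the three image operations can be sketched on a toy grayscale image stored as nested lists. The function names, the darkening factor, and the margin parameter are hypothetical. Background blurring is omitted to keep the example short. A real pipeline would operate on image tensors rather than lists.

```python
def darken_background(image, mask, factor=0.1):
    """Scale pixels outside the object mask by `factor`; keep object pixels."""
    return [[px if m else px * factor for px, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def crop_to_object(image, mask, margin=0):
    """Crop the image to the mask's bounding box, padded by `margin` pixels."""
    rows = [r for r, mask_row in enumerate(mask) if any(mask_row)]
    cols = [c for mask_row in mask for c, m in enumerate(mask_row) if m]
    if not rows:
        return image  # empty mask: there is no object to crop to
    top = max(min(rows) - margin, 0)
    bottom = min(max(rows) + margin, len(image) - 1)
    left = max(min(cols) - margin, 0)
    right = min(max(cols) + margin, len(image[0]) - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


image = [[10, 10, 10],
         [10, 20, 10],
         [10, 10, 10]]
mask = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]

# Combine the operations: darken the background first, then crop.
visual_prompt = crop_to_object(darken_background(image, mask, factor=0.5), mask, margin=1)
```

The resulting image would then be passed through the CLIP visual encoder in place of the raw support image.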
Experiments
- Evaluated model on three established segmentation benchmarks
- Model can be based on either text or image prompts
- Model trained to generate binary predictions to indicate where objects matching the query are located
- Metrics used: IoU, mean IoU, binary IoU, AP
- Two baselines: CLIP-Deconv and ViTSeg
- PyTorch used for training
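As a reminder of how the IoU-based metrics are computed, here is a minimal sketch of intersection over union for a single binary prediction, plus a plain average over several scores. The threshold of 0.5 and the convention of returning 1.0 when both masks are empty are assumptions; the exact averaging scheme used for each benchmark is not specified in this summary.

```python
def binary_iou(pred, target, threshold=0.5):
    """IoU between a thresholded probability map and a binary target mask."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p_on = p > threshold
            t_on = bool(t)
            intersection += int(p_on and t_on)
            union += int(p_on or t_on)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(scores):
    """Average a list of IoU scores (for example one per class or per sample)."""
    return sum(scores) / len(scores)


pred = [[0.9, 0.8],
        [0.1, 0.7]]
target = [[1, 0],
          [0, 1]]
score = binary_iou(pred, target)  # 2 overlapping pixels out of 3 in the union
```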
Generalized zero-shot segmentation
- Generalized zero-shot segmentation involves categories that have never been seen before.
- Evaluated using Pascal-VOC benchmark with 2-10 unseen classes.
- Model trained on foreground/background segmentation, adapted to multi-label setting.
- Pascal classes removed from dataset by assigning to WordNet synsets and removing prompts with invalid words.
- Results indicate a gap between seen and unseen classes; the model performs better on unseen classes.
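The summary does not say how the binary model is adapted to the multi-label setting. One plausible scheme, shown here purely as an assumption, is to query the model once per class name and assign each pixel to the highest-scoring class, falling back to background when no class exceeds a threshold. The function name and the threshold value are hypothetical.

```python
def combine_class_maps(class_maps, background_threshold=0.5):
    """Merge per-class probability maps into one label map.

    class_maps: dict mapping class name -> 2D list of probabilities, one map
    per text prompt. Assumed rule: per-pixel argmax over classes, with
    "background" assigned when the best score is not above the threshold.
    """
    if not class_maps:
        raise ValueError("at least one class map is required")
    names = list(class_maps)
    first = class_maps[names[0]]
    height, width = len(first), len(first[0])
    labels = []
    for r in range(height):
        row = []
        for c in range(width):
            best = max(names, key=lambda n: class_maps[n][r][c])
            if class_maps[best][r][c] > background_threshold:
                row.append(best)
            else:
                row.append("background")
        labels.append(row)
    return labels


maps = {
    "cat": [[0.9, 0.2]],
    "dog": [[0.3, 0.1]],
}
label_map = combine_class_maps(maps)
```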
One-shot semantic segmentation
- Single example image and mask presented to network
- Regions of class highlighted in example image must be found in query image
- Cannot rely on text label, must understand support image
Ablation study
- Conducted an ablation study on PhraseCut
- Evaluated text-based and visual prompt-based performance separately
- Performance drops when random weights instead of CLIP weights are used
- Performance decreases substantially when the decoder is shrunk (D = 16)
Conclusion
- CLIPSeg image segmentation approach can be adapted to new tasks with text or image prompts
- Demonstrated competitive performance on referring expression, zero-shot and one-shot image segmentation tasks
- Model generalizes to novel prompts involving affordances and properties
- Tackling multiple tasks is a promising direction for future research
- Experiments limited to small number of benchmarks
- Model depends on a large-scale dataset for pre-training
- Model focuses on images, not video
- Image size may vary within certain limits
- Comparison of different text prompts, object sizes and object classes
- Model performs better on larger objects
- Performance over different classes is fairly balanced
- Model enables adaptation to new tasks without energy-intensive training