Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Semantic segmentation is a challenging task for parsing different contexts.
- A context-aware classifier is used to adapt to different latent distributions.
- The method is model-agnostic and can be applied to generic segmentation models.
- With negligible additional parameters and +2% inference time, decent performance gain is achieved.
Paper Content
Introduction
- Semantic segmentation has been used in a wide range of applications
- Recent advances in model structure focus on strong backbones and decoder heads
- Classifier in recent literature is composed of shared parameters for all images
- This can lead to difficulty in handling diverse contexts
- Enriching classifier with contextual information can improve performance
- Entropy-aware KL loss is designed to mitigate information imbalance
- Method can be plugged into existing segmentation models with little efficiency compensation
Related work
- Semantic segmentation is a challenging task that requires precise pixel-wise predictions.
- Contextual information is used to improve performance.
- FCN proposed convolution layers for the task.
- Decoders are used to up-sample encoded features.
- Receptive field is increased with dilated convolutions, global pooling and pyramid pooling.
- Pixel and region contrasts are exploited.
- Transformer is used to model long-range relationships.
- Transformers are used as feature encoders and decoder heads.
Motivation
- A generic deep model is composed of two modules: a feature generator and a classifier.
- The feature generator takes an input image and projects it into a high-dimensional feature.
- The classifier uses the feature to make pixel-wise predictions.
- The classifier should be “context-aware” to different samples to improve performance.
Is context-aware classifier necessary?
- Contextual cues can be mined from extracted features to enrich the classifier.
- A case study is conducted to verify the hypothesis that the proposed context-aware classifier is conducive to the model performance.
- A lightweight projector is used to combine the categorical prototypes and the vanilla classifier to form the oracle context-aware classifier.
Learning context-aware classifier
- Without ground-truth labels, prediction p is used to approximate the oracle contextual prior
- Prediction p is the result of the original classifier
- Estimated contextual prototypes C p are yielded with p
- Context-aware classifier A p is yielded by processing the concatenation of the estimated contextual prior C p and the original classifier C
- Prediction p p is the result of the temporarily estimated context-aware classifier A p
- Cosine similarities are used to calculate the contextual prototypes
- Pixel-wise cross-entropy (CE) loss L ce p is used to supervise p p
- KL divergence L KL is incorporated to regularize the model
- Entropy H is calculated to adjust the contribution of each element
- Self-attention (SA) dynamically adapts to different inputs
Experiments
Implementation
- Adopted two challenging semantic segmentation benchmarks
- Trained and evaluated models on training and validation sets
- Results of Cityscapes and Pascal-Context shown in supplementary
- Investigated convolution-based and transformer-based models
- Reported single-scale and multi-scale results
Results
- Verified effectiveness and generalization ability of proposed method with various decoder heads and backbones
- Decent performance gain achieved on ADE20K and COCO-Stuff 164K
- Improvement not originated from newly introduced parameters
- Predicted masks more visually attractive
Ablation study
- Experimental results are presented to investigate the effectiveness of each component of the proposed method
- Ablation study conducted on ADE20K with UperNet and Swin-Tiny as baseline model
- L ce supervises the original classifier’s prediction p
- Supervisions on p and A y are both essential
- Without L ce , L ce y and L KL , merely supervising p p even worsens the baseline’s performance
- Comparison between (b) and (c) tells the importance of L ce
- Necessities of L KL and L ce y
- Vanilla KL loss achieves comparable to Exp. (a) without KL loss
- Entropy-based KL in Exp. (c) incrementally improves Exp. (a)
- Class-wise calculation to (b) helps alleviate the imbalance between different classes
- Incorporating the entropy estimation with the class-wise KL in Exp. (e) achieves persuasive performance
- Oracle predictions p y are more favorable than p p (Est.) and p (Ori.) for estimating the entropy mask
- Cosine similarity focuses on the angle between two vectors, while dot product considers both angle and magnitudes
- Context-aware classifier implemented with cosine similarity achieves favorable results
- Scaling factor τ set to 15 in all experiments
- Impacts brought to the model efficiency are minor negative impacts
Concluding remarks
- Learning context-aware classifier captures and leverages useful contextual information in different samples
- Improves performance by forming specific descriptors for individual latent distributions
- Entropy-aware distillation loss proposed to better mine informative hints
- Easily applied to generic segmentation models
- Boosts both small and large models with favorable improvements
- Results show effectiveness of proposed design and alleviation of instability issues