Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Semantic segmentation remains challenging because a model must parse many different contexts.
  • A context-aware classifier is learned so that the classifier adapts to the latent distribution of each sample.
  • The method is model-agnostic and can be applied to generic segmentation models.
  • A decent performance gain is achieved with negligible additional parameters and +2% inference time.

Paper Content

Introduction

  • Semantic segmentation has been used in a wide range of applications
  • Recent advances in model structure focus on strong backbones and decoder heads
  • The classifier in recent literature uses parameters that are shared across all images
  • A shared classifier can struggle to handle diverse contexts
  • Enriching the classifier with contextual information can improve performance
  • An entropy-aware KL loss is designed to mitigate information imbalance
  • The method can be plugged into existing segmentation models at little cost in efficiency
  • Semantic segmentation is a challenging task that requires precise pixel-wise predictions.
  • Contextual information is used to improve performance.
  • FCN proposed convolution layers for the task.
  • Decoders are used to up-sample encoded features.
  • Receptive field is increased with dilated convolutions, global pooling and pyramid pooling.
  • Pixel and region contrasts are exploited.
  • Transformer is used to model long-range relationships.
  • Transformers are used as feature encoders and decoder heads.

Motivation

  • A generic deep model is composed of two modules: a feature generator and a classifier.
  • The feature generator takes an input image and projects it into a high-dimensional feature.
  • The classifier uses the feature to make pixel-wise predictions.
  • The classifier should be “context-aware” to different samples to improve performance.
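
The two-module view above can be illustrated with a minimal sketch of the shared (vanilla) classifier: every pixel feature is scored against one fixed weight vector per class, regardless of which image it came from. The function names, the toy features, and the weights are hypothetical and chosen only for illustration; a real model would apply this as a 1x1 convolution over a feature map.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_pixels(features, class_weights):
    """Score each D-dim pixel feature against one D-dim weight vector per class."""
    probs = []
    for f in features:
        logits = [sum(a * b for a, b in zip(f, w)) for w in class_weights]
        probs.append(softmax(logits))
    return probs

# Toy data (assumed): 3 pixels with 2-dim features, 2 classes.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]
# The same weights are reused for every image: this is the "shared" classifier.
class_weights = [[2.0, 0.0], [0.0, 2.0]]

probs = classify_pixels(features, class_weights)
labels = [p.index(max(p)) for p in probs]
```

Because `class_weights` never changes between images, the classifier cannot adjust to a sample whose features for a class drift away from the shared weight vector, which is the limitation the context-aware classifier targets.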

Is context-aware classifier necessary?

  • Contextual cues can be mined from extracted features to enrich the classifier.
  • A case study is conducted to verify the hypothesis that the proposed context-aware classifier is conducive to the model performance.
  • A lightweight projector is used to combine the categorical prototypes and the vanilla classifier to form the oracle context-aware classifier.
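
A rough sketch of the oracle case study, under stated assumptions: categorical prototypes are taken to be the mean feature of the pixels that the ground truth assigns to each class, and the "lightweight projector" is replaced by a single hypothetical linear map over the concatenation of prototype and classifier weight. The summary does not specify the projector's exact form, so `project` and the matrix `proj` are illustrative stand-ins.

```python
def oracle_prototypes(features, labels, num_classes):
    """Average the features of the pixels that the ground truth assigns to each class."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for f, y in zip(features, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += f[d]
    # A class that is absent from the image keeps an all-zero prototype.
    return [[s / c if c else 0.0 for s in row] for row, c in zip(sums, counts)]

def project(prototypes, class_weights, proj):
    """Hypothetical linear projector applied to [prototype ; classifier weight]."""
    out = []
    for proto, w in zip(prototypes, class_weights):
        concat = proto + w  # list concatenation, length 2 * D
        out.append([sum(a * b for a, b in zip(row, concat)) for row in proj])
    return out

# Toy data (assumed): 3 pixels, 2-dim features, ground-truth labels for 2 classes.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
gt_labels = [0, 0, 1]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
# D x 2D matrix (assumed) that averages the prototype with the classifier weight.
proj = [[0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5]]

prototypes = oracle_prototypes(features, gt_labels, num_classes=2)
oracle_classifier = project(prototypes, class_weights, proj)
```

The resulting `oracle_classifier` has one weight vector per class, like the vanilla classifier, but each vector now depends on the features of the current image.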

Learning context-aware classifier

  • Without ground-truth labels, the prediction p is used to approximate the oracle contextual prior
  • Prediction p is the result of the original classifier
  • Estimated contextual prototypes C_p are yielded with p
  • Context-aware classifier A_p is yielded by processing the concatenation of the estimated contextual prior C_p and the original classifier C
  • Prediction p_p is the result of the temporarily estimated context-aware classifier A_p
  • Cosine similarities are used to calculate the contextual prototypes
  • Pixel-wise cross-entropy (CE) loss L_ce^p is used to supervise p_p
  • KL divergence L_KL is incorporated to regularize the model
  • Entropy H is calculated to adjust the contribution of each element
  • Self-attention (SA) dynamically adapts to different inputs

Experiments

Implementation

  • Adopted two challenging semantic segmentation benchmarks (ADE20K and COCO-Stuff 164K)
  • Trained models on the training sets and evaluated them on the validation sets
  • Results on Cityscapes and Pascal-Context are shown in the supplementary material
  • Investigated convolution-based and transformer-based models
  • Reported single-scale and multi-scale results

Results

  • Verified effectiveness and generalization ability of proposed method with various decoder heads and backbones
  • Decent performance gain achieved on ADE20K and COCO-Stuff 164K
  • Improvement does not originate from the newly introduced parameters
  • Predicted masks are visually better

Ablation study

  • Experimental results are presented to investigate the effectiveness of each component of the proposed method
  • Ablation study conducted on ADE20K with UperNet and Swin-Tiny as baseline model
  • L_ce supervises the original classifier’s prediction p
  • Supervision on both p and A_y is essential
  • Without L_ce, L_ce^y and L_KL, supervising p_p alone even performs worse than the baseline
  • Comparison between (b) and (c) shows the importance of L_ce
  • Necessity of L_KL and L_ce^y
  • Vanilla KL loss performs comparably to Exp. (a), which uses no KL loss
  • Entropy-based KL in Exp. (c) incrementally improves over Exp. (a)
  • Adding class-wise calculation to Exp. (b) helps alleviate the imbalance between different classes
  • Incorporating the entropy estimation with the class-wise KL in Exp. (e) achieves strong performance
  • Oracle predictions p_y are more favorable than p_p (Est.) and p (Ori.) for estimating the entropy mask
  • Cosine similarity focuses on the angle between two vectors, while dot product considers both angle and magnitudes
  • Context-aware classifier implemented with cosine similarity achieves favorable results
  • Scaling factor τ set to 15 in all experiments
  • The negative impact on model efficiency is minor
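
The entropy-based KL term compared in the ablation can be sketched as a per-pixel KL divergence whose contribution is scaled by an entropy mask. The sketch below rests on several assumptions that the summary does not confirm: the oracle prediction acts as the teacher, the estimated prediction as the student, higher-entropy teacher pixels receive larger weights, and the loss is normalized by the sum of the weights. The class-wise variant is omitted, and all names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy of one probability vector (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) for one pixel; q is clamped to avoid log of zero."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0.0)

def entropy_weighted_kl(teacher, student):
    """Average per-pixel KL, weighted by the entropy of the teacher prediction (assumed form)."""
    weights = [entropy(t) for t in teacher]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every teacher pixel is one-hot, so no pixel carries weight
    return sum(w * kl(t, s) for w, t, s in zip(weights, teacher, student)) / total

# Toy data (assumed): 2 pixels, 2 classes.
oracle_pred = [[0.5, 0.5], [1.0, 0.0]]     # plays the role of the oracle prediction
estimated_pred = [[0.9, 0.1], [0.0, 1.0]]  # plays the role of the estimated prediction

loss = entropy_weighted_kl(oracle_pred, estimated_pred)
```

In this toy case the second pixel has a one-hot teacher distribution, so its entropy weight is zero and only the first pixel contributes to `loss`.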

Concluding remarks

  • Learning context-aware classifier captures and leverages useful contextual information in different samples
  • Improves performance by forming specific descriptors for individual latent distributions
  • Entropy-aware distillation loss proposed to better mine informative hints
  • Easily applied to generic segmentation models
  • Boosts both small and large models with favorable improvements
  • Results show effectiveness of proposed design and alleviation of instability issues