Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Contrastive loss is used to learn representations from multiple modalities.
  • Exact modality alignment is not optimal for downstream prediction tasks.
  • Three approaches are proposed to construct latent modality structures.
  • Experiments are conducted on two multi-modal representation learning frameworks.
  • Method achieves consistent improvements over existing methods.

Paper Content

Introduction

  • Aim to learn generic representations from images and texts
  • Unify representations of two modalities in one encoder
  • Represent image and text modality separately with modality-specific encoders
  • Utilize contrastive learning to align modalities
  • Modality gap defined as distance between feature distributions of two modalities
  • Contrastive learning does not always reduce modality gap
  • Theoretically study modality gap problem
  • Propose regularizations to construct better latent structures
  • Intra-modality, inter-modality, and intra-inter-modality regularizations
  • Unified models process both images and texts
  • Separate encoders for images and texts used in second category
  • Contrastive loss used to align multiple modalities
  • Third category uses separate encoders and late-fusion multi-modal encoder

Understanding the impact of modality gap on downstream performance

  • Modality alignment in feature space through contrastive learning is an open question
  • Notation: X_T and X_V denote input texts and images, Y denotes target variable
  • Modality gap problem is formally formulated
  • Relationship between modality gap and downstream performance is presented
  • Information-theoretical analysis is provided
  • Conditional entropy and cross-entropy loss are related
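
The relation between conditional entropy and cross-entropy loss noted above can be written as the standard identity below. It is a general information-theoretic fact rather than a result specific to the paper; the predictor symbol h and the loss notation are assumed here for illustration and may differ from the paper's.

```latex
\inf_{h} \; \mathbb{E}_{p}\big[\ell_{\mathrm{CE}}(h(X_T, X_V), Y)\big]
  \;=\; H(Y \mid X_T, X_V)
  \;=\; H(Y) - I(X_T, X_V; Y)
```

In words: the best achievable cross-entropy loss equals the uncertainty left in Y after observing both inputs, so more mutual information with Y means lower optimal loss.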

Empirical analysis on modality gap

  • Contrastive pretraining is used to align paired multimodal data in the feature space
  • Positive pairs are aligned to be closer together, while negative pairs are farther away
  • Experiments are conducted to explore the effect of reducing modality gap on image/text retrieval
  • Alignment loss is optimized during training to reduce the gap between modalities
  • Retrieval performance barely changes when changing the gap between two modalities
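
The contrastive alignment described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the paper's implementation: the temperature value, the function names, and the use of centroid distance as a stand-in measure of modality gap are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    """Average cross-entropy of matching queries[i] to keys[i] among all keys."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)

def clip_style_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss: image-to-text and text-to-image, averaged."""
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))

def centroid_gap(image_feats, text_feats):
    """Assumed gap measure: distance between mean normalized image and text features."""
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    dim = len(img[0])
    mu_i = [sum(v[d] for v in img) / len(img) for d in range(dim)]
    mu_t = [sum(v[d] for v in txt) / len(txt) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_i, mu_t)))

# Toy features: row i of `images` is paired with row i of `aligned_texts`.
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
aligned_texts = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = clip_style_loss(images, aligned_texts)
loss_shuffled = clip_style_loss(images, shuffled_texts)
gap = centroid_gap(images, aligned_texts)
```

In this toy setup the correctly paired batch gets a lower loss than the shuffled one, while the centroid distance is the same for both because it ignores the pairing. The example only shows that the two quantities measure different things; it does not reproduce the paper's experiment.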

An information-theoretic analysis on modality gap

  • Empirical observation suggests that reducing the modality gap in feature space does not always improve downstream task performance
  • Define the information gap as the difference in utility that the two modalities provide for predicting the target variable
  • Information gap only depends on joint distribution and is independent of modality encoders
  • Information gap serves as lower bound of downstream prediction error if seeking to find features with zero modality gap
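  • Theorem 3.1 states that, with zero modality gap, the optimal prediction error from the features is at least Δ_p (the information gap) larger than that from the input modalities, so a large information gap implies a large penalty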
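
A small worked example may make the information gap concrete. The sketch below assumes the gap is the absolute difference between the mutual information each modality shares with the target, |I(X_T; Y) − I(X_V; Y)|; the toy joint distribution and all names are made up for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint, x_index, y_index):
    """I(X;Y) in bits, from a dict mapping outcome tuples to probabilities."""
    p_xy = defaultdict(float)
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for outcome, p in joint.items():
        x, y = outcome[x_index], outcome[y_index]
        p_xy[(x, y)] += p
        p_x[x] += p
        p_y[y] += p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Toy joint distribution over (x_text, x_image, y):
# the text determines the label exactly, the image is independent noise.
joint = {
    (0, 0, 0): 0.25, (0, 1, 0): 0.25,
    (1, 0, 1): 0.25, (1, 1, 1): 0.25,
}

i_text = mutual_information(joint, 0, 2)   # text vs. label
i_image = mutual_information(joint, 1, 2)  # image vs. label
info_gap = abs(i_text - i_image)           # assumed definition of the gap
```

Here the text carries one full bit about the label and the image carries none, so the gap is one bit. The computation uses only the joint distribution of the data, which matches the point above that the information gap does not depend on the encoders.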

Method

  • Propose three designs to construct latent modality structures
  • Visualize designs in Fig. 3
  • Adopt multi-modal training framework with contrastive loss

Intra-modality regularization via deep feature separation

  • Aim to construct intra-modality structures to regularize in-modality representations
  • Define two types of information: modality-shared and modality-independent
  • Sub-optimal to match exact modality, so propose to explicitly model modality-independent information
  • Use feature separation to construct new features to store information
  • Constrain features to be orthogonal to original features
  • Adopt contrastive and uniformity loss on independent features
  • Preserve both modality-shared and modality-independent information
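
The orthogonality constraint and the uniformity loss listed above can be sketched as follows. This is an illustrative approximation rather than the paper's exact losses: the squared-cosine penalty, the Gaussian-potential form of the uniformity loss, the parameter t, and all function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonality_penalty(shared, independent):
    """Mean squared cosine between each shared feature and its independent counterpart."""
    total = 0.0
    for s, u in zip(shared, independent):
        total += dot(normalize(s), normalize(u)) ** 2
    return total / len(shared)

def uniformity_loss(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs of normalized features."""
    feats = [normalize(v) for v in features]
    potentials = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            sq_dist = sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
            potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))

# Shared features and two candidate sets of modality-independent features.
shared = [[1.0, 0.0], [0.0, 1.0]]
orthogonal_indep = [[0.0, 2.0], [3.0, 0.0]]
penalty_orthogonal = orthogonality_penalty(shared, orthogonal_indep)
penalty_collapsed = orthogonality_penalty(shared, shared)

# Uniformity is lower (better) when features spread out over the unit sphere.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
uniform_spread = uniformity_loss(spread)
uniform_clustered = uniformity_loss(clustered)
```

The penalty is zero when the independent features are orthogonal to the shared ones and largest when they coincide. The uniformity term discourages the independent features from collapsing to a single point.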

Inter-modality regularization via Brownian bridge

  • Regularizing inter-modality structures by constraining paired modality features in a subspace
  • Constructing a latent structure to guide transition from image modality to associated text modality
  • Modeling transition with Brownian bridge, which is a special type of Brownian motion
  • Aligning augmented image feature with mean of Brownian bridge to fit the model
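
The idea of pulling the augmented image feature toward the mean of a Brownian bridge can be sketched as below. The sketch relies on the standard properties of a Brownian bridge on the unit time interval (mean linearly interpolated between the endpoints, variance t(1 − t)). The fixed choice t = 0.5, the scaling of the loss, and the function names are assumptions, not details taken from the paper.

```python
def bridge_mean(start, end, t):
    """Mean of a Brownian bridge pinned at `start` (t=0) and `end` (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(start, end)]

def bridge_variance(t):
    """Per-coordinate variance of a unit-time Brownian bridge at time t."""
    return t * (1.0 - t)

def brownian_bridge_loss(image_feat, text_feat, augmented_image_feat, t=0.5):
    """Squared distance to the bridge mean, scaled by the bridge variance at time t."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    mean = bridge_mean(image_feat, text_feat, t)
    sq_dist = sum((z - m) ** 2 for z, m in zip(augmented_image_feat, mean))
    return sq_dist / (2.0 * bridge_variance(t))

image_feat = [1.0, 0.0]
text_feat = [0.0, 1.0]

# An augmented feature sitting exactly on the bridge mean incurs no loss.
loss_on_bridge = brownian_bridge_loss(image_feat, text_feat, [0.5, 0.5])
# A feature away from the path between image and text is penalized.
loss_off_bridge = brownian_bridge_loss(image_feat, text_feat, [1.0, 1.0])
```

The loss is smallest when the augmented image feature lies on the path between the image feature and its paired text feature. This is one way to constrain paired features to a shared subspace.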

Intra-inter regularization via geometric consistency

  • Aim to design a general regularizer that considers both intra- and inter-modality structures
  • Enforce geometric symmetry within and between modality representations and their augmentations
  • Encourage symmetric similarity for mismatched image-text pairs, and match the similarity between image pairs to the similarity between the corresponding text pairs
  • Optimize geometric symmetry between feature pairs and augmented feature pairs in the text and image space
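
A geometric consistency regularizer of this kind can be sketched as follows. The squared-difference form, the normalization by batch size, and the names are assumptions for illustration. The augmented-feature terms mentioned above are omitted; they would follow the same pattern with augmented features substituted in.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def geometric_consistency_loss(image_feats, text_feats):
    """Penalize asymmetry across modalities and mismatch of within-modality geometry."""
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    n = len(img)
    cross = 0.0   # sim(image_i, text_j) should equal sim(image_j, text_i)
    within = 0.0  # sim(image_i, image_j) should equal sim(text_i, text_j)
    for i in range(n):
        for j in range(n):
            cross += (dot(img[i], txt[j]) - dot(img[j], txt[i])) ** 2
            within += (dot(img[i], img[j]) - dot(txt[i], txt[j])) ** 2
    return (cross + within) / (n * n)

images = [[1.0, 0.0], [0.0, 1.0]]
consistent_texts = [[1.0, 0.0], [0.0, 1.0]]
distorted_texts = [[1.0, 0.0], [1.0, 1.0]]

loss_consistent = geometric_consistency_loss(images, consistent_texts)
loss_distorted = geometric_consistency_loss(images, distorted_texts)
```

The loss vanishes when both modalities share the same pairwise geometry and grows as the text features are distorted relative to the image features.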

Experiments

  • Proposed methods are general purpose
  • Evaluated with two popular multi-modal representation frameworks: two-tower based models and fusion based models

Two-tower-based models

  • CLIP-based models used with two separate encoders to align features from image and text modalities
  • Regularization losses applied along with standard contrastive loss for pre-training
  • ResNet-50 and BERT used as image and text encoders respectively
  • Reproduced CLIP results consistent with recent works
  • Union of four datasets used for pre-training
  • Zero-shot transfer on standard image classification tasks with CIFAR10, CIFAR100 and ImageNet1K
  • Text prompts constructed using class name
  • Results show importance of latent modality structures
  • Performance degradation on natural distribution shift benchmarks
  • All methods outperform baselines on linear probing tasks
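
Zero-shot classification with prompts built from class names can be sketched as below. The prompt template, the toy embeddings standing in for encoder outputs, and the function names are all hypothetical. The summary only states that prompts are constructed from the class name.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_prompts(class_names, template="a photo of a {}"):
    """Turn class names into text prompts (the template is an assumed example)."""
    return [template.format(name) for name in class_names]

def zero_shot_predict(image_feat, class_text_feats):
    """Return the index of the class whose prompt embedding is most similar to the image."""
    img = normalize(image_feat)
    sims = [dot(img, normalize(t)) for t in class_text_feats]
    return max(range(len(sims)), key=sims.__getitem__)

class_names = ["cat", "dog", "truck"]
prompts = build_prompts(class_names)

# Hand-written stand-ins for text-encoder outputs, one vector per prompt.
class_text_feats = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hand-written stand-in for an image-encoder output.
image_feat = [0.2, 0.9, 0.1]

predicted = class_names[zero_shot_predict(image_feat, class_text_feats)]
```

In a real pipeline the vectors would come from the pretrained text and image encoders. No labeled examples from the downstream dataset are needed, which is what makes the transfer zero-shot.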

Fusion-based models

  • Tested methods on fusion-based models
  • Used ALBEF framework to fuse modalities
  • Fusion-based models are better at learning inter-modal interaction
  • Evaluated methods on various vision-language downstream tasks
  • Used ViT-B/16 as vision encoder and 12-layer BERT-base as text encoder
  • Results show 1-2% improvement on test sets

Conclusion

  • Investigated latent modality structures in multi-modal representation learning
  • Analyzed modality gap in latent feature space
  • Revealed reducing modality gap to zero does not always lead to better performance
  • Proposed three regularization methods to construct meaningful latent structures
  • Deep feature separation loss
  • Brownian bridge loss
  • Geometric consistency loss
  • Confirmed effectiveness and generalizability of proposed approach on popular contrastive representation learning frameworks
  • Theorem 3.1 states that if modality encoders are perfectly aligned in feature space, the optimal prediction error is lower-bounded by the information gap
  • Visualized constructing latent structures
  • Visualized two-tower-based and fusion-based models
  • Additional experimental results show all three regularizations improve performance