Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Contrastive loss is used to learn representations from multiple modalities.
Exact modality alignment is not optimal for downstream prediction tasks.
Three approaches are proposed to construct latent modality structures.
Experiments are conducted on two multi-modal representation learning frameworks.
The proposed methods achieve consistent improvements over existing methods.

Paper Content

Introduction

Aim to learn generic representations from images and texts
Unify representations of two modalities in one encoder
Represent image and text modality separately with modality-specific encoders
Utilize contrastive learning to align modalities
Modality gap defined as distance between feature distributions of two modalities
Contrastive learning does not always reduce modality gap
Theoretically study modality gap problem
Propose regularizations to construct better latent structures
Intra-modality, inter-modality, and intra-inter-modality regularizations

Related work

Unified models process both images and texts in one encoder
A second category uses separate encoders for images and texts
Contrastive loss is used to align the modalities
A third category uses separate encoders followed by a late-fusion multi-modal encoder

Understanding the impact of modality gap on downstream performance

Whether aligning modalities in feature space through contrastive learning is optimal for downstream tasks remains an open question
Notation: X_T and X_V denote input texts and images, Y denotes the target variable
Modality gap problem is formally formulated
Relationship between modality gap and downstream performance is presented
Information-theoretical analysis is provided
Conditional entropy and cross-entropy loss are related
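
The relation between conditional entropy and cross-entropy loss noted above is a standard identity. It is sketched here for a generic input X (standing for X_T, X_V, or both) and a predictor h that outputs a distribution over Y. The symbols h and \ell_{CE} are notation assumed for this sketch, not taken from the summary.

```latex
\inf_{h} \; \mathbb{E}_{p}\big[\ell_{\mathrm{CE}}(h(X), Y)\big] = H(Y \mid X),
\qquad
I(X; Y) = H(Y) - H(Y \mid X).
```

Since H(Y) is fixed by the data, a lower achievable cross-entropy corresponds to higher mutual information between the input and the target.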

Empirical analysis on modality gap

Contrastive pretraining is used to align paired multimodal data in the feature space
Positive pairs are aligned to be closer together, while negative pairs are farther away
Experiments are conducted to explore the effect of reducing modality gap on image/text retrieval
Alignment loss is optimized during training to reduce the gap between modalities
Retrieval performance barely changes when the gap between the two modalities is changed
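
A minimal sketch of the two quantities discussed above: a symmetric contrastive (InfoNCE-style) loss over paired features, and a modality gap measure. Measuring the gap as the distance between the two feature centroids is an assumption made for illustration, as are the function names, the temperature value, and the toy features.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)

def contrastive_loss(img, txt, temperature=0.1):
    """Symmetric loss: image-to-text plus text-to-image, averaged."""
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))

def modality_gap(img, txt):
    """Assumed gap measure: Euclidean distance between modality centroids."""
    dim = len(img[0])
    c_img = [sum(v[d] for v in img) / len(img) for d in range(dim)]
    c_txt = [sum(v[d] for v in txt) / len(txt) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))

# Toy paired features (hypothetical): row i of img matches row i of txt.
img = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
txt = [normalize([1.0, 0.2]), normalize([0.2, 1.0])]
aligned = contrastive_loss(img, txt)
shuffled = contrastive_loss(img, txt[::-1])  # pairs deliberately mismatched
gap = modality_gap(img, txt)
```

Correctly paired features give a much lower loss than mismatched ones, while the centroid gap is a separate quantity that can be small or large independently of the pairing.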

An information-theoretic analysis on modality gap

The empirical observation motivates the analysis: reducing the modality gap in feature space does not always lead to better downstream task performance
The information gap is defined to characterize the difference in utility that the two modalities provide for predicting the target variable
The information gap depends only on the joint distribution and is independent of the modality encoders
The information gap serves as a lower bound on the downstream prediction error when seeking features with zero modality gap
Theorem 3.1 states that if the features are perfectly aligned, the optimal prediction error from the features is at least ∆_p (the information gap) larger than that achievable from the input modalities
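
A worked toy example of an information gap on discrete data. The sketch assumes the gap takes the form |I(X_T; Y) - I(X_V; Y)|, the absolute difference of the mutual information each modality carries about Y; that form, the function names, and the toy distributions are assumptions for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X; Y) in nats from a dict {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

def information_gap(joint_text_y, joint_image_y):
    """Assumed form of the gap: |I(X_T; Y) - I(X_V; Y)|."""
    return abs(mutual_information(joint_text_y) - mutual_information(joint_image_y))

# Toy joint distributions (hypothetical):
# the text determines Y exactly, the image is independent of Y.
text_y = {("cat", 0): 0.5, ("dog", 1): 0.5}
image_y = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.25, ("b", 1): 0.25}

gap = information_gap(text_y, image_y)  # log 2, about 0.693 nats
```

Here the text carries log 2 nats about Y and the image carries none, so the gap is log 2. The quantity is computed from the joint distributions alone, with no encoder involved, which matches the statement that it is independent of the modality encoders.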

Method

Propose three designs to construct latent modality structures
Visualize designs in Fig. 3
Adopt multi-modal training framework with contrastive loss

Intra-modality regularization via deep feature separation

Aim to construct intra-modality structures to regularize in-modality representations
Define two types of information: modality-shared and modality-independent
Exact modality matching is sub-optimal, so modality-independent information is modeled explicitly
Use feature separation to construct new features that store the modality-independent information
Constrain the new features to be orthogonal to the original features
Adopt contrastive and uniformity loss on independent features
Preserve both modality-shared and modality-independent information
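
A minimal sketch of two ingredients named above: an orthogonality constraint between the original (shared) features and the new (independent) features, and a uniformity loss on the independent features. The squared inner product penalty and the log-mean-exp form of the uniformity loss are common choices assumed here; the paper's exact formulation is not given in this summary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonality_penalty(shared, independent):
    """Mean squared inner product between each sample's two feature vectors.

    Zero exactly when every independent feature is orthogonal to its shared feature.
    """
    return sum(dot(s, u) ** 2 for s, u in zip(shared, independent)) / len(shared)

def uniformity_loss(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the features are spread more evenly.
    """
    vals = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
            vals.append(math.exp(-t * sq_dist))
    return math.log(sum(vals) / len(vals))

# Toy unit-norm features (hypothetical).
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

In a training loop these terms would be added to the contrastive objective with weights chosen by the practitioner; the weighting is left out because the summary does not state it.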

Inter-modality regularization via Brownian bridge

Regularizing inter-modality structures by constraining paired modality features in a subspace
Constructing a latent structure to guide transition from image modality to associated text modality
Modeling transition with Brownian bridge, which is a special type of Brownian motion
Aligning augmented image feature with mean of Brownian bridge to fit the model
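
A minimal sketch of the Brownian bridge idea, assuming the bridge is pinned at the image feature at time t = 0 and at the paired text feature at t = 1. For such a bridge the mean at time t is the linear interpolation of the endpoints and the variance is proportional to t(1 - t). The plain squared-distance loss and all names below are assumptions; the paper's exact scaling is not given in this summary.

```python
def bridge_mean(z_img, z_txt, t):
    """Mean of a Brownian bridge from z_img (t=0) to z_txt (t=1) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_img, z_txt)]

def bridge_variance(t, sigma=1.0):
    """Per-dimension variance of the bridge at time t; zero at both endpoints."""
    return sigma ** 2 * t * (1.0 - t)

def bridge_loss(z_aug, z_img, z_txt, t):
    """Squared distance between the augmented image feature and the bridge mean."""
    mean = bridge_mean(z_img, z_txt, t)
    return sum((a - m) ** 2 for a, m in zip(z_aug, mean))

# Toy features (hypothetical): the augmented feature sits exactly on the bridge mean.
z_img, z_txt = [0.0, 0.0], [2.0, 4.0]
on_bridge = bridge_loss([1.0, 2.0], z_img, z_txt, t=0.5)
off_bridge = bridge_loss([0.0, 0.0], z_img, z_txt, t=0.5)
```

Minimizing this loss pulls the augmented image feature toward the segment joining the image feature and its text feature, which is one way to constrain paired features to a subspace.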

Intra-inter regularization via geometric consistency

Aim to design a general regularizer that considers both intra- and inter-modality structures
Enforce geometric symmetry within and between modality representations and their augmentations
Match similarities between mismatched image-text pairs (image i with text j versus image j with text i), and between image pairs and the corresponding text pairs
Optimize geometric symmetry between feature pairs and augmented feature pairs in the text and image space
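
A minimal sketch of a geometric consistency penalty under one reading of the lines above: for samples i and j, the similarity of image i to text j should match that of image j to text i, and the similarity of images i and j should match that of texts i and j. The squared-difference penalty and the function names are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def geometric_consistency(img, txt):
    """Mean squared symmetry violation over all ordered pairs i != j.

    Requires at least two paired samples.
    """
    n = len(img)
    inter = 0.0  # image i vs text j should mirror image j vs text i
    intra = 0.0  # image i vs image j should mirror text i vs text j
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            inter += (dot(img[i], txt[j]) - dot(img[j], txt[i])) ** 2
            intra += (dot(img[i], img[j]) - dot(txt[i], txt[j])) ** 2
    return (inter + intra) / (n * (n - 1))

# Toy features (hypothetical).
symmetric = geometric_consistency([[1, 0], [0, 1]], [[1, 0], [0, 1]])
asymmetric = geometric_consistency([[1, 0], [0, 1]], [[1, 0], [1, 0]])
```

The same function can be applied to augmented features by passing the augmented image or text features in place of the originals.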

Experiments

Proposed methods are general purpose
Evaluated with two popular multi-modal representation frameworks: two-tower based models and fusion based models

Two-tower-based models

CLIP-based models used with two separate encoders to align features from image and text modalities
Regularization losses applied along with standard contrastive loss for pre-training
ResNet-50 and BERT used as image and text encoders respectively
Reproduced CLIP results consistent with recent works
Union of four datasets used for pre-training
Zero-shot transfer on standard image classification tasks with CIFAR10, CIFAR100 and ImageNet1K
Text prompts constructed using class name
Results show importance of latent modality structures
Performance degradation on natural distribution shift benchmarks
All methods outperform baselines on linear probing tasks
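
A minimal sketch of zero-shot classification with text prompts built from class names. The prompt template, the `encode_text` callable, and the toy features are hypothetical stand-ins; a real setup would use the pre-trained text and image encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def build_prompts(class_names, template="a photo of a {}."):
    """One text prompt per class name (the template is an assumption)."""
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_feature, class_names, encode_text,
                       template="a photo of a {}."):
    """Return the class whose prompt embedding is most similar to the image."""
    prompts = build_prompts(class_names, template)
    scores = [cosine(image_feature, encode_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores

# Toy text encoder (hypothetical): a lookup table instead of a real model.
toy_text_features = {
    "a photo of a cat.": [1.0, 0.0],
    "a photo of a dog.": [0.0, 1.0],
}
label, scores = zero_shot_classify(
    [0.2, 0.9], ["cat", "dog"], toy_text_features.__getitem__
)
```

No classifier is trained here: the class is chosen purely by comparing the image feature with the prompt features, which is why the quality of the shared feature space matters for this task.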

Fusion-based models

Tested methods on fusion-based models
Used ALBEF framework to fuse modalities
Fusion-based models are better at learning inter-modal interaction
Evaluated methods on various vision-language downstream tasks
Used ViT-B/16 as vision encoder and 12-layer BERT-base as text encoder
Results show 1-2% improvement on test sets

Conclusion

Investigated latent modality structures in multi-modal representation learning
Analyzed modality gap in latent feature space
Revealed reducing modality gap to zero does not always lead to better performance
Proposed three regularization methods to construct meaningful latent structures
Deep feature separation loss
Brownian bridge loss
Geometric consistency loss
Confirmed effectiveness and generalizability of proposed approach on popular contrastive representation learning frameworks
Theorem 3.1 states that if modality encoders are perfectly aligned in feature space, the optimal prediction error from the features is at least ∆_p larger than that achievable from the input modalities
Visualized constructing latent structures
Visualized two-tower-based and fusion-based models
Additional experimental results show all three regularizations improve performance
