Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Contrastive loss is used to learn representations from multiple modalities.
- Exact modality alignment is not optimal for downstream prediction tasks.
- Three approaches are proposed to construct latent modality structures.
- Experiments are conducted on two multi-modal representation learning frameworks.
- Method achieves consistent improvements over existing methods.
Paper Content
Introduction
- Aim to learn generic representations from images and texts
- Unify representations of two modalities in one encoder
- Represent image and text modality separately with modality-specific encoders
- Utilize contrastive learning to align modalities
- Modality gap defined as distance between feature distributions of two modalities
- Contrastive learning does not always reduce modality gap
- Theoretically study modality gap problem
- Propose regularizations to construct better latent structures
- Intra-modality, inter-modality, and intra-inter-modality regularizations
Related work
- Unified models process both images and texts
- Separate encoders for images and texts used in second category
- Contrastive loss used to align multiple modalities
- Third category uses separate encoders and late-fusion multi-modal encoder
Understanding the impact of modality gap on downstream performance
- Modality alignment in feature space through contrastive learning is an open question
- Notation: X T and X V denote input texts and images, Y denotes target variable
- Modality gap problem is formally formulated
- Relationship between modality gap and downstream performance is presented
- Information-theoretical analysis is provided
- Conditional entropy and cross-entropy loss are related
Empirical analysis on modality gap
- Contrastive pretraining is used to align paired multimodal data in the feature space
- Positive pairs are aligned to be closer together, while negative pairs are farther away
- Experiments are conducted to explore the effect of reducing modality gap on image/text retrieval
- Alignment loss is optimized during training to reduce the gap between modalities
- Retrieval performance barely changes when changing the gap between two modalities
An information-theoretic analysis on modality gap
- Inspired by empirical observation, reducing modality gap in feature space does not always lead to better downstream task performance
- Defined information gap to characterize gap of utility provided by two modalities towards predicting target variable
- Information gap only depends on joint distribution and is independent of modality encoders
- Information gap serves as lower bound of downstream prediction error if seeking to find features with zero modality gap
- Theorem 3.1 states that if information gap is large, optimal prediction error is at least โ p larger than input modalities
Method
- Propose three designs to construct latent modality structures
- Visualize designs in Fig. 3
- Adopt multi-modal training framework with contrastive loss
Intra-modality regularization via deep feature separation
- Aim to construct intra-modality structures to regularize in-modality representations
- Define two types of information: modality-shared and modality-independent
- Sub-optimal to match exact modality, so propose to explicitly model modality-independent information
- Use feature separation to construct new features to store information
- Constrain features to be orthogonal to original features
- Adopt contrastive and uniformity loss on independent features
- Preserve both modality-shared and modality-independent information
Inter-modality regularization via brownian bridge
- Regularizing inter-modality structures by constraining paired modality features in a subspace
- Constructing a latent structure to guide transition from image modality to associated text modality
- Modeling transition with Brownian bridge, which is a special type of Brownian motion
- Aligning augmented image feature with mean of Brownian bridge to fit the model
Intra-inter regularization via geometric consistency
- Aim to design a general regularizer that considers both intra-and inter-modality structures
- Enforce geometric symmetry within and between modality representations and their augmentations
- Optimize similarity between mismatched image and text pairs, and between image pairs and text pairs
- Optimize geometric symmetry between feature pairs and augmented feature pairs in the text and image space
Experiments
- Proposed methods are general purpose
- Evaluated with two popular multi-modal representation frameworks: two-tower based models and fusion based models
Two-tower-based models
- CLIP-based models used with two separate encoders to align features from image and text modalities
- Regularization losses applied along with standard contrastive loss for pre-training
- ResNet-50 and BERT used as image and text encoders respectively
- Reproduced CLIP results consistent with recent works
- Union of four datasets used for pre-training
- Zero-shot transfer on standard image classification tasks with CIFAR10, CIFAR100 and ImageNet1K
- Text prompts constructed using class name
- Results show importance of latent modality structures
- Performance degradation on natural distribution shift benchmarks
- All methods outperform baselines on linear probing tasks
Fusion-based models
- Tested methods on fusion-based models
- Used ALBEF framework to fuse modalities
- Fusion-based models are better at learning inter-model interaction
- Evaluated methods on various vision-language downstream tasks
- Used ViT-B/16 as vision encoder and 12layer BERT base as text encoder
- Results show 1-2% improvement on test sets
Conclusion
- Investigated latent modality structures in multi-modal representation learning
- Analyzed modality gap in latent feature space
- Revealed reducing modality gap to zero does not always lead to better performance
- Proposed three regularization methods to construct meaningful latent structures
- Deep feature separation loss
- Brownian bridge loss
- Geometric consistency loss
- Confirmed effectiveness and generalizability of proposed approach on popular contrastive representation learning frameworks
- Theorem 3.1 states that if modality encoders are perfectly aligned in feature space, joint mutual information is maximized
- Visualized constructing latent structures
- Visualized two-tower-based and fusion-based models
- Additional experimental results show all three regularizations improve performance