Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposes an efficient method to ground pretrained text-only language models to the visual domain
Leverages abilities of language models learnt from large scale text-only pretraining
Keeps language model frozen and finetunes input and output linear layers to enable cross-modality interactions
Achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue

Paper Content

Introduction

LLMs are trained on text-only data and lack visual cues
LLMs have limitations on tasks involving visual reasoning and grounding
Propose a method to bootstrap a frozen LLM for processing and outputting multimodal data
Model is efficient and requires less compute than existing models
Model is capable of generating coherent multimodal outputs
Model is capable of processing arbitrarily interleaved image-text inputs
Model retains original abilities of text-only LLM to generate text
Model attains new multimodal dialogue and reasoning abilities
Approach is model-agnostic and can be applied to larger or stronger LLMs
Model is more accurate on long and complex free-form text than existing models

Method

Model architecture

Language model uses a byte-level BPE tokenizer to extract a sequence of input tokens from text.
Visual model uses a pretrained visual backbone model to extract visual embeddings from an input image.

Translating between image-and-text

Learn translation parameters to map between image and text embedding spaces
Learn a linear mapping to map visual embeddings from the visual model
Append [RET] token to model vocabulary to improve image-text retrieval performance
Train linear mappings to map hidden representation of [RET] and visual embeddings into same retrieval space
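
The image-to-text direction described above can be illustrated with a minimal sketch: a single linear layer turns one visual embedding into k vectors that sit in the language model's input embedding space (a "visual prefix"). The function names, the random initialization, and the toy sizes (m = 8, k = 2, d = 4) are assumptions for illustration only, and plain Python lists stand in for tensors.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create a toy linear layer: weight matrix (out_dim x in_dim) and zero bias."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def apply_linear(layer, x):
    """Compute W x + b for a single input vector x."""
    weight, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def visual_prefix(layer, visual_embedding, k, d):
    """Map one visual embedding to k vectors of size d."""
    flat = apply_linear(layer, visual_embedding)  # length k * d
    return [flat[i * d:(i + 1) * d] for i in range(k)]

rng = random.Random(0)
m, k, d = 8, 2, 4  # toy sizes, not the paper's dimensions
caption_layer = make_linear(m, k * d, rng)
image_features = [rng.random() for _ in range(m)]  # stand-in for visual backbone output
prefix = visual_prefix(caption_layer, image_features, k, d)
print(len(prefix), len(prefix[0]))
```

In this sketch only `caption_layer` would be trained; the visual backbone and the language model that consumes the prefix stay frozen, as the summary describes.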

Training setup

Train FROMAGe with multi-task objective of image captioning and image-text retrieval
Image captioning is generating text tokens conditioned on visual prefix
Image-text retrieval used to learn joint visual and language representations
Minimize InfoNCE loss for text-to-image and image-to-text retrieval
Final training loss is weighted sum of captioning and retrieval losses
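
The retrieval objective and the weighted sum above can be sketched as follows. This is an illustrative implementation of a symmetric InfoNCE loss over a batch of paired text and image embeddings, not the authors' code. The fixed temperature of 0.07, the default weights of 1.0, and the choice to add the two retrieval directions together are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def info_nce(text_embs, image_embs, temperature=0.07):
    """Return (text-to-image loss, image-to-text loss) for paired embeddings."""
    text = [normalize(t) for t in text_embs]
    image = [normalize(v) for v in image_embs]
    sims = [[sum(a * b for a, b in zip(t, v)) / temperature for v in image]
            for t in text]
    t2i = cross_entropy_diagonal(sims)
    i2t = cross_entropy_diagonal([list(col) for col in zip(*sims)])
    return t2i, i2t

def total_loss(caption_loss, t2i, i2t, w_caption=1.0, w_retrieval=1.0):
    """Weighted sum of the captioning loss and the retrieval losses."""
    return w_caption * caption_loss + w_retrieval * (t2i + i2t)

# Matched pairs should give a much lower loss than mismatched pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
good_t2i, good_i2t = info_nce(texts, [[1.0, 0.0], [0.0, 1.0]])
bad_t2i, bad_i2t = info_nce(texts, [[0.0, 1.0], [1.0, 0.0]])
print(good_t2i < bad_t2i)
```

Each row of the similarity matrix treats the other items in the batch as negatives, which is why the loss falls when every text embedding is closest to its own image.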

Data and implementation details

Trained on Conceptual Captions (CC3M) dataset with 3.3 million image-text pairs
Randomly concatenate distinct examples together to encourage model to attend to images
Use publicly available OPT model with 6.7B parameters
Use pretrained CLIP ViT-L/14 model for visual representations
Models implemented in PyTorch v1.12 and trained in mixed precision with bfloat16
Batch size of 180, 1 epoch, Adam optimizer with learning rate of 0.0003
Visual prefix length of k = 1, retrieval embedding dimension q = 256, embedding dimension d = 1024
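
The random concatenation of distinct examples mentioned above might look like the sketch below. The pairing strategy, the `concat_prob` parameter and its default of 0.5, and the string placeholders for images are all assumptions made for illustration. The summary only states that distinct examples are concatenated at random.

```python
import random

def maybe_concatenate(examples, concat_prob=0.5, rng=None):
    """Walk through (image, caption) pairs and, with probability concat_prob,
    merge two neighbouring pairs into one interleaved sequence."""
    if rng is None:
        rng = random.Random()
    sequences = []
    i = 0
    while i < len(examples):
        if i + 1 < len(examples) and rng.random() < concat_prob:
            (img_a, cap_a), (img_b, cap_b) = examples[i], examples[i + 1]
            sequences.append([img_a, cap_a, img_b, cap_b])
            i += 2
        else:
            img, cap = examples[i]
            sequences.append([img, cap])
            i += 1
    return sequences

batch = [("<img0>", "a dog"), ("<img1>", "a boat"),
         ("<img2>", "a tree"), ("<img3>", "a city")]
always = maybe_concatenate(batch, concat_prob=1.0)
never = maybe_concatenate(batch, concat_prob=0.0)
print(len(always), len(never))
```

Because a merged sequence contains two unrelated images, the model has to attend to the right image for each caption instead of relying on text alone, which is the motivation given in the summary.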

Experiments

FROMAGe is useful for tasks with both image and text inputs and outputs, such as multimodal dialogue.
Evaluation of FROMAGe focuses on image retrieval and image-and-text generation tasks.

Visual dialogue

Evaluated FROMAGe on zero-shot Visual Dialog
Tested ability to select correct text answer and retrieve correct image
FROMAGe outperforms prior work on text-to-image task
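
Retrieving an image for a dialogue or caption context can be sketched as a nearest-neighbour search in the shared retrieval space: the mapped [RET] representation is the query and the mapped visual embeddings are the candidates. The helper names, the use of cosine similarity for ranking, and the toy 2-dimensional embeddings are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, candidates, top_k=1):
    """Rank candidate images (id -> embedding) by similarity to the query."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query_embedding, candidates[name]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings standing in for mapped visual embeddings.
image_bank = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [0.7, 0.7]}
ret_embedding = [0.9, 0.1]  # stand-in for the mapped [RET] hidden state
print(retrieve(ret_embedding, image_bank, top_k=2))
```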

Qualitative results

FROMAGe is capable of learning in-context to perform zero-shot and few-shot tasks
FROMAGe can produce interleaved images and texts as outputs
FROMAGe can hold multimodal dialogue conversations
FROMAGe can refine input images by compositing image and text concepts
FROMAGe can answer questions that require specific real world facts

Analysis

Analyzed various aspects of FROMAGe
Trained models on CC3M for 24 hours on a single A6000 GPU

Ablation experiments

Performed ablation experiments to validate design choices in FROMAGe
Freezing the language model is essential to retain in-context learning and few-shot generalization
Finetuning decreases retrieval performance on VIST and VisDial
Adding special [RET] token improves retrieval accuracy by 38.1%

The effect of context

Multimodal context helps
Increasing context from 1 to 5 captions improves retrieval performance by 29%
Adding 1 image and 2 captions improves retrieval performance by 42%
Performance steadily improves on image retrieval as more image and caption context is provided
FROMAGe outperforms CLIP in all settings

In-context learning and text generation

FROMAGe uses a frozen LLM as its backbone
It is capable of in-context learning
Generates new stories for VIST
When prompted with full multimodal context, model is able to learn in-context
Human evaluations study the effect of multimodal context on model-generated stories
Model-generated stories are rated as more relevant to the image inputs than in the text-only setting

Conclusion

Proposed a method to visually ground pretrained frozen language models
Model, FROMAGe, is capable of producing coherent interleaved image-text outputs
Showed strong zero-shot performance on a variety of tasks involving image-text inputs and outputs
Model contains knowledge about the world from pretraining
Model is capable of reasoning about input images and responding with semantically appropriate images
CM3 is the only prior work that can consume and produce arbitrarily interleaved images and text
FROMAGe is more computationally efficient than CM3
Ablation study showed that freezing the language model is important to retain the abilities learnt from pretraining
Concatenating distinct examples during training was found to be helpful for downstream tasks
Experiments with larger language models showed improved performance
Human evaluations showed that generated outputs were better when conditioned on both images and text
