Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposes an efficient method to ground pretrained text-only language models to the visual domain
- Leverages abilities of language models learnt from large scale text-only pretraining
- Keeps language model frozen and finetunes input and output linear layers to enable cross-modality interactions
- Achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue
Paper Content
Introduction
- LLMs are trained on text-only data and lack visual cues
- LLMs have limitations on tasks involving visual reasoning and grounding
- Propose a method to bootstrap a frozen LLM for processing and outputting multimodal data
- Model is efficient and requires less compute than existing models
- Model is capable of generating coherent multimodal outputs
- Model is capable of processing arbitrarily interleaved image-text inputs
- Model retains original abilities of text-only LLM to generate text
- Model attains new multimodal dialogue and reasoning abilities
- Model is model agnostic and can be applied to larger or stronger LLMs
- Model is more accurate on long and complex free-form text than existing models
Method
Model architecture
- Language model uses a byte-level BPE tokenizer to extract a sequence of input tokens from text.
- Visual model uses a pretrained visual backbone model to extract visual embeddings from an input image.
Translating between image-and-text
- Learn translation parameters to map between image and text embedding spaces
- Learn a linear mapping to map visual embeddings from the visual model
- Append [RET] token to model vocabulary to improve image-text retrieval performance
- Train linear mappings to map hidden representation of [RET] and visual embeddings into same retrieval space
Training setup
- Train FROMAGe with multi-task objective of image captioning and image-text retrieval
- Image captioning is generating text tokens conditioned on visual prefix
- Image-text retrieval used to learn joint visual and language representations
- Minimize InfoNCE loss for text-to-image and image-to-text retrieval
- Final training loss is weighted sum of captioning and retrieval losses
Data and implementation details
- Trained on Conceptual Captions (CC3M) dataset with 3.3 million image-text pairs
- Randomly concatenate distinct examples together to encourage model to attend to images
- Use publicly available OPT model with 6.7B parameters
- Use pretrained CLIP ViT-L/14 model for visual representations
- Models implemented in PyTorch v1.12 and trained mixed-precision with bfloat16
- Batch size of 180, 1 epoch, Adam optimizer with learning rate of 0.0003
- Visual prefix length of k = 1, retrieval embedding dimension q = 256, embedding dimension d = 1024
Experiments
- FROMAGe is useful for tasks with both image and text inputs and outputs, such as multimodal dialogue.
- Evaluation of FROMAGe focuses on image retrieval and image-and-text generation tasks.
Visual dialogue
- Evaluated FROMAGe on zero-shot Visual Dialog
- Tested ability to select correct text answer and retrieve correct image
- FROMAGe outperforms prior work on text-to-image task
Qualitative results
- FROMAGe is capable of learning in-context to perform zero-shot and few-shot tasks
- FROMAGe can produce interleaved images and texts as outputs
- FROMAGe can hold multimodal dialogue conversations
- FROMAGe can refine input images by compositing images and text concepts
- FROMAGe can answer questions that require specific real world facts
Analysis
- Analyzed various aspects of FROMAGe
- Trained models on CC3M for 24 hours on a single A6000 GPU
Ablation experiments
- Performed ablation experiments to validate design choices in FROMAGe
- Freezing language model is essential to retain in-context learning and fewshot generalization
- Finetuning decreases retrieval performance on VIST and VisDial
- Adding special [RET] token improves retrieval accuracy by 38.1%
The effect of context
- Multimodal context helps
- Increasing context from 1 to 5 captions improves model by 29%
- Adding 1 image and 2 captions improves model by 42%
- Performance steadily improves on image retrieval as more image and caption context is provided
- FROMAGe outperforms CLIP in all settings
In-context learning and text generation
- FROMAGe uses a frozen LLM as its backbone
- It is capable of in-context learning
- Generates new stories for VIST
- When prompted with full multimodal context, model is able to learn in-context
- Human evaluations to study effect of multimodal context on model generated stories
- Model generated story is rated as more relevant to image inputs compared to text-only setting
Conclusion
- Proposed a method to visually ground pretrained frozen language models
- Model, FROMAGe, is capable of producing coherent interleaved image-text outputs
- Showed strong zero-shot performance on a variety of tasks involving imagetext inputs and outputs
- Model contains knowledge about the world from pretraining
- Model is capable of reasoning about input images and responding with semantically appropriate images
- CM3 is the only prior work that can consume and produce arbitrarily interleaved images and text
- FROMAGe is more computationally efficient than CM3
- Ablation study showed that freezing the language model is important to retain the abilities learnt from pretraining
- Concatenating distinct examples during training was found to be helpful for downstream tasks
- Experiments with larger language models showed improved performance
- Human evaluations showed that generated outputs were better when conditioned on both images and text