
  • Proposes an efficient method to ground pretrained text-only language models to the visual domain
  • Leverages abilities of language models learnt from large scale text-only pretraining
  • Keeps language model frozen and finetunes input and output linear layers to enable cross-modality interactions
  • Achieves strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue

Paper Content


Introduction

  • LLMs are trained on text-only data and lack visual cues
  • LLMs have limitations on tasks involving visual reasoning and grounding
  • Propose a method to bootstrap a frozen LLM for processing and outputting multimodal data
  • Model is efficient and requires less compute than existing models
  • Model is capable of generating coherent multimodal outputs
  • Model is capable of processing arbitrarily interleaved image-text inputs
  • Model retains original abilities of text-only LLM to generate text
  • Model attains new multimodal dialogue and reasoning abilities
  • Approach is model-agnostic and can be applied to larger or stronger LLMs
  • Model is more accurate on long and complex free-form text than existing models


Model architecture

  • Language model uses a byte-level BPE tokenizer to extract a sequence of input tokens from text.
  • Visual model uses a pretrained visual backbone model to extract visual embeddings from an input image.

Translating between image-and-text

  • Learn translation parameters to map between image and text embedding spaces
  • Learn a linear mapping that projects visual embeddings from the visual model into the language model's input embedding space
  • Append a special [RET] token to the model vocabulary to improve image-text retrieval performance
  • Train linear mappings that project the hidden representation of [RET] and the visual embeddings into the same retrieval space
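
The mappings listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names w_c, w_t and w_i, the toy sizes, the random weights, and the use of cosine similarity as the retrieval score are all chosen for the example.

```python
import math
import random

def linear(x, weight):
    """Compute x @ weight for a vector x (length m) and a matrix with m rows."""
    n_out = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(n_out)]

def to_visual_prefix(visual_emb, w_c, k, d):
    """Map one visual embedding to k vectors of size d (a 'visual prefix')."""
    flat = linear(visual_emb, w_c)  # length k * d
    return [flat[i * d:(i + 1) * d] for i in range(k)]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights (hypothetical initialisation)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
m, d, k, q = 8, 6, 2, 4  # toy sizes: visual dim, LM dim, prefix length, retrieval dim

w_c = random_matrix(m, k * d, rng)  # visual embedding -> LM input space
w_i = random_matrix(m, q, rng)      # visual embedding -> retrieval space
w_t = random_matrix(d, q, rng)      # [RET] hidden state -> retrieval space

image_emb = [rng.gauss(0.0, 1.0) for _ in range(m)]   # output of the visual model
ret_hidden = [rng.gauss(0.0, 1.0) for _ in range(d)]  # LM hidden state at [RET]

prefix = to_visual_prefix(image_emb, w_c, k, d)  # fed to the frozen LM as input
score = cosine(linear(ret_hidden, w_t), linear(image_emb, w_i))
```

Only the three small matrices would be trained in this sketch; the language model and visual backbone that produce `ret_hidden` and `image_emb` stay untouched, which matches the frozen-model setup described in the summary.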

Training setup

  • Train FROMAGe with multi-task objective of image captioning and image-text retrieval
  • Image captioning: generate text tokens conditioned on a visual prefix
  • Image-text retrieval: learn joint visual and language representations
  • Minimize InfoNCE loss for text-to-image and image-to-text retrieval
  • Final training loss is weighted sum of captioning and retrieval losses
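
The retrieval objective and the weighted sum can be sketched as follows. This is an illustrative sketch rather than the paper's code: the fixed temperature of 0.07, the equal averaging of the two retrieval directions, and the default weights of 1.0 are assumptions, and `sim[i][j]` is taken to be the similarity between text i and image j, with matching pairs on the diagonal.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one list of scores."""
    mx = max(row)
    lse = mx + math.log(sum(math.exp(v - mx) for v in row))
    return [v - lse for v in row]

def info_nce(sim):
    """Mean over rows i of -log softmax(sim[i])[i] (positives on the diagonal)."""
    n = len(sim)
    return -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n

def transpose(mat):
    return [list(col) for col in zip(*mat)]

def retrieval_loss(sim, temperature=0.07):
    """Average of text-to-image and image-to-text InfoNCE (assumed weighting)."""
    scaled = [[v / temperature for v in row] for row in sim]
    t2i = info_nce(scaled)             # each text picks its image
    i2t = info_nce(transpose(scaled))  # each image picks its text
    return 0.5 * (t2i + i2t)

def total_loss(caption_loss, retr_loss, lambda_c=1.0, lambda_r=1.0):
    """Weighted sum of the captioning and retrieval losses."""
    return lambda_c * caption_loss + lambda_r * retr_loss
```

With an uninformative similarity matrix (all scores equal) over n pairs, each direction's loss equals log n, which gives a quick sanity check when wiring this into a training loop.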

Data and implementation details

  • Trained on Conceptual Captions (CC3M) dataset with 3.3 million image-text pairs
  • Randomly concatenate distinct examples together to encourage model to attend to images
  • Use publicly available OPT model with 6.7B parameters
  • Use pretrained CLIP ViT-L/14 model for visual representations
  • Models implemented in PyTorch v1.12 and trained in mixed precision with bfloat16
  • Batch size of 180, 1 epoch, Adam optimizer with learning rate of 0.0003
  • Visual prefix length of k = 1, retrieval embedding dimension q = 256, embedding dimension d = 4096 (the hidden size of OPT-6.7B)
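
The random concatenation of distinct examples could look roughly like the sketch below. The function name `concat_examples`, the probability `p`, and the `("image", ...)` / `("text", ...)` sequence representation are hypothetical choices for illustration; the summary does not specify how the authors implement this step.

```python
import random

def concat_examples(batch, p=0.5, rng=None):
    """With probability p, append a different example from the batch to each example.

    Each input example is an (image_id, caption) pair. Each output is an
    interleaved sequence of ("image", id) and ("text", caption) items.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for idx, (img, cap) in enumerate(batch):
        seq = [("image", img), ("text", cap)]
        if len(batch) > 1 and rng.random() < p:
            # Pick a distinct partner so the two captions refer to different images.
            other = rng.choice([j for j in range(len(batch)) if j != idx])
            other_img, other_cap = batch[other]
            seq += [("image", other_img), ("text", other_cap)]
        out.append(seq)
    return out

batch = [("img0", "a dog"), ("img1", "a cat"), ("img2", "a bird")]
merged = concat_examples(batch, p=1.0, rng=random.Random(0))
```

Because a merged sequence contains two images, the model cannot caption the second one correctly from text context alone, which is one plausible reading of why this encourages attending to the images.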


Experiments

  • FROMAGe is useful for tasks with both image and text inputs and outputs, such as multimodal dialogue.
  • Evaluation of FROMAGe focuses on image retrieval and image-and-text generation tasks.

Visual dialogue

  • Evaluated FROMAGe on zero-shot Visual Dialog
  • Tested ability to select correct text answer and retrieve correct image
  • FROMAGe outperforms prior work on the text-to-image retrieval task
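
Retrieval tasks of this kind are commonly scored with Recall@k, the fraction of queries whose correct item appears among the top k candidates. The sketch below is a generic illustration of that metric with made-up scores; it is not taken from the paper's evaluation code, and the function name and data layout are assumptions.

```python
def recall_at_k(similarity, targets, k):
    """Fraction of queries whose target index is among the k highest-scoring candidates.

    similarity[i][j] is the score of candidate j for query i;
    targets[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, targets):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += target in ranked[:k]
    return hits / len(targets)

# Toy scores for three queries over three candidate images (illustrative only).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.5],
    [0.1, 0.8, 0.3],
]
targets = [0, 1, 2]
r1 = recall_at_k(sim, targets, k=1)
r2 = recall_at_k(sim, targets, k=2)
```

In this toy case only the first query ranks its target first, while all three targets fall within the top two candidates.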

Qualitative results

  • FROMAGe is capable of learning in-context to perform zero-shot and few-shot tasks
  • FROMAGe can produce interleaved images and texts as outputs
  • FROMAGe can hold multimodal dialogue conversations
  • FROMAGe can refine input images by compositing images and text concepts
  • FROMAGe can answer questions that require specific real world facts


Analysis

  • Analyzed various aspects of FROMAGe
  • Trained models on CC3M for 24 hours on a single A6000 GPU

Ablation experiments

  • Performed ablation experiments to validate design choices in FROMAGe
  • Freezing the language model is essential to retain in-context learning and few-shot generalization
  • Finetuning the language model decreases retrieval performance on VIST and VisDial
  • Adding special [RET] token improves retrieval accuracy by 38.1%

The effect of context

  • Multimodal context helps image retrieval
  • Increasing context from 1 to 5 captions improves retrieval performance by 29%
  • Adding 1 image and 2 captions improves retrieval performance by 42%
  • Performance steadily improves on image retrieval as more image and caption context is provided
  • FROMAGe outperforms CLIP in all settings

In-context learning and text generation

  • FROMAGe uses a frozen LLM as its backbone
  • It is capable of in-context learning
  • Generates new stories for VIST
  • When prompted with full multimodal context, model is able to learn in-context
  • Human evaluations were used to study the effect of multimodal context on model-generated stories
  • Model-generated stories are rated as more relevant to the image inputs than in the text-only setting


  • Proposed a method to visually ground pretrained frozen language models
  • Model, FROMAGe, is capable of producing coherent interleaved image-text outputs
  • Showed strong zero-shot performance on a variety of tasks involving image-text inputs and outputs
  • Model contains knowledge about the world from pretraining
  • Model is capable of reasoning about input images and responding with semantically appropriate images
  • CM3 is the only prior work that can consume and produce arbitrarily interleaved images and text
  • FROMAGe is more computationally efficient than CM3
  • Ablation study showed that freezing the language model is important to retain the abilities learnt from pretraining
  • Concatenating distinct examples during training was found to be helpful for downstream tasks
  • Experiments with larger language models showed improved performance
  • Human evaluations showed that generated outputs were better when conditioned on both images and text