Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Designed and trained a Generative Image-to-text Transformer (GIT) to unify vision-language tasks
  • Simplified architecture with one image encoder and one text decoder
  • Scaled up pre-training data and model size to boost performance
  • Established new state-of-the-art results on 12 challenging benchmarks
  • Surpassed human performance on TextCaps
  • Presented a new scheme of generation-based image classification and scene text recognition

Paper Content

Introduction

  • Pre-training on large-scale image-text pairs
  • Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks used
  • Task-specific adaptation needed
  • Unified generative models for pre-training
  • Multi-modal encoder and text decoder with careful design
  • Generative Image-to-text Transformer (GIT) proposed
  • GIT achieves new state-of-the-art results across numerous challenging benchmarks
  • Image encoder is a Swin-like vision transformer
  • Text decoder is a transformer network
  • Language modeling task used for pre-training
  • New generation-based scheme for ImageNet classification proposed

Network architecture

  • Image encoder based on contrastive pre-trained model
  • Input is raw image, output is 2D feature map
  • Extra linear layer and layernorm layer to project image features into D dimensions
  • Text decoder is transformer module with self-attention layer and feed-forward layer
  • Text tokenized and embedded into D dimensions
  • Image features concatenated with text embeddings as input to transformer module
  • Text decoder randomly initialized
  • Alternative architecture is cross-attention-based decoder
  • Self-attention-based decoder better with large-scale pre-training
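
The input construction described in this list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random weights, and the helper names (`linear`, `layer_norm`, `build_decoder_input`) are all hypothetical, and positional encodings and the attention mask are left out.

```python
import math
import random

def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def build_decoder_input(feature_map, token_ids, proj_w, proj_b, embedding):
    """Flatten the 2D feature map, project it to D dims, and prepend it to the text embeddings."""
    # Flatten the H x W grid of C-dim features into a list of H*W vectors.
    image_tokens = [cell for row in feature_map for cell in row]
    # Extra linear layer followed by layernorm, as in the bullet list above.
    image_tokens = [layer_norm(linear(f, proj_w, proj_b)) for f in image_tokens]
    # Look up a D-dim embedding for every text token id.
    text_tokens = [embedding[t] for t in token_ids]
    # Image features come first, text embeddings follow.
    return image_tokens + text_tokens

random.seed(0)
C, D, H, W, VOCAB = 8, 4, 2, 3, 10  # toy sizes, chosen only for illustration

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

feature_map = [[rand_vec(C) for _ in range(W)] for _ in range(H)]
proj_w = [rand_vec(C) for _ in range(D)]
proj_b = [0.0] * D
embedding = [rand_vec(D) for _ in range(VOCAB)]

sequence = build_decoder_input(feature_map, [1, 5, 7], proj_w, proj_b, embedding)
print(len(sequence), len(sequence[0]))  # 9 positions (6 image + 3 text), each of size 4
```

The point of the sketch is only the shape bookkeeping: both modalities end up as vectors of the same size D, so a single transformer module can attend over the joint sequence.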

Pre-training

  • Model is trained using language modeling (LM) loss
  • Alternative choice is MLM, which predicts 15% of input tokens
  • LM can predict all tokens, which is more efficient for large-scale pre-training data
  • Number of epochs is limited to 2 due to computational resource limitation
  • Model is architecturally similar to GPT-3
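
As a rough illustration of the LM objective, the sketch below averages the cross-entropy over every target position, including the end-of-sequence token. It assumes that `step_logits[i]` has already been produced by a model conditioned on the image and the preceding tokens; the model itself is not shown, and the function names and toy numbers are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def lm_loss(step_logits, targets):
    """Mean cross-entropy over all target positions (every token is supervised)."""
    assert len(step_logits) == len(targets)
    total = 0.0
    for logits, target in zip(step_logits, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Toy example: vocabulary of 3 tokens, a 2-token caption followed by [EOS] (id 2).
targets = [0, 1, 2]
confident_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
uniform_logits = [[0.0, 0.0, 0.0]] * 3

loss = lm_loss(confident_logits, targets)
uniform = lm_loss(uniform_logits, targets)  # equals log(3) for a uniform prediction
print(loss < uniform)
```

Under MLM only the masked positions (about 15% of the tokens) would contribute terms to this sum, which is why LM extracts more training signal from each pass over the data.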

Fine-tuning

  • Applied same LM task to fine-tune GIT for image captioning
  • For VQA, question and answer concatenated as new caption during fine-tuning
  • Generative approach chosen, in contrast to existing discriminative work
  • No OCR engine used, model learns to read scene text with pre-training
  • Simple architecture change for video domain
  • Generation model applied to image classification task
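
The VQA formulation can be sketched as follows: the question and answer are joined into one token sequence and trained with the same next-token objective. The loss mask that restricts supervision to the answer and end-of-sequence tokens is an assumption of this sketch, as are the function name and the token ids.

```python
def build_vqa_example(question_ids, answer_ids, bos_id, eos_id):
    """Concatenate question and answer into one caption-like training sequence."""
    tokens = [bos_id] + question_ids + answer_ids + [eos_id]
    inputs = tokens[:-1]   # what the decoder sees at each step
    targets = tokens[1:]   # what it should predict at each step
    # Assumed here: only answer tokens and [EOS] contribute to the loss,
    # since the question is given at inference time.
    loss_mask = [0] * len(question_ids) + [1] * (len(answer_ids) + 1)
    return inputs, targets, loss_mask

# Hypothetical token ids, for illustration only.
inputs, targets, loss_mask = build_vqa_example(
    question_ids=[11, 12, 13], answer_ids=[21, 22], bos_id=1, eos_id=2
)
print(inputs)     # [1, 11, 12, 13, 21, 22]
print(targets)    # [11, 12, 13, 21, 22, 2]
print(loss_mask)  # [0, 0, 0, 1, 1, 1]
```

At inference time the question would serve as the caption prefix and the model would generate tokens until it emits the end-of-sequence token.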

Experiments

Setting

  • 0.8B image-text pairs used for pre-training
  • Image encoder initialized from pre-trained contrastive model
  • Text decoder consists of 6 randomly-initialized transformer blocks
  • 0.7 billion model parameters
  • Learning rates of image encoder and decoder are 1e-5 and 5e-5 respectively
  • Total number of epochs is 2
  • Beam size is 4 and length penalty is 0.6 during inference
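
To show what the two inference settings control, here is a small sketch of hypothesis ranking with a length penalty. The summary gives only the values (beam size 4, length penalty 0.6); the scoring formula used here, sum of log-probabilities divided by length raised to the penalty, is one common convention and is an assumption, as are the function names and the toy hypotheses.

```python
def length_normalized_score(token_logprobs, length_penalty=0.6):
    """Score a hypothesis as sum(log p) / len**length_penalty (assumed convention)."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

def select_beams(hypotheses, beam_size=4, length_penalty=0.6):
    """Keep the beam_size best hypotheses by length-normalized score."""
    ranked = sorted(
        hypotheses,
        key=lambda h: length_normalized_score(h, length_penalty),
        reverse=True,
    )
    return ranked[:beam_size]

short = [-0.5, -0.5]            # 2 tokens, total log-prob -1.0
longer = [-0.3, -0.3, -0.3, -0.3]  # 4 tokens, total log-prob -1.2

raw_prefers_short = sum(short) > sum(longer)
normalized_prefers_long = (
    length_normalized_score(longer) > length_normalized_score(short)
)
best = select_beams([short, longer], beam_size=1)[0]
print(raw_prefers_short, normalized_prefers_long)
```

Without normalization the raw log-probability sum favors the shorter hypothesis; with the penalty the longer, per-token more confident hypothesis ranks first.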

Results on image captioning and question answering

  • Evaluated captioning performance on Karpathy split of COCO and Flickr30K
  • Evaluated on nocaps, TextCaps, and VizWiz-Captions
  • Achieved new SOTA performance on all metrics except COCO Karpathy test
  • Model size is 0.7B, higher performance than CoCa (2.1B)
  • Outperformed previous SOTA on TextCaps by 28.5 points in CIDEr
  • In few-shot evaluation, performance improved significantly with more shots
  • Achieved new SOTA on VizWiz-VQA and OCR-VQA
  • Same performance as prior SOTA on ST-VQA
  • Higher accuracy on TextVQA, lower on VQAv2 than Flamingo
  • Model performs worse than discriminative model of Florence on VQAv2

Results on video captioning and question answering

  • Performance is evaluated on MSVD, MSRVTT, YouCook2, VATEX, and TVC
  • Results are shown in Table 5 and Table 6 for captioning and QA
  • Model is not as complex as Tang et al. (2021)
  • Model is better than Flamingo (Alayrac et al., 2022)

Results on image classification

  • GIT is fine-tuned on ImageNet-1k
  • GIT can achieve decent accuracy without pre-defining the vocabulary
  • GIT is worse than Florence by 1.2 points
  • GIT accuracy is 1.93% when exact match is required
  • GIT accuracy is 40.88% when prediction contains ground-truth
  • GIT accuracy is 33.48% when output tokens are limited to vocabulary
  • GIT accuracy is improved with 1 or 5 shots per category
  • GIT accuracy is higher than Flamingo
  • GIT requires lightweight fine-tuning once and no training shots during inference
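
The two scoring criteria mentioned above (exact match versus the prediction containing the ground-truth label) can be sketched like this. The lowercase and whitespace normalization, the function names, and the example strings are assumptions made for illustration and are not taken from the paper.

```python
def exact_match(prediction, label):
    """Correct only if the generated text equals the label name."""
    return prediction.strip().lower() == label.strip().lower()

def contains_label(prediction, label):
    """Correct if the label name appears anywhere in the generated text."""
    return label.strip().lower() in prediction.strip().lower()

def accuracy(predictions, labels, criterion):
    """Fraction of predictions judged correct under the given criterion."""
    hits = sum(criterion(p, l) for p, l in zip(predictions, labels))
    return hits / len(labels)

# Hypothetical generated captions and ground-truth class names.
predictions = ["a golden retriever", "tabby cat", "espresso"]
labels = ["golden retriever", "tabby cat", "coffee mug"]

acc_exact = accuracy(predictions, labels, exact_match)
acc_contains = accuracy(predictions, labels, contains_label)
print(acc_exact, acc_contains)
```

Because a free-form generator may wrap the label in extra words, the relaxed criterion can score much higher than exact match, which is consistent with the gap between the two figures reported above.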

Results on scene text recognition

  • Task aims to read scene text directly from image
  • Evaluated in two settings
  • Prediction is correct if caption is exact match to ground-truth
  • Model evaluated on 6 standard benchmarks
  • TextCaps-fine-tuned captioning model achieves 89.9 accuracy
  • GIT achieves 92.9, surpassing the prior state of the art of 91.9

Analysis

  • Constructed two smaller pre-training datasets
  • Used 30 epochs for small-scale datasets
  • Full-size model is named Huge
  • Image encoder replaced with ViT-B/16 and ViT-L/14
  • On COCO, the Base model's performance drops with 0.8B data
  • On TextCaps, all model variants benefit from more pre-training data
  • Difficult to effectively scale up text decoder

Scene text in pre-training data

  • 15% of CC12M images contain scene text descriptions
  • 31% of downloaded images contain scene text descriptions
  • OCR result must be longer than 5 characters to be considered matched
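
A matching rule of this kind might look like the sketch below. Only the length threshold (longer than 5 characters) comes from the text above; treating a match as a case-insensitive substring of the caption is an assumption, and the function names and examples are hypothetical.

```python
def has_scene_text_description(caption, ocr_strings, min_chars=5):
    """True if some OCR string longer than min_chars appears in the caption."""
    caption_lower = caption.lower()
    return any(
        len(text) > min_chars and text.lower() in caption_lower
        for text in ocr_strings
    )

def scene_text_fraction(samples):
    """Share of (caption, ocr_strings) pairs whose caption describes scene text."""
    matched = sum(has_scene_text_description(c, o) for c, o in samples)
    return matched / len(samples)

samples = [
    ("A sign that says Starbucks Coffee", ["Starbucks"]),  # matched
    ("A stop sign at a crossing", ["stop"]),               # too short to count
    ("A red car parked outside", ["Toyota"]),              # not in the caption
    ("Poster for the Glastonbury festival", ["GLASTONBURY", "2019"]),  # matched
]
fraction = scene_text_fraction(samples)
print(fraction)  # 0.5
```

The length threshold filters out short OCR strings such as "stop" that could match a caption by coincidence.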

Conclusion

  • Design and train a generative model to map input image to associated text description
  • Model achieves new state-of-the-art performance on image/video captioning and question answering tasks
  • Model surpasses human performance on TextCaps
  • Model used to predict label name directly
  • Limitations: it is unclear how to control the generated caption and how to perform in-context learning without parameter updates
  • Improved performance makes the model better suited to helping visually impaired people
  • Data preprocessing is faster than training and is overlapped with GPU training
  • Model is fine-tuned with 10 epochs, batch size of 512 and learning rate of 2.5e-6
  • Model achieves competitive performance with existing approaches
  • Model can identify novel objects without object tags as network input
  • Model is knowledgeable and can produce diverse and informative captions
  • Model is used to caption scene text and recognize novel concepts