Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Designed and trained a Generative Image-to-text Transformer (GIT) to unify vision-language tasks
Simplified architecture with one image encoder and one text decoder
Scaled up pre-training data and model size to boost performance
Established new state-of-the-art results on 12 challenging benchmarks
Surpassed human performance on TextCaps
Presented a new scheme of generation-based image classification and scene text recognition

Paper Content

Introduction

Pre-training on large-scale image-text pairs
Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks used
Task-specific adaptation needed
Unified generative models for pre-training
Multi-modal encoder and text decoder with careful design
Generative Image-to-text Transformer (GIT) proposed
GIT achieves new state-of-the-art results across numerous challenging benchmarks
Image encoder is a Swin-like vision transformer
Text decoder is a transformer network
Language modeling task used for pre-training
New generation-based scheme for ImageNet classification proposed

Network architecture

Image encoder based on contrastive pre-trained model
Input is raw image, output is 2D feature map
Extra linear layer and layernorm layer to project image features into D dimensions
Text decoder is transformer module with self-attention layer and feed-forward layer
Text tokenized and embedded into D dimensions
Image features concatenated with text embeddings as input to transformer module
Text decoder randomly initialized
Alternative architecture is cross-attention-based decoder
Self-attention-based decoder performs better with large-scale pre-training
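
The concatenation step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the mask assumes a seq2seq-style scheme in which every position sees all image tokens while text tokens see only themselves and earlier text tokens.

```python
from typing import List


def concat_inputs(image_features: List[List[float]],
                  text_embeddings: List[List[float]]) -> List[List[float]]:
    """Join D-dimensional image features and text embeddings along the sequence axis."""
    dims = {len(vector) for vector in image_features + text_embeddings}
    if len(dims) != 1:
        raise ValueError("all vectors must share the same dimension D")
    return image_features + text_embeddings


def build_attention_mask(num_image_tokens: int, num_text_tokens: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (assumed scheme)."""
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position, image or text, can see all image tokens.
                mask[i][j] = True
            elif i >= num_image_tokens and j <= i:
                # Text tokens see themselves and earlier text tokens only.
                mask[i][j] = True
    return mask
```

With 2 image tokens and 3 text tokens, the first text token attends to both image tokens and itself, while image tokens never attend to text.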

Pre-training

Model is trained using language modeling (LM) loss
Alternative choice is MLM, which typically predicts 15% of input tokens per iteration
LM predicts all tokens in each iteration, which is more efficient for large-scale pre-training data
Number of epochs is limited to 2 because of computational resource constraints
Model is architecturally similar to GPT-3
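
To make the LM objective concrete, the sketch below computes an average cross-entropy over every target token of a caption. It is an illustration under assumptions: `lm_loss` is a hypothetical helper, the logits are plain Python lists rather than tensors, and no label smoothing or padding handling is included.

```python
import math
from typing import List


def lm_loss(logits: List[List[float]], targets: List[int]) -> float:
    """Average cross-entropy over all target tokens (one logit row per position)."""
    if len(logits) != len(targets) or not targets:
        raise ValueError("need one non-empty logit row per target token")
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_normalizer = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_normalizer - row[target]
    return total / len(targets)
```

Because every position contributes to the loss, a 100-token caption yields 100 supervised predictions per pass, whereas masking 15% of tokens would yield about 15.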

Fine-tuning

Applied same LM task to fine-tune GIT for image captioning
For VQA, question and answer concatenated as new caption during fine-tuning
Generative approach chosen, unlike existing discriminative work
No OCR engine used; model learns to read scene text through pre-training
Simple architecture change for video domain
Generation model applied to image classification task
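
The VQA and classification formulations above both reduce to building a caption-like token sequence. The sketch below shows one way to do that. The special-token names and helper functions are hypothetical, and restricting the loss to answer tokens plus the end token is an assumption, not something stated in this summary.

```python
from typing import List, Tuple

BOS, EOS = "[BOS]", "[EOS]"  # hypothetical special-token names


def build_vqa_example(question_tokens: List[str],
                      answer_tokens: List[str]) -> Tuple[List[str], List[bool]]:
    """Concatenate question and answer into one caption-like training sequence."""
    tokens = [BOS] + question_tokens + answer_tokens + [EOS]
    # Assumption: only the answer tokens and EOS contribute to the LM loss.
    loss_mask = [False] * (1 + len(question_tokens)) + [True] * (len(answer_tokens) + 1)
    return tokens, loss_mask


def build_classification_example(class_name_tokens: List[str]) -> Tuple[List[str], List[bool]]:
    """Treat the class name as the caption to be generated."""
    tokens = [BOS] + class_name_tokens + [EOS]
    loss_mask = [False] + [True] * (len(class_name_tokens) + 1)
    return tokens, loss_mask
```

At inference time the question would serve as the prefix and the model would generate the answer token by token.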

Experiments

Setting

0.8B image-text pairs used for pre-training
Image encoder initialized from pre-trained contrastive model
Text decoder consists of 6 randomly-initialized transformer blocks
0.7 billion model parameters
Learning rates of image encoder and decoder are 1e-5 and 5e-5 respectively
Total number of epochs is 2
Beam size is 4 and length penalty is 0.6 during inference
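
The inference settings above can be illustrated with a small scoring sketch. The exact length-penalty formula is not given in this summary, so the code assumes one common convention: divide the summed log-probability by the sequence length raised to the penalty. Function names are hypothetical.

```python
from typing import List


def length_normalized_score(token_logprobs: List[float],
                            length_penalty: float = 0.6) -> float:
    """Score a hypothesis as sum(logprobs) / len**length_penalty (assumed convention)."""
    if not token_logprobs:
        raise ValueError("hypothesis must contain at least one token")
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


def top_beams(candidates: List[List[float]],
              beam_size: int = 4,
              length_penalty: float = 0.6) -> List[List[float]]:
    """Keep the beam_size best hypotheses under the length-normalized score."""
    ranked = sorted(candidates,
                    key=lambda logprobs: length_normalized_score(logprobs, length_penalty),
                    reverse=True)
    return ranked[:beam_size]
```

Under this convention a positive penalty lets a longer hypothesis outrank a shorter one even when its raw log-probability is lower, which discourages overly short captions.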

Results on image captioning and question answering

Evaluated captioning performance on Karpathy split of COCO and Flickr30K
Evaluated on nocaps, TextCaps, and VizWiz-Captions
Achieved new SOTA performance on all metrics except COCO Karpathy test
With 0.7B parameters, the model achieves higher performance than the larger CoCa (2.1B)
Outperformed previous SOTA on TextCaps by 28.5 points in CIDEr
Performance improved significantly as the number of shots increased
Achieved new SOTA on VizWiz-VQA and OCR-VQA
Same performance as prior SOTA on ST-VQA
Compared with Flamingo, accuracy is higher on TextVQA and lower on VQAv2
Model performs worse than the discriminative Florence model on VQAv2

Results on video captioning and question answering

Performance is evaluated on MSVD, MSRVTT, YouCook2, VATEX, and TVC
Results are shown in Table 5 and Table 6 for captioning and QA
Model is less complex than that of Tang et al. (2021)
Model is better than Flamingo (Alayrac et al., 2022)

Results on image classification

GIT is fine-tuned on ImageNet-1k
GIT can achieve decent accuracy without pre-defining the vocabulary
GIT is worse than Florence by 1.2 points
GIT accuracy is 1.93% when exact match is required
GIT accuracy is 40.88% when prediction contains ground-truth
GIT accuracy is 33.48% when output tokens are limited to vocabulary
GIT accuracy is improved with 1 or 5 shots per category
GIT accuracy is higher than Flamingo
GIT requires lightweight fine-tuning once and no training shots during inference
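
The three accuracy figures above correspond to different ways of judging a generated label. The sketch below illustrates them under assumptions: the helper names are hypothetical, whitespace and case normalization is an added simplification, and the vocabulary restriction is shown as a linear scan over label token sequences instead of an optimized structure such as a trie.

```python
from typing import Callable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, label: str) -> bool:
    return normalize(prediction) == normalize(label)


def contains_label(prediction: str, label: str) -> bool:
    return normalize(label) in normalize(prediction)


def accuracy(predictions: List[str], labels: List[str],
             criterion: Callable[[str, str], bool]) -> float:
    pairs = list(zip(predictions, labels))
    if not pairs:
        raise ValueError("need at least one prediction")
    return sum(criterion(p, l) for p, l in pairs) / len(pairs)


def allowed_next_tokens(prefix: List[str], label_sequences: List[List[str]]) -> Set[str]:
    """Tokens that keep the output inside the label vocabulary given the prefix."""
    allowed = set()
    for sequence in label_sequences:
        if sequence[:len(prefix)] == prefix:
            # A fully generated label may only be followed by the end token.
            allowed.add(sequence[len(prefix)] if len(prefix) < len(sequence) else "[EOS]")
    return allowed
```

A free-form output such as "a photo of a tabby cat" fails the exact-match criterion but passes the containment criterion, which is consistent with the large gap between the first two figures.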

Results on scene text recognition

Task aims to read scene text directly from image
Evaluated in two settings
Prediction is correct if caption is exact match to ground-truth
Model evaluated on 6 standard benchmarks
TextCaps-fine-tuned captioning model achieves 89.9 accuracy
GIT achieves 92.9, surpassing the prior state of the art of 91.9

Analysis

Constructed two smaller pre-training datasets
Used 30 epochs for small-scale datasets
Model is named Huge
Model variants replace the image encoder with ViT-B/16 and ViT-L/14
Performance drops with 0.8B data
All model variants benefit from more pre-training data
Difficult to effectively scale up text decoder

Scene text in pre-training data.

15% of CC12M images contain scene text descriptions
31% of downloaded images contain scene text descriptions
OCR result must be longer than 5 characters to be considered matched
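
The matching rule above can be sketched as a simple filter. This is an illustration under assumptions: the function names are hypothetical, the OCR strings are taken as given (no OCR engine is called), and case-insensitive substring matching is an assumed definition of "matched".

```python
from typing import List, Tuple

MIN_OCR_LENGTH = 5  # from the text: the OCR result must be longer than 5 characters


def caption_describes_scene_text(caption: str, ocr_results: List[str]) -> bool:
    """True if any sufficiently long OCR string appears in the caption."""
    caption_lower = caption.lower()
    for text in ocr_results:
        candidate = text.strip().lower()
        if len(candidate) > MIN_OCR_LENGTH and candidate in caption_lower:
            return True
    return False


def scene_text_ratio(samples: List[Tuple[str, List[str]]]) -> float:
    """Fraction of (caption, ocr_results) pairs whose caption describes scene text."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(caption_describes_scene_text(caption, ocr) for caption, ocr in samples)
    return hits / len(samples)
```

The length threshold filters out short OCR strings that would match captions by coincidence.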

Conclusion

Design and train a generative model to map input image to associated text description
Model achieves new state-of-the-art performance on image/video captioning and question answering tasks
Model surpasses human performance on TextCaps
Model used to predict label name directly
Limitations: the generated caption is difficult to control, and in-context learning without parameter updates remains unresolved
Model improves performance and is better suited to helping visually impaired people
Data preprocessing is faster than training and is overlapped with GPU training
Model is fine-tuned with 10 epochs, batch size of 512 and learning rate of 2.5e-6
Model achieves competitive performance with existing approaches
Model can identify novel objects without object tags as network input
Model is knowledgeable and can produce diverse and informative captions
Model is used to caption scene text and recognize novel concepts
